base:speedcode
— | base:speedcode [2015-04-17 04:33] (current) – created - external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== Speedcode a.k.a. Loop Unrolling ====== | ||
+ | |||
+ | Written by Cruzer/CML | ||
+ | |||
+ | ===== Intro ===== | ||
+ | |||
+ | One of the earliest optimization tricks invented was loop unrolling, a.k.a. speedcode. It was probably first used to fit as many raster splits as possible on the same line, and later to break DYCP records, etc. The idea is that instead of a loop, you " | ||
+ | |||
+ | This loop clears 10 chars and takes 103 cycles to do it: | ||
+ | |||
+ | < | ||
        lda #0
        ldx #9
loop:   sta screen,x
        dex
        bpl loop
+ | </ | ||
+ | |||
+ | If we unroll the loop it can be done in only 42 cycles: | ||
+ | |||
+ | < | ||
        lda #0
        sta screen+0
        sta screen+1
        sta screen+2
        sta screen+3
        sta screen+4
        sta screen+5
        sta screen+6
        sta screen+7
        sta screen+8
        sta screen+9
+ | </ | ||
+ | |||
+ | This might be counterintuitive at first, since the latter piece of code is bigger, but the speed comes from the fact that each store is executed exactly once with no loop overhead, whereas the first version runs dex and bpl on each of its 10 iterations and pays an extra cycle for every indexed store. The drawback is that unrolled code takes up a lot of memory, but fortunately a lot of speedcode fits in 64K, so there is no need to worry about that for now. Another disadvantage is that it's harder to write and to read afterwards. | ||
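
The two cycle counts can be checked by adding up the standard 6502 instruction timings. The Python sketch below does that arithmetic and also compares the code sizes. It assumes the branch does not cross a page boundary and that no cycles are stolen by the VIC (bad lines, sprites); the function names are made up for this illustration.

```python
# Standard 6502 timings and sizes for the instructions used: (cycles, bytes).
LDA_IMM = (2, 2)      # lda #0
LDX_IMM = (2, 2)      # ldx #9
STA_ABS_X = (5, 3)    # sta screen,x (indexed stores always take 5 cycles)
STA_ABS = (4, 3)      # sta screen+n
DEX = (2, 1)
BPL_TAKEN = 3         # branch taken, same page (assumption)
BPL_NOT_TAKEN = 2     # branch falls through on the last iteration
BPL_SIZE = 2


def looped_clear(count):
    """Return (cycles, bytes) for the looped version clearing `count` chars."""
    cycles = LDA_IMM[0] + LDX_IMM[0]
    cycles += count * (STA_ABS_X[0] + DEX[0])
    cycles += (count - 1) * BPL_TAKEN + BPL_NOT_TAKEN
    size = LDA_IMM[1] + LDX_IMM[1] + STA_ABS_X[1] + DEX[1] + BPL_SIZE
    return cycles, size


def unrolled_clear(count):
    """Return (cycles, bytes) for the unrolled version clearing `count` chars."""
    cycles = LDA_IMM[0] + count * STA_ABS[0]
    size = LDA_IMM[1] + count * STA_ABS[1]
    return cycles, size


print(looped_clear(10))    # (103, 10)
print(unrolled_clear(10))  # (42, 32)
```

So for this example the unrolled version is roughly 2.5 times faster and a bit more than 3 times bigger, which is the trade-off in a nutshell.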
+ | |||
+ | Here's another, slightly more advanced example: an 8x8 plasma, which is not only ugly but also slow because it relies on looping. This means it can be no bigger than 20x20 chars if it has to run in one frame. | ||
+ | |||
+ | [[8x8-plasma-looped|{{: | ||
+ | |||
+ | [[8x8-plasma-looped|Sourcecode for 8x8 Plasma Without Speedcode]] | ||
+ | |||
+ | In the coming chapters I will use the same routine to show some different ways of optimizing it. | ||
+ | |||
+ | ===== The slave method ===== | ||
+ | |||
+ | The most braindead way of making speedcode is to type it all in by hand, which is how I did it back in the good old lamerdays. :-) This took quite some time, but with help from copy/paste, a little exercise and some nice pumping music on the ghettoblaster, | ||
+ | |||
+ | Here is the plasma routine, now with handcoded speedcode. As you can see, this had a drastic effect on the size that could be rendered in one frame, which went from 20x20 chars to a full 40x25 char screen. | ||
+ | |||
+ | [[8x8-plasma-slave-speedcode|{{: | ||
+ | |||
+ | [[8x8-plasma-slave-speedcode|8x8 Plasma with " | ||
+ | |||
+ | ===== Making the computer code for you ===== | ||
+ | |||
+ | Typing it all by hand quickly became boring, so the next step was to automate the process. The first idea I had was to make a Basic program to generate the code. This was quite an improvement, | ||
+ | |||
+ | [[8x8-plasma-basic|8x8 Plasma with Basic generated speedcode]] | ||
+ | |||
+ | |||
+ | ===== Scripting/ | ||
+ | |||
+ | This is something I haven' | ||
+ | |||
+ | The disadvantage of this and the other methods mentioned above is that the speedcode takes up a lot of space on the disk, and therefore takes a long time to load. It's also very static, since you only have one version of the effect, with no option of changing it at runtime. | ||
+ | |||
+ | You could of course make a runtime speedcode updater to change the params in the speedcode, but that's almost as advanced as a runtime generator, and it only helps in cases where the code repeats in predictable patterns. If there' | ||
+ | |||
+ | So in my opinion scripting is only good for proof of concept, except maybe in rare cases where the code is so advanced that a runtime code generator would take longer to run than loading the speedcode from disk. | ||
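
For readers who have not tried the scripting approach, here is a minimal sketch of what such a script might look like, written in Python rather than in an assembler's built-in scripting language. It only prints the source text for the unrolled clear routine from the intro; the function name and the `screen` label are assumptions for illustration, and this is not the script used for the plasma linked in this section.

```python
def emit_unrolled_clear(label, count, value=0):
    """Return assembly source that stores `value` to `count` consecutive bytes."""
    lines = [f"        lda #{value}"]
    for offset in range(count):
        # One store per byte: this is the unrolled loop body.
        lines.append(f"        sta {label}+{offset}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Paste the output into the source file, or write it to an include file.
    print(emit_unrolled_clear("screen", 10), end="")
```

The generated text still has to be assembled and loaded from disk along with everything else, which is exactly the size drawback described above.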
+ | |||
+ | [[8x8-plasma-scripted|8x8 Plasma With Scripted Speedcode]] | ||
+ | |||
+ | ===== Runtime Code Generators ===== | ||
+ | |||
+ | And now for the real way of doing it - on the C64, in machine code. That way you only need to load the small generator routine, which means faster loading time, and space for more parts on the disk. It also means that you can more easily switch between different variants of the effect, or even fit several different effects into the memory at once, as seen in onefilers like Dawnfall/ | ||
+ | |||
+ | The disadvantage is that it's harder to do complex logic in assembly than in a high level language. But it isn't that complicated actually - basically you just need to copy the same piece of code out to memory a number of times, with a little change for every iteration. The changes can be applied either by calculating stuff, e.g. multiplying the X and Y positions with sine spreadings, or incrementally, | ||
+ | |||
+ | Let's look at an algorithm for generating our simple char plasma: | ||
+ |   * For all Y positions (lines) | ||
+ |     * Init params/code sources for the current line | ||
+ |     * Copy init code to destination | ||
+ |     * For all X positions on the line | ||
+ |       * Copy plasmer chunk to destination | ||
+ |       * Update sine load addresses + store address in the code source | ||
+ |     * Update sine load addresses in init chunk | ||
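
To make the control flow of the algorithm concrete, here is a model of it in Python that produces the speedcode as raw bytes. Everything specific in it is an assumption for illustration: the table addresses, the one-instruction init chunk (`ldx abs`) and the two-instruction plasma chunk (`lda abs,x` / `sta abs`) are invented and much simpler than the real plasma. Only the structure follows the list above: copy a chunk, then bump the addresses in the source chunk before copying it again.

```python
# Hypothetical addresses and chunk layouts, chosen only for this sketch.
SCREEN = 0x0400
SINE_X = 0x2000       # assumed table read by the per-char chunk
SINE_Y = 0x2100       # assumed table read by the per-line init chunk
COLS, ROWS = 40, 25


def word(addr):
    """16-bit address as [low byte, high byte]."""
    return [addr & 0xFF, addr >> 8]


def add_to_operand(chunk, offset, delta):
    """Add `delta` to the little-endian 16-bit operand at chunk[offset]."""
    value = (chunk[offset] | (chunk[offset + 1] << 8)) + delta
    chunk[offset] = value & 0xFF
    chunk[offset + 1] = (value >> 8) & 0xFF


def generate():
    out = bytearray()
    init = bytearray([0xAE] + word(SINE_Y))            # ldx SINE_Y
    for row in range(ROWS):
        # Init the code source for the current line.
        chunk = bytearray([0xBD] + word(SINE_X)        # lda SINE_X,x
                          + [0x8D] + word(SCREEN + row * COLS))  # sta screen
        out += init                                    # copy init code
        for _ in range(COLS):
            out += chunk                               # copy plasma chunk
            add_to_operand(chunk, 1, 1)                # next sine load address
            add_to_operand(chunk, 4, 1)                # next store address
        add_to_operand(init, 1, 1)                     # next line's init address
    out.append(0x60)                                   # rts
    return out


code = generate()
print(len(code))  # 6076 bytes of speedcode from a few lines of generator
```

On the C64 the same loops would be written in assembly, with the chunks living in memory as ordinary code that the generator copies and patches.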
+ | |||
+ | [[8x8-plasma-codegen|8x8 Plasma w/ Generated Speedcode]] - As you can see when assembling it, the code now only takes about $140 bytes, as opposed to over $3000 with the previous versions. And the init time isn't too bad either - about a third of a second, which definitely wouldn' | ||
+ | |||
+ | ===== Optimizing the Speedcode Generator ===== | ||
+ | |||
+ | If it takes too long to generate the code and you crave some more pace for your Edge of Disgrace-beating killerdemo, the generator can of course also be optimized. In our plasma case the speedcode is quite small and simple, which means the generator is already so fast that it would be hard to notice any improvement of the init time. But if we switch between some different variants of the effect while running, the pause gets more noticeable. So here's [[8x8-plasma-effect-switch|8x8 Plasma w/ Effect Switching]] and no optimizing. | ||
+ | |||
+ | **Using Speedcode to Generate Speedcode** | ||
+ | |||
+ | The code generating loops can of course be unrolled like any other loop. This doesn' | ||
+ | |||
+ | Here's the [[8x8-plasma-optimized-codegen|8x8 Plasma w/ Optimized Code Generator]], | ||
+ | |||
+ | **Simplifying the Mess with Scripting** | ||
+ | |||
+ | Before complicating the code further, let's simplify it a bit with some of KickAss' | ||
+ | |||
+ | The main pseudocommand is ": | ||
+ | |||
+ | I guess it might be possible to simplify it further. Basically, all we need is to define the code segments and how to change them for each iteration, and from this information it should theoretically be possible to generate a code generator. But the danger of a more generic approach is always that it becomes slower, bigger and harder to fine-tune. With the approach above I haven' | ||
+ | |||
+ | **Gaining Further Speed with Code Updating** | ||
+ | |||
+ | The pause between effect variants could be reduced further by adding a code updater that only changes the parts that need to change, instead of regenerating all the code every time. This only works if the code has exactly the same structure for every variant of the effect, but luckily it does in our case. | ||
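
A small Python model of the idea, under the same kind of assumptions as before: the chunk layout (`lda abs,x` followed by `sta abs`) and all addresses are hypothetical. Because every chunk has the same size, the updater can step through the existing code with a fixed stride and rewrite only the load operands, leaving opcodes and store addresses untouched.

```python
CHUNK_SIZE = 6      # lda abs,x (3 bytes) + sta abs (3 bytes)
LOAD_OPERAND = 1    # offset of the lda operand inside a chunk


def build(table, screen, count):
    """Generate `count` chunks from scratch, followed by rts."""
    code = bytearray()
    for i in range(count):
        load, store = table + i, screen + i
        code += bytes([0xBD, load & 0xFF, load >> 8,
                       0x8D, store & 0xFF, store >> 8])
    code.append(0x60)
    return code


def update_table(code, new_table, count):
    """Patch only the load operands in place: 2 bytes written per chunk."""
    for i in range(count):
        pos = i * CHUNK_SIZE + LOAD_OPERAND
        addr = new_table + i
        code[pos] = addr & 0xFF
        code[pos + 1] = addr >> 8


code = build(0x2000, 0x0400, 40)
update_table(code, 0x2200, 40)
print(code == build(0x2200, 0x0400, 40))  # True
```

The updater writes a third of the bytes the generator does, and it skips the copying logic entirely, which is where the time saving comes from.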
+ | |||
+ | The code generator can be reused if we add two different modes - " | ||
+ | [[8x8-plasma-updater|8x8 Plasma w/ Code Updater]] | ||
+ | |||
+ | ===== What to do if running short on memory? ===== | ||
+ | |||
+ | For some effects, especially ones that take more than one frame to complete, or that require lots of lookup-tables, | ||
+ | |||
+ | It's always good practice to keep a meticulous memory map at the top of your source code, and to keep it updated, e.g. by regularly taking a tour of the memory with a monitor and checking that things take up as much space as you think. Also remember that memory can be reused again and again - for instance there might be some code and data used for initing, which isn't needed when the effect is running. So there' | ||
base/speedcode.txt · Last modified: 2015-04-17 04:33 by 127.0.0.1