Loops vs unrolled loops
We are often taught to unroll loops to save the overhead a loop brings with it: decrementing a counter and taking another branch. But there are situations where a loop performs faster, because we can set up values directly via code modification. A good example is a line algorithm.
Here we need to subtract, for example, dx from A, and in case of an underflow add dy to A and advance the x-position. On every step in y-direction we also want to plot.
This could look like:
back
         tax
         lda pix
         ora (dst),y
         sta (dst),y
         dey
         bmi out
         txa
         sbc dx
         bcs back
move_x
         adc dy
         asl pix
         bcc back
         tax
         lda #$80
         eor dst
         sta dst
         bmi back+1
         inc dst+1
         bne back+1
out
         rts
Now if we unroll the main loop, we would get:
back
         tax             ;2
         lda pix         ;3
         ora (dst),y     ;5 (+1 on page crossing)
         sta (dst),y     ;6
         dey             ;2
         ...
         txa             ;2
         sbc dx          ;3
         bcs back        ;2 (3 if taken)
move_x
This costs 25 cycles per step if we neglect the cycles needed for moving in x-direction. Now let us write the same thing as a loop again, but set up dst, dx, pix and dy directly in the code:
back
         tax             ;2
pix      lda #$00        ;2
dst1     ora $2000,y     ;4 (+1 on page crossing)
dst2     sta $2000,y     ;5
         dey             ;2
         bmi out         ;2 (not taken)
         txa             ;2
dx       sbc #$00        ;2
         bcs back        ;3 (taken)
move_x
As you see, all of a sudden we need 24 cycles per run, so the loop is faster! Why not set up the immediate values within the speedcode, you might think? Well, that would cost at least another 4 cycles per value for every unrolled copy of the loop body, while the loop costs us just 4 cycles per value once, which is pretty fair.
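The one-time setup could be sketched as follows. It relies on the labels pix, dst1, dst2 and dx from the loop above, so each operand sits one byte after its label. The variables start_pix, delta_x and start_dst are hypothetical zeropage locations holding the starting values; they are assumptions for illustration and not part of the original code.

```asm
setup
         lda start_pix   ;hypothetical: initial pixel mask
         sta pix+1       ;operand of lda #$00, 4 cycles, paid once
         lda delta_x     ;hypothetical: value subtracted per step
         sta dx+1        ;operand of sbc #$00
         lda start_dst   ;hypothetical: lowbyte of target address
         sta dst1+1      ;lowbyte of ora $2000,y
         sta dst2+1      ;lowbyte of sta $2000,y
         lda start_dst+1 ;highbyte of target address
         sta dst1+2
         sta dst2+2
```

In unrolled speedcode each of these stores would have to be repeated for every copy of the loop body, which is where the extra cost comes from.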
What is more, the loop variant of our code gives us better access to illegal opcodes, as some of them work with immediate values only, like the SBX command:
back
pix      lda #$00        ;2
dst1     ora $2000,y     ;4 (+1 on page crossing)
dst2     sta $2000,y     ;5
         dey             ;2
         bmi out         ;2 (not taken)
         txa             ;2
dx       sbx #$00        ;2
         ;now we get the value of A transferred to X for free after subtraction
         ;and A is free again for other purposes
         bcs back        ;3 (taken)
move_x
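For readers unfamiliar with SBX: it stores (A AND X) minus the immediate value in X, ignores the incoming carry, and sets the carry like CMP does. After txa both registers hold the same value, so the AND changes nothing. A rough equivalent of txa / sbx using legal opcodes only, meant as a sketch (unlike SBX it also changes A and the V flag):

```asm
         txa
         sec             ;sbx ignores the incoming carry
         sbc #$00        ;same immediate as the sbx operand
         tax             ;sbx leaves the result in x directly
```

That is 8 cycles against the 4 cycles of txa plus sbx.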
We now end up with 22 cycles per run and just a few bytes of code. So as you see, sometimes it is worth trying to optimize a loop before brainlessly unrolling everything.
Now that our code has shrunk to a reasonable size, one could also think of copying it to the zeropage once. This speeds up the code modification that happens when setting up the loop and when executing the code in move_x.
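Copying the loop could be sketched like this. The labels loop_start, loop_end and zp_loop are hypothetical. The sketch assumes the block is at most 128 bytes long and was assembled so that its inner labels (pix, dst1, dst2, dx) resolve to their zeropage addresses.

```asm
         ldx #loop_end-loop_start-1
copy
         lda loop_start,x ;hypothetical: where the loop is stored
         sta zp_loop,x    ;hypothetical: target in zeropage
         dex
         bpl copy         ;works for up to 128 bytes
```

A store such as sta pix+1 then takes 3 cycles instead of 4, and a read-modify-write such as asl pix+1 takes 5 cycles instead of 6.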