Loops vs unrolled loops

We are often taught to unroll loops to save the overhead a loop brings: decrementing a counter and taking another branch. But there are situations where a loop performs way faster, because we can set up values directly in the code via code modification. A good example is a line algorithm.

Here we need to subtract e.g. dx from A on every step and, in case of an underflow, add dy to A and advance the x-position. On every step in y-direction we also want to plot.

This could look like:

back
         tax
         lda pix
         ora (dst),y
         sta (dst),y
         dey
         bmi out
         txa
         sbc dx
         bcs back

move_x
         adc dy

         asl pix
         bcc back

         tax
         lda #$80
         eor dst
         sta dst
         bmi back+1
         inc dst+1
         bne back+1
out
         rts

Now if we unroll the main loop, we would get:

back
         tax
         lda pix
         ora (dst),y
         sta (dst),y
         dey
         ...
         
         txa
         sbc dx
         bcs back
move_x

This means we would spend 25 cycles per step if we neglect the cycles needed for moving in x-direction. Now let us write the same thing as a loop again, but this time set up dst, dx, pix and dy directly in the code:

back
         tax
pix      lda #$00
dst1     ora $2000,y
dst2     sta $2000,y
         dey
         bmi out

         txa
dx       sbc #$00
         bcs back
move_x
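
One way to arrive at the cycle counts is the tally below. It is a sketch under some assumptions: pix, dst and dx live in zeropage in the unrolled version, no page boundary is crossed, the unrolled step has no bmi and its closing bcs falls through, while the loop's bcs is taken.

                                   unrolled  loop
         tax                           2       2
         lda  zp / #imm                3       2
         ora  (zp),y / abs,y           5       4
         sta  (zp),y / abs,y           6       5
         dey                           2       2
         bmi  (not taken)              -       2
         txa                           2       2
         sbc  zp / #imm                3       2
         bcs                           2       3
                                      --      --
                                      25      24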

As you can see, all of a sudden we need only 24 cycles per run, so the loop is faster! Why not set up the immediate values within the speedcode as well, you might think? Well, there every unrolled copy has to be patched, so at a minimum you waste another 4 cycles (one sta to an absolute address) per copy and per value to be set up, while in our case we pay those 4 cycles just once per value, which is pretty fair.
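
The one-time setup could look like the following sketch. The names start_pix, delta_x and screen are hypothetical and only serve as illustration; the labels pix, dx, dst1 and dst2 are those of the loop above. A dy operand in move_x would be patched the same way.

setup
         lda start_pix      ;hypothetical: initial pixel mask
         sta pix+1          ;operand of lda #$00
         lda delta_x        ;hypothetical: dx of this line
         sta dx+1           ;operand of sbc #$00
         lda #<screen       ;hypothetical: address of the target column
         sta dst1+1         ;low byte of ora $2000,y
         sta dst2+1         ;low byte of sta $2000,y
         lda #>screen
         sta dst1+2         ;high bytes of both operands
         sta dst2+2

Each sta here writes to an absolute address and costs 4 cycles, paid once per line instead of once per step.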

Even better, the loop variant of our code gives us easier access to illegal opcodes, as some of them work with immediate values only, like the SBX command:

back
pix      lda #$00
dst1     ora $2000,y
dst2     sta $2000,y
         dey
         bmi out

         txa
dx       sbx #$00         ;X = (A and X) - value, so the result lands in X for free
                          ;and A is free again for other purposes
         bcs back
move_x

We now end up with 22 cycles per run, as the tax is no longer needed, and just a few bytes of code. So as you can see, sometimes it is worth trying to optimize a loop before brainlessly unrolling everything.

Now that our code has shrunk to a reasonable size, one could also think of copying it to zeropage once, thus speeding up the code manipulation that happens when setting up the loop and when executing the code in move_x.
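
Such a copy could be sketched as follows. The labels loop_code, loop_end and zp_loop are hypothetical, and the loop has to be assembled for its zeropage address (how to do that depends on the assembler), so that labels like pix+1 resolve to zeropage. The countdown copy works for up to 128 bytes.

         ldx #loop_end-loop_code-1
copy
         lda loop_code,x    ;loop as stored in the program
         sta zp_loop,x      ;its new home in zeropage
         dex
         bpl copy

Afterwards a patch like sta pix+1 takes 3 instead of 4 cycles, and a read-modify-write instruction such as an asl on the pixel mask takes 5 instead of 6.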