base:lines

# Lines

by Bitbreaker / Oxyron ^ Arsenic ^ Nuance

Lines can be either calculated using the Bresenham's algorithm, but also by using precalculated slopes. When the lines being used are all short, the number of needed slopes is rather low and memory consumption is less than one might expect. There's also other ways of building lines, like with precalculated tiles or such.

Here's the sourcecode to play around and spoil the experience of doing it all on your own

Note: To start your compiled line.prg, type SYS 14336 (= \$3800). Program doesn't start automatically.

# Bresenham

Usually one has to split linedrawing into 4 cases, for that one has to swap coordinates first to achieve that the line end is always lower in y position than the line start. That means we then have lines that will be drawn from bottom to top of the screen. By comparing the dx (x2-x1) and dy (y2-y1) values we can find out if we draw a steep (abs(dx) < dy) or a flat slope (abs(dx) > dy). Thus we end up with those 4 cases:

• flat slope with decrementing y and incrementing on x
• flat slope with decrementing y and decrementing on x
• steep slope with decrementing y and incrementing on x
• steep slope with decrementing y and decrementing on x

The versions for decrementing/incrementing x only differ in the x handling, so after all we have two algorithms that we use, one specialized on steep slopes and one for the flat slopes.

So for steep slopes we always take one y step per pixel, but when our error value overflows we take also a step in x direction. For that we use a mask to cover 8 x steps. So a stepping for e.g. one pixel to the right is just a shift on the mask with lsr mask. The good thing is, when the pixel falls out on the right side, the carry will be set, and thus we know when we bump on the end of our column that we draw in, no matter if we draw on a hires screen or a 16×16 charset grid. In that case we have to do further steps, like adjusting the target where we write the pixels to, as well as resetting the mask to \$80 in that case.

## Pixel collecting on flat slopes

As for the flat slopes we take a x step for every pixel, but only occasionally do a step in y direction. We could of course add each pixel immedeatedly to the respective memory location, but we could also just remmeber the pixels first and only write them through if we either change column or y-position. Therefore we do the following (x is incrementing in this case):

First, we split up the line into full columns that it covers, the end chunk needs some extra handling. Now we take one column by another to be drawn over. If we start with the leftmost pixel in a column, we start with a mask value of \$ff. We keep that mask until we either do a step in y-direction or leave the column. If we leave the column without having had a y-step, we write out \$ff and have indeed set 8 pixels with one write. If we have a y change after say 4 pixels, we take the mask, and do a and #\$f0 on it and write through 4 pixels. Sounds complex, but as soon as we unroll the code and handle 8 pixels by such an unrolled loop, the initial mask values and values for the and operation are fix.

Means we would have code like this:

lda err
sbc dy
pixel0
bcc step0  ;underrun @ x & 7 == 0 -> fix y
sbc dy
pixel1
bcc step1  ;underrun @ x & 7 == 1 -> fix y
sbc dy
pixel2
bcc step2  ;underrun @ x & 7 == 2 -> fix y
sbc dy
pixel3
bcc step3  ;underrun @ x & 7 == 3 -> fix y
sbc dy
pixel4
bcc step4  ;underrun @ x & 7 == 4 -> fix y
sbc dy
pixel5
bcc step5  ;underrun @ x & 7 == 5 -> fix y
sbc dy
pixel6
bcc step6  ;underrun @ x & 7 == 6 -> fix y
sbc dy
pixel7
bcc step7  ;underrun @ x & 7 == 7 -> fix y

;no underrun after 8 pixels, write out \$ff
lda #\$ff
sta (dst),y
...

step0
sta err
and #\$80      ;mask out pixels that need to be written
ora (dst),y
sta (dst),y
lda err       ;restore err
dey           ;take y step
ldx #\$7f
jmp pixel1
step1
sta err
and #\$c0      ;mask out pixels that need to be written
ora (dst),y
sta (dst),y
lda err       ;restore err
dey           ;take y step
ldx #\$3f
jmp pixel2
step2
...

Now when having all unrolled you see that there are several points for optimizations. The case for step0 can be optimized to the following, as in this case always just a single pixel will be set:

tax
lda #\$80      ;mask out pixels that need to be written
ora (dst),y
sta (dst),y
txa           ;restore err
dey           ;take y step
ldx #\$7f
jmp pixel1

Furthermore one can save on the jump and duplicate code:

tax
lda #\$80      ;mask out pixels that need to be written
ora (dst),y
sta (dst),y
txa           ;restore err
dey           ;take y step
ldx #\$7f
sbc dy        ;check for underrun
bcs pixel1    ;back to main loop or take another step in y-direction
step1

Third step on optimization is, that one can in that case aggregate the adc dx and sbc dy:

tax
lda #\$80      ;mask out pixels that need to be written
ora (dst),y
sta (dst),y
txa           ;restore err
dey           ;take y step
ldx #\$7f
sbc dxdy      ;check for underrun. dxdy = dx + 1 - dy (carry is cleared and dy will subtract one too much then again)
bcs pixel1    ;back to main loop or take another step in y-direction
step1

Now seeing that kind of pixel collection one might even go one step further. Actually we can maximum draw 8 pixels until we bump upon a columnchange. That means we can only have 8+7+6+5+4+3+2+1 = 36 different patterns, so why not working out all permutations in code and by that even save on that masking stuff?

f8_new_column_y
ora (dst),y
sta (dst),y
dey
jmp +

;entry point for no y change
f8_new_column
ora (dst),y
sta (dst),y
+
tya
eor #\$80
tay
bpl +
dec dst+1
+
;... we would also need to check how many full column need to be handled
;... so here'd bea good place to do so

;8 pixels to go
f8_entry
txa
sbc dy
bcc f8_exit_1
sbc dy
bcc f8_exit_2
sbc dy
bcc f8_exit_3
sbc dy
bcc f8_exit_4
sbc dy
bcc f8_exit_5
sbc dy
bcc f8_exit_6
sbc dy
bcc f8_exit_7
sbc dy
bcc f8_exit_8
tax
lda #\$ff
jmp f8_new_column+2   ;nothing to ora, as all pixels will be set anyway!
f8_exit_8
tax
lda #\$ff
jmp f8_new_column_y+2 ;same as above
f8_exit_2
tax
lda #\$03
jmp f6_entry_y
f8_exit_3
tax
lda #\$07
jmp f5_entry_y
f8_exit_4
tax
lda #\$0f
jmp f4_entry_y
f8_exit_5
tax
lda #\$1f
jmp f3_entry_y
f8_exit_6
tax
lda #\$3f
jmp f2_entry_y
f8_exit_7
tax
lda #\$7f
jmp f1_entry_y
f8_exit_1
tax
lda #\$01        ;perfect, no jmp needed, just slip through in this case

f7_entry_y
ora (dst),y
sta (dst),y
dey
;7 pixels to go
f7_entry
txa
bcc f7_exit_1
sbc dy
bcc f7_exit_2
sbc dy
bcc f7_exit_3
sbc dy
bcc f7_exit_4
sbc dy
bcc f7_exit_5
sbc dy
bcc f7_exit_6
sbc dy
bcc f7_exit_7
tax
lda #\$fe
jmp f8_new_column
f7_exit_7
tax
lda #\$fe
jmp f8_new_column_y
f7_exit_2
tax
lda #\$06
jmp f5_entry_y
f7_exit_3
tax
lda #\$0e
jmp f4_entry_y
f7_exit_4
tax
lda #\$1e
jmp f3_entry_y
f7_exit_5
tax
lda #\$3e
jmp f2_entry_y
f7_exit_6
tax
lda #\$7e
jmp f1_entry_y
f7_exit_1
tax
lda #\$02

f6_entry_y
ora (dst),y
sta (dst),y
dey
;6 pixels to go
f6_entry
txa
bcc f6_exit_1
sbc dy
bcc f6_exit_2
sbc dy
bcc f6_exit_3
sbc dy
bcc f6_exit_4
sbc dy
bcc f6_exit_5
sbc dy
bcc f6_exit_6
tax
lda #\$fc
jmp f8_new_column
f6_exit_6
tax
lda #\$fc
jmp f8_new_column_y

f6_exit_2
tax
lda #\$0c
jmp f4_entry_y
f6_exit_3
tax
lda #\$1c
jmp f3_entry_y
f6_exit_4
tax
lda #\$3c
jmp f2_entry_y
f6_exit_5
tax
lda #\$7c
jmp f1_entry_y
f6_exit_1
tax
lda #\$04

f5_entry_y
ora (dst),y
sta (dst),y
dey
;5 pixels to go
f5_entry
txa
bcc f5_exit_1
sbc dy
bcc f5_exit_2
sbc dy
bcc f5_exit_3
sbc dy
bcc f5_exit_4
sbc dy
bcc f5_exit_5
tax
lda #\$f8
jmp f8_new_column
f5_exit_5
tax
lda #\$f8
jmp f8_new_column_y

f5_exit_2
tax
lda #\$18
jmp f3_entry_y
f5_exit_3
tax
lda #\$38
jmp f2_entry_y
f5_exit_4
tax
lda #\$78
jmp f1_entry_y
f5_exit_1
tax
lda #\$08

f4_entry_y
ora (dst),y
sta (dst),y
dey
;4 pixels to go
f4_entry
txa
bcc f4_exit_1
sbc dy
bcc f4_exit_2
sbc dy
bcc f4_exit_3
sbc dy
bcc f4_exit_4
tax
lda #\$f0
jmp f8_new_column
f4_exit_4
tax
lda #\$f0
jmp f8_new_column_y

f4_exit_2
tax
lda #\$30
jmp f2_entry_y
f4_exit_3
tax
lda #\$70
jmp f1_entry_y
f4_exit_1
tax
lda #\$10

f3_entry_y
ora (dst),y
sta (dst),y
dey
;3 pixels to go
f3_entry
txa
bcc f3_exit_1
sbc dy
bcc f3_exit_2
sbc dy
bcc f3_exit_3
tax
lda #\$e0
jmp f8_new_column
f3_exit_3
tax
lda #\$e0
jmp f8_new_column_y

f3_exit_2
tax
lda #\$60
jmp f1_entry_y
f3_exit_1
tax
lda #\$20

f2_entry_y
ora (dst),y
sta (dst),y
dey
;2 pixels to go
f2_entry
txa
bcc f2_exit_1
sbc dy
bcc f2_exit_2
tax
lda #\$c0
jmp f8_new_column
f2_exit_2
tax
lda #\$c0
jmp f8_new_column_y
f2_exit_1
tax
lda #\$40

f1_entry_y
ora (dst),y
sta (dst),y
dey
;1 pixel to go
f1_entry
txa
bcc f1_exit_1
tax
lda #\$80
jmp f8_new_column
f1_exit_1
tax
lda #\$80
jmp f8_new_column_y

If we precalculate the linecoordinates beforehand, we can now even give each line a flag if it collides with any other line, else we can even forgo on the ora (dst),y component! depending on the first pixel x-position we just enter that code depending on x & 7 on either f8_entry, f7_entry, f6_entry, …

## Steep slopes

Examples for the steep slopes can be found here. Also one can unroll things in the same manner as for the flat slopes regarding permutations, just that we aim for different run lengths and not for patterns.

### With a loop

One can use the advantage of some illegal opcodes to make a looped version nearly as fast as an unrolled variant. In fact the inner loop here is one cycle faster then the method being used in The Masque. However the overhead that arises from shifting the pixelmask and advancing the pointers dst1 + dst2 is wasting more cycles then with an unrolled line drawing routine. So if you want it fast but also need to save memory, this is what you want:

back_
back
pix      lda #\$00
dst1     ora \$2000,y
dst2     sta \$2000,y
dey
bpl out          ;anoying, but needed in a loop variant

txa
dx       sbx #\$00
bcs back
move_x
txa
dy       sbx #\$00         ;add dy (by subtracting -dy)
lda (dst1),y
bcc dst2         ;column change? no -> do it the short way
lda #\$80
eor dst1+1
sta dst1+1
sta dst2+1
bmi back_
inc dst1+2
inc dst2+2
lda #\$01
bne back+2

Needless to say that this code is best placed into zeropage for maximum performance. The addition of dy and the next upcoming subtract can again be aggregated and additional cycles can be saved that way.

### Unrolled

Unrolling the above code brings two advantages: One does not have to check for the underrun of Y, and one can use absolute adressing. The tradeoff: sbx can't be used anymore and one needs to use sbc dx/adc dy to calculate the slope, what wastes more cycles. Also one needs to find the sweet spot between saved cycles and memory consumption.

-
tax
lda #\$01
ora \$2780,y
sta \$2780,y
dey
txa
sbc dx
bcs -          ;2+2+4+5+2+2+3+3 = 23 cycles per pixel
-
tax
lda #\$02
ora \$2780,y
sta \$2780,y
dey
txa
sbc dx
bcs -
...

Memory consumption is 2176 bytes (17 bytes * 8 pixels * 16 columns). However this method forces us to do an extra handling for the last x-position on each line, to land on the right Y value.

If you can afford a lot of memory you might think of a fully unrolled variant that unrolles each possible column:

tay
lda \$277f
sax \$277f
tya
sbc dx
bcs ++             ;ossom 18 cycles per pixel!
tay
txa
rol
bcs +
rol
jmp column15_y7e   ;change column, jump into unrolled loop of adjacent column
+
tax
++
tay
lda \$277e
sax \$277e
tya
sbc dx
bcs ++
tay
txa
rol
bcs +
rol
jmp column15_y7d
+
tax
++
...

This method would work on inverted graphics, means it does not plot but delete pixels, thus we can make use of the SAX command, but for that have to get rid of the y-index. However memory consumption is tremendous now: 24 * 16 * 128 = \$c000 bytes. However there's a lot of dead code there. Especially at the outer columns there's no need to unroll all 128 lines, as you object will rotate and draw within a circle, it is enough to only support those bytes that are within that circle, no lines will be drawn outside of it. So it would be around 4 / PI * \$c000 bytes.

If you dare to flip x and y axis on the 16×16 grid, also the following would be possible:

loop_14_bit1_y7
tax
lda #\$01
ora \$2707,y
sta \$2707,y
txa
sbc dy
bcc +
loop_14_bit1_y6
tax
lda #\$01
ora \$2706,y
sta \$2706,y
txa
sbc dy
bcc ++
loop_14_bit1_y5
tax
lda #\$01
ora \$2705,y
sta \$2705,y
txa
sbc dy
bcc +++
loop_14_bit1_y4
tax
lda #\$01
ora \$2704,y
sta \$2704,y
txa
sbc dy
bcc ++++
loop_14_bit1_y3
tax
lda #\$01
ora \$2703,y
sta \$2703,y
txa
sbc dy
bcc +++++
loop_14_bit1_y2
tax
lda #\$01
ora \$2702,y
sta \$2702,y
txa
sbc dy
bcc ++++++
loop_14_bit1_y1
tax
lda #\$01
ora \$2701,y
sta \$2701,y
txa
sbc dy
bcc +++++++
loop_14_bit1_y0
tax
lda #\$01
ora \$2700,y
sta \$2700,y
txa
sbc dy
bcc ++++++++
tax
tya
eor #\$80
tay
bpl loop_14_bit1_y7+1
jmp loop_12_bit1_y7+1

+
jmp loop_14_bit2_y6
++
jmp loop_14_bit2_y5
+++
jmp loop_14_bit2_y4
++++
jmp loop_14_bit2_y3
+++++
jmp loop_14_bit2_y2
++++++
jmp loop_14_bit2_y1
+++++++
jmp loop_14_bit2_y0
++++++++
tax
tya
eor #\$80
tay
bpl loop_14_bit2_y7+1
jmp loop_12_bit2_y7+1

loop_14_bit2_y7
tax
lda #\$02
ora \$2707,y
sta \$2707,y
txa
sbc dy
bcc +

;...

loop_12_bit1_y7
tax
lda #\$01
ora \$2607,y
sta \$2607,y
txa
sbc dy
bcc +

;...

Now you can unroll the loop for 16 (or better 8, if you choose to cover 2 rows by using Y & \$7f or Y | \$80) rows of chars and handle 8 lines per charline continously. Also the code needs to be unrolled per X shift. Whenever the pixelmask wraps around, one needs to subtract another 8 from Y, not too much of a compelx handling, but therefore with somewhat 20 cycles for setting a pixel. The flat version will therefore need some etxtra handling on changing a charrow, but as that is happening rather seldom it should not matter much.

# Run length slice

Seeing the depicted line, one can notice that there's either 2 or 3 pixels set. So when doing an abs(dy/dx) we get the minimum number of pixels being set. So if we have a steep line with dx = 19 and dy = 80 we would always draw either 4 or 5 consecutive pixels before a x-step would occur. However dividing is expensive, but one could stick to divide into cases that fit within a power of 2. The decision would work like this:

lda dx
asl
cmp dy
bcs do_1_or2
asl
cmp dy
bcs do_2_or3
asl
cmp dy
bcs do_4_or5
asl
cmp dy
bcs do_8_or9
...

As for the unrolled loop you could do something like:

!align 255,0
entry_9
lda d_pixm
ora (d_dst1),y
sta (d_dst1),y
dey
bmi dout
entry_8
lda d_pixm
ora (d_dst1),y
sta (d_dst1),y
dey
bmi dout
entry_7
lda d_pixm
ora (d_dst1),y
sta (d_dst1),y
dey
bmi dout
entry_6
lda d_pixm
ora (d_dst1),y
sta (d_dst1),y
dey
bmi dout
entry_5
lda d_pixm
ora (d_dst1),y
sta (d_dst1),y
dey
bmi dout
entry_4
lda d_pixm
ora (d_dst1),y
sta (d_dst1),y
dey
bmi dout
entry_3
lda d_pixm
ora (d_dst1),y
sta (d_dst1),y
dey
bmi dout
entry_2
lda d_pixm
ora (d_dst1),y
sta (d_dst1),y
dey
bmi dout
txa
dx_mul    sbx #\$00
bcc +         ;all done, or is another single one needed?
entry_1
lda d_pixm
ora (d_dst1),y
sta (d_dst1),y
dey
bmi dout
txa
dx_sin    sbx #\$00
bcs entry_1
+
bcc +

;change column
rol d_pixm
lda #\$80
eor d_dst1
sta d_dst1
bpl +
dec d_dst1+1
+
txa
tax
jmp_e     jmp dstart   ;will be modified to point to entry_* depending on how many pixel in a row need to be set

dout
rts

Now with the dependencies of the run length slice method in mind, one can also improve the flat line algorithm mentioned before again. Taking f8_exit_4 as an example, we can assume that we successfully drew 4 pixels. That means that 4 pixels is the minium or maximum pixels to draw. Thus we are safe to aggregate the sbc for the next upcoming three pixels by doing:

f8_exit_4
sbc dy          ;still no underrun will occur
tax
lda #\$0f
ora (dst),y
sta (dst),y
dey
txa
jmp f4_entry+9  ;3rd sbc there + branch

The case f8_exit_5 makes it even easier for us, as we can finish the column at the same time:

f8_exit_5
sbc dy
sbc dy
tax
lda #\$1f
ora (dst),y
sta (dst),y
dey
lda #\$e0
jmp f8_new_column

As you can see, we can aggregate the subtractions already beforehand and plot the next and last chunk within this column.

To some amount the runlength slice paradigm could also be applied to certain parts of the steep slopes. Imagine you calculate the lines within blocks of 8×8 pixels. On steep slopes thus, maximum 8 x steps can occur during 8 y steps. Thus, when our slope once leaves the current column to the left or right, on the next column one can forgo on the column checks, as the column will not be changed once more until the upper or lower border of the block is reached.

Splitting things up into a big bunch of algorithms sometimes look like a good idea as each algorithm can then be maxed out. But on the other hand all the decisions have to be done for each single line, and easily consume the few saved cycles of a special case that only happens each now and then. It is wise to keep track of the optimization one does by benchmarking each new try. Thus it is easy to decide if a change is wort the few cycles for lots of additional code, or if the changes even slow things down.

# Going fullscreen

Having lines in a 16×16 square is nice, but how's about going fullscreen finally? One might think that this means a lot of pain, for a bunch of good reasons:

• x-coordinates can have a delta bigger than 255, no one loves 16 bit calculations on a 8 bit machine
• addressing of blocks in a bitmap is more complex than within a 16×16 grid, where each column can be addressed linear
• clearing the whole bitmap is no option, there needs to be a better concept

Luckily, there's also a bunch of good solutions to all those bothers!

As for x-coordinates, lines usually do not get that big lengths if you keep your model fairly complex, if it would really happen, just cut the line into two halves and draw 2 shorter lines. This however only applies to dx/dy, x-coordinates still need to be handled with 9 bits on a 320×200 screen. But the hassle with those is less than expected, you'll see!

As for the other two problems we can take advantage of the fact, that we use a screen together with a bitmap. When ink and paper on screen are set with the background color, the corresponding block of the bitmap is made “invisible”, as both background and drawn content are displayed in the same color. So clearing can be done by just writing for e.g. \$00 to the whole screen to camouflage the bitmap content. When drawing lines we would however need to also set the screen to make our blocks appear again. Cumbersome on the one hand, but also a cool thing, as we have by that a map of dirty blocks, on which we can do further optimizations. So what we do is walking relatively on a screen and bitmap, for that we use 2 pointers. One for the screen and one for the bitmap. For each direction the pointers are manipulated respectively. Also we would now be able to give each line a different colour (but of course would need to avoid clashes).

Thus, when we leave a block and enter a new block, we can now make decisions while consulting our map. If we enter a used block, we will need to merge in our new content the classic way (by using ora), but if we are on a pristine block, we can simply write through our new pixels without the need of ora. However we need to wipe out the lines of the block that we don't use, as they might contain garbage from older renderings, that would become visible on “turning on” that char in the map.

## Charmode

Now we could even go a step further and move that concept of a map over to charmode (see image with visualized screen that represents the final image). Here we have an easier handling as on every new block that we start just need to advance the charset pointer by 8. Also we would just need a forth of the space a regular bitmap would consume. However we are limited to 256 chars only, but to reach that margin you need very huge and complex objects. The objects in Comaland use a maximum of 192 chars, so that there are another 63 chars left for the logo and a clear char can be reserved. Of course you are also free to work with several charsets and split the screen vertically. But then even more complexity applies. Better start with a single charset first.

So how would all this work? First, we take an empty charset and define char \$00 as an empty char, we fill the screen with \$00 and wipe out the first 8 bytes of our charset. Thus we start with an empty srceen.

Now we draw the first line, render it block by block (dx / 8 and dy / 8). Due to that, a lot of optimizations like run length encoding can be applied, in x- and now even in y-direction! (more on that later) Whenever we leave a block, we decide on how to render the next block and what area we need to clear. If we move from a pristine block to another pristine block, we can best case clear 8 lines in a row (in 50% of the cases this can be optimized to 8 bytes being cleared linearly!). On leaving a pristine block, we also place the current char number on the screen, then we advance the char pointer and the char number to be used. The screen pointer will be bumped in any case to be able to access teh next block info, as the screen also acts as map for used blocks. Sounds complex but actually the following code snippets look rather simple, right?

;check current block type
lda (screen,x)            ;assume x = 0
beq no_ora_needed
ora_needed

inc_screen_pointer
inc scr
bne *+4
inc scr+1

dec_screen_pointer
lda #\$ff
dcp scr       ;compare + force carry always set
bne *+4
dec scr+1

So when we observe this pic, the workflow would be like the following:

Draw 2 pixels in one go until we leave the block either at the top or right side. Upon leaving we recognize due to the position in our unrolled code that we rendered until y = 3, so there's 3 lines below to clear, upon entering the next block we recognize by the position in our unrolled code that we start with y = 3, means we have to clear the 5 lines to the bottom of the block. As it is a pristine block, we can continue with write through mode and forgo on using ora.

Upon entering a dirty block that is already in use, we switch to ora mode. For that we pick the screen code and multiply it by 8 (+offset) to have the resulting pointer into the charset, a lookup table makes things easier and faster.

More examples (green ares need to be cleared):

Same but for a flat slope with decrementing y

Flat sloop with incrementing y

A special case were nothing needs to be cleared

Lines to be cleared when entering an already used block (ora is used from there on)

## Optimizations in y-direction.

The steep lines can be unrolled with ease, as we would maximum handle 8 lines in the unrolled code, as then a new block starts, the code would however also work in linear columns as being used with a 16×16 grid, but possibly checks on y need to be applied or things need to be draw blockwise as well. By unrolling the subtraction of dx can be handled before drawing pixels, then upon underflow n rows will be filled with the same value in one go. Thus the error does not need to be saved and restored per step, but per x step, and the bitmask also just needs to be loaded once per x step, while it can be shifted in an easier fashion too. Also, if a pixel was drawn for e.g. 4 times before, the next step can be simplified and 3 times dx already be subtracted without any check, in certain cases even all left pixels can be drawn without any further checks, as the next line segment must be either 3, 4 or 5 pixel long, depending if we just drew the minimum or maximum of pixels per chunk. This is, due to the fact, that we draw a fixed amount of pixels per step + 1 additional pixel from time to time to cope with the error, so a line would typically be drawn like:

X
X
X
X
X
X
X
X
X
X
X
X
X

or

X
X
X
X
X
X
X
X
X
X

So we can in the next step at least expect 3 pixels in a row, no matter if we follow the upper example where 4 pixels would be our minimum or the lower example where 4 pixels would be our maximum.

The said optimizations are only available in the write through mode and not if the new content needs to be ora'd in.

Upcoming a code snippet that expresses the said:

dec8_no_ora_entry
txa                ;fetch error
;8 lines to check
sbc dx
bcc dec8_exit_1    ;overflow on 1st line, 7 lines to go

sbc dx
bcc dec8_exit_2    ;overflow on 2nd line, 6 lines to go

sbc dx
bcc dec8_exit_3    ;overflow on 3rd line, 5 lines to go

sbc dx
bcc dec8_exit_4    ;...

sbc dx
bcc dec8_exit_5

sbc dx
bcc dec8_exit_6

sbc dx
bcc dec8_exit_7

sbc dx

tax

sta (cset),y
dey
sta (cset),y
dey
sta (cset),y
dey
sta (cset),y
dey
sta (cset),y
dey
sta (cset),y
dey
sta (cset),y
dey
jmp dupdate_y_s_     ;advance to next block

And one exit point as example:

dec8_exit_5
tax               ;save error

lda mask          ;write 5 pixels in a row
sta (cset),y
dey
sta (cset),y
dey
sta (cset),y
dey
sta (cset),y
dey
sta (cset),y
dey

bcs +             ;column change?

;3 sbcs won't underrun for sure, so we can draw the upcoming three dots without any further checks \o/
sta (cset),y
dey
sta (cset),y
dey
sta (cset),y
txa

adc+1 dydx        ;add dy and do three sbc dx without harm (you remember, we can't underrun so far)
sbc+1 dx
sbc+1 dx
tax
jmp dupdate_y_s   ;block change, all done
+
jmp dec_clear_3   ;preliminary column change, need to wipe 3 lines

## Clearing

As described, clearing can be done by simply wiping out the used screen. There's also the option to wipe out the charset until the last used char on each frame and forgo on the clearing per block. In practice it showed that the clearing on a block basis however performs already pretty fast. In case one would need unrolled code that is entered at the right offset, what will be quite costly with double buffering. For a faster clearing of the srceen one can log the screen per frame and thus build up a map of overall used blocks during the whole effect. Unused blocks do not need to be cleared over all. This makes the resulting speedcode faster and smaller and allows for clipped overlayed objects like the logo.