The Ninja-Method

Written by St0fF/Neoplasia^theObsessedManiacs

The basic idea has a pretty long history. I heard of “using the CIA directly to compensate jitter” long ago, but the first code I saw that really uses this idea was a very optimized 2x2-FLI routine by Wolfram Sang (Ninja / The Dreams). A working routine on paper is always a good starting point for writing one's own. Now that I have finished a 4x4-FLI routine based on this, I'll gladly explain the details.

Idea of an NMI-driven 4x4-routine

We'll set up CIA #2 to trigger an NMI every 8th rasterline (on PAL: a counter of 8*63-1 = 503 cycles). Then we set up CIA #1 Timer B to count down our jitter, which means CIA #1 Timer B must be started a little after the CIA #2 timer.

Why CIA #1 Timer B, you might ask? Well, the CBM documentation about CIAs says: “Writing to a timer hi-value puts the written data into the timer, if it is not running. Reading a timer returns the current count-down value.” So if we stop a timer and write data to it in the correct lo/hi order, it will always read back the last written 16-bit value. Imagine we stop Timer A and put a value of $004c into it: the CPU will always read a Timer A value of $004c. Timer A sits at $dc04/$dc05 and the lo-byte of Timer B follows at $dc06, so if we were 1337 enough to execute $dc04, we'd execute

jmp $XX00

where XX is the lo-value of the Timer B counter as read in cycle 3 of the jmp instruction.
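To make the byte layout concrete, here is a small Python model (not C64 code) of what the CPU fetches when it executes from $dc04. The function name `decode_jmp_at_dc04` and its parameters are made up for this illustration; the only assumption is the register order $dc04 = Timer A lo, $dc05 = Timer A hi, $dc06 = Timer B lo.

```python
JMP_ABS = 0x4C  # 6502 opcode for "jmp $nnnn"

def decode_jmp_at_dc04(timer_a, timer_b_lo):
    """Return the jump target the CPU sees when executing from $dc04.

    timer_a    -- 16-bit value held by the stopped CIA #1 Timer A
    timer_b_lo -- lo-byte of the running CIA #1 Timer B at the moment
                  the CPU fetches the third byte of the instruction
    """
    opcode = timer_a & 0xFF             # byte at $dc04 (Timer A lo)
    target_lo = (timer_a >> 8) & 0xFF   # byte at $dc05 (Timer A hi)
    target_hi = timer_b_lo & 0xFF       # byte at $dc06 (Timer B lo)
    if opcode != JMP_ABS:
        raise ValueError("Timer A lo must hold $4c to form a jmp")
    return (target_hi << 8) | target_lo

print(hex(decode_jmp_at_dc04(0x004C, 0x12)))  # prints 0x1200
```

With Timer A holding $004c, every value of the Timer B lo-byte selects a different page, which is what lets one routine per jitter value be reached.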

Now imagine we point the NMI vector at this $dc04. The NMI sequence takes 7 cycles, the jmp takes 3, and we will have a jitter of 0-7 cycles. This means our routines are entered between 10 and 17 cycles after the NMI-timer ran out. As the jmp lands in a different routine for each jitter-value, we know exactly how many cycles to compensate in each routine.
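The arithmetic above can be written down as a tiny Python sketch. The constant and function names are illustrative only; the numbers (63 cycles per PAL rasterline, 7 cycles for the NMI, 3 for the jmp, 0-7 cycles of jitter) are the ones used in the text.

```python
PAL_CYCLES_PER_LINE = 63
NMI_ENTRY_CYCLES = 7   # interrupt sequence before the first instruction
JMP_CYCLES = 3         # the jmp assembled from the CIA registers
MAX_JITTER = 7

def nmi_timer_value(lines=8):
    """Timer value for one NMI every `lines` PAL rasterlines."""
    return lines * PAL_CYCLES_PER_LINE - 1

def entry_latency(jitter):
    """Cycles between timer underflow and the first routine instruction."""
    if not 0 <= jitter <= MAX_JITTER:
        raise ValueError("jitter must be 0..7")
    return NMI_ENTRY_CYCLES + JMP_CYCLES + jitter

print(nmi_timer_value())                                    # prints 503
print([entry_latency(j) for j in range(MAX_JITTER + 1)])    # 10 up to 17
```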

For this amount of precision all we need is precision while setting it all up. Beware: exact timing during setup is extremely crucial, because any $dc04-jmp to undefined memory will most likely crash. And as we use an NMI, any careless messing around with $dd0d might trigger an NMI and thereby most likely crash, too. Pressing RESTORE will also most likely crash - but anybody who does that is out of his or her mind anyhow (as Oswald said)…

Caveats

$dd0d

To not mess around with $dd0d badly, you'll have to follow two simple rules: Activate the NMI by

lda #$81
bit $dd0d
sta $dd0d

Deactivate the NMI by

lda #$7f
sta $dd0d

Just do not bit $dd0d “just in case”, as it might create the crashing results I mentioned above.

6526 vs. 6526A

Another caveat is the difference between 6526 and 6526A. We know the 6526 triggers Interrupts one cycle after the timer ran out, but the 6526A triggers in the “right moment”. One could argue which is right, but it doesn't really matter. We just have to take the respective counter-measures. So all we need is a CIA-detection and a proper

if (6526) start NMI-timer one cycle earlier

As on both CIA-types the “normal counter operation” is the same, the jitter-timers need to be started at the same cycle for both types.
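As a rough cross-check, the following Python sketch adds up the cycle counts annotated in the listing further down, following its “(6526 / 6526A)” comments. The function name `setup_cycles` is hypothetical, and which branch is taken on which chip is read off those comments, so treat it as an assumption rather than a measurement.

```python
def setup_cycles(old_cia):
    """Cycles counted from .check_3, per the listing's annotations.

    Returns (cycles until the NMI-timer is started,
             cycles until the branch before the Timer B delay is done).
    """
    first_branch = 2 if old_cia else 3    # bpl .check_4 (2/3 in the listing)
    second_branch = 3 if old_cia else 2   # bmi .check_5 (3/2 in the listing)
    # lda #/sta pairs up to and including "sta $dd0e"
    stores = 2 + 4 + 4 + 2 + 4 + 4 + 2 + 4 + 2 + 4
    nmi_timer_start = 4 + first_branch + stores       # 4 = lda .CIA_type
    jitter_timer_path = nmi_timer_start + 2 + second_branch  # 2 = ldx #
    return nmi_timer_start, jitter_timer_path

print(setup_cycles(old_cia=True))    # prints (38, 43)
print(setup_cycles(old_cia=False))   # prints (39, 43)
```

The first number differs by one cycle between the two chips, while the second is identical, which matches the rule stated above: only the NMI-timer start moves, the jitter-timer start does not.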

Actually, JackAsser mentioned in the CSDB-forums that we could go without any CIA-detection. But more on that in the “Advantages” section…

the C0DE (49374)

I'm sorry that some labels have more or less German names. I translated all of my documentation; I think that is sufficient.

;-------------------------------------------------------------------------------
;TODO: declare variables before !src'ing this file
;      (must be solvable even in the first pass!)
;
;NMI_base = address of the first NMI-routine in memory (ATTENTION: theoretically
;           those might be located anywhere (except underneath the IO), if the
;           CIA#1TimerB just counts down from a high enough value.  But most
;           demo coders like to make one of the timers reference-count exactly
;           one rasterline, which would make $0100-$3700 the range for NMI_base)
;zpreg    = zp-address for storing the accu during NMIs
;d018wert1= D018-value of the upper 4 pixel-rows
;d018wert2= D018-value of the lower 4 pixel-rows
;d011wert1= D011-value for D018wert1
;d011wert2= D011-value for D018wert2
;
;ATTENTION: this routine just does the syncing and starting of all timers.  You
;still need to create some raster-IRQs that start and stop the timer-NMIs.
;===============================================================================
!ifdef NMI_base {

.wartung	inx		;loop for waiting exactly 52 cycles
		ldy #7		;(incl. jsr .wartung)
.check6		dey
		bne .check6
.check_6	nop
.rts		rts

init4x4
;FIRST UP: CIA-Detection and Initialization
		and #0
		sta .CIA_type
		sta $dd05
		sta $dc0e	;stop all timers
		sta $dc0f
		sta $dd0e
		sta $dd0f
		ldy #$7f	;disallow all Timer-Interrupts
		sty $dc0d
		cmp $dc0d
		sty $dd0d
		cmp $dd0d
		lda #4		;prepare Detection (timer=4 cycles)
		sta $dd04
		bit $d011	;wait for border, then start ...
		bpl *-3
		lda #<.CIA_detect_nmi
		sta $fffa
		lda #>.CIA_detect_nmi
		sta $fffb
		lda #$81
		ldx #%10011001
		stx $dd0e
		sta $dd0d
		bit $dd0d
		dec .CIA_type
.CIA_detect_nmi pla
		pla
		pla
		sty $dd0d	;deactivate Timer-NMI
		cmp $dd0d

;ATTN: for mathematical purposes a line starts at cycle 0 and ends at cycle 62!

		ldx #$03	;half variance delay:
.check0		cpx $d012       ;check is at cycle    0  1  2  3  4  5  6
		bne .check0     ;cycle in rasterline
                                ;ending this command: 2  3  4  5  6  7  8
.check_0	jsr .wartung	;waste 54 cycles ... this just made 52 of them
		nop
		cpx $d012       ;now check in cycle: 60 61 62  0  1  2  3
.check1		beq .check_1
		cmp ($00),y     ;ending this command: 4  5  6  3  4  5  6
.check_1	jsr .wartung
		nop
.check2		cpx $d012       ;now check in cycle: 62  0  1 61 62  0  1
		beq .check_2
		bit $ea         ;ending this command: 4  3  4  3  4  3  4
.check_2	jsr .wartung
		bit $ea
		cpx $d012	;now check in cycle:  0 62  0 62  0 62  0
.check3		bne .check_3	;after this, we're at 2  2  2  2  2  2  2

;Calculation of timings:
;=======================
;- the first STA $d011 MUST end in cycle 13 (when starting to count at 0)
;- that means, it starts on cycle 9, 
;- that means LDA #D018WERT starts at cycle 1
;- with 7 cycles jitter another sta zp (3 cycles) happens before
;means: 3 (save accu) + 7 (Jitter) + 3 (jmp) + 7 (NMI itself) = 20
;-> NMI must execute at cycle 44, so the (ForceLoad+Run) command has to
;   do its write in cycle 42 (as the nmi then happens after cycle 43)
.check_3			;cycle-counting: (6526 / 6526A), starting
                                ;on cycle 3 ...
		lda .CIA_type	;4
.check4		bpl .check_4	;2/3

.check_4	lda #<8*63-1	;2
		sta $dc06	;4
		sta $dd04	;4
		lda #>8*63-1	;2
		sta $dc07	;4
		sta $dd05	;4
		lda #$4c	;2
		sta $dc04	;4
		lda #%10010001	;2
		sta $dd0e	;4 = 38/39 -> cycle 42 of this RL on 6526A

	.CIA_type = *+1
		ldx #0		;2
.check5		bmi .check_5	;3/2
.check_5			;=5/4

;2nd calculation:
;================
;NMI = 7 cycles, jmp = 3 cycles, max.Jitter = 7 cycles
;means: 17+1 cycles later Timer B shall be started to "land at" $0000. This also
;means that we have to start Timer B 17+1+hi(NMI_base) cycles later.

	!set .rest = (>NMI_base) + 14

	!do while .rest > 5 {
		nop
		!set .rest = .rest - 2
	}
	!if .rest = 4 {
		sta $dc0f
	} else {
		sta $dc0f,y
	}
        ;use CIA#1timerA as simple memory location for jmp $xx00
	!set .jmpval = (<NMI_base)*$100 + $4c
		+mv16im .jmpval,$dc04
	;use that as NMI-vector
		+mv16im $dc04,$fffa
		rts
		
;IMPORTANT: check all the "checks", we do not want to lose any cycles because a
;page boundary was crossed within a branch command.
!if ((>.check1 != >.check_1) OR (>.check2 != >.check_2) OR (>.check3 != >.check_3) OR (>.check4 != >.check_4) OR (>.check5 != >.check_5) OR (>.check6 != >.check_6)) {
	!serious "Page boundary crossed where it was a bad thing to happen.  Relocate the Code!"
}
;===============================================================================
!macro flinmi jitter, offset {
	* = offset + $0100*(8-jitter)
	!if (jitter & 1) = 0 {
		sta zpreg		;3
	} else {
		sta .thisreg		;4
	}
	!if jitter < 5 {
		bit $dd0d		;4
	    !if (jitter = 0) {
		nop			;2
		nop			;2
	    }
	}
	!if ((jitter & 3) = 1) OR ((jitter & 3) = 2) {
		nop			;2
	}
	;the actual FLI-IRQ
		lda #d018wert2	;so this instruction must start at cycle 1
		sta $d018
		lda #d011wert2
		sta $d011	;last cycle must be cycle 14 of the rasterline!
		lda #d011wert1
		sta $d011
		lda #d018wert1
		sta $d018
		
	!if jitter >= 5 {
		bit $dd0d
	}
	!if (jitter & 1) = 0 {
		lda zpreg
	} else {
	.thisreg = *+1
		lda #0
	}
		rti
}
;-------------------------------------------------------------------------------
;create the NMI-Routines right on their spot
!set .oldaddr = *
!set antijitter = 0
!do {
	!set jitterVal = 8 - antijitter
	+flinmi jitterVal, NMI_base
	!set antijitter = antijitter+1
} while antijitter < 8
* = .oldaddr
;===============================================================================
} else {
	!serious "NMI_base not declared.  Assembly must fail!"
}
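
The padding logic of the flinmi macro can be checked with a short Python model. This is only a sketch: `padding_cycles` and `page_offset` are names invented here, and the cycle counts are the ones written in the macro's comments (3 for sta zpreg, 4 for sta .thisreg, 4 for bit $dd0d, 2 per nop).

```python
def padding_cycles(jitter):
    """Cycles the flinmi macro spends before 'lda #d018wert2'."""
    cycles = 3 if (jitter & 1) == 0 else 4   # sta zpreg / sta .thisreg
    if jitter < 5:
        cycles += 4                          # bit $dd0d placed up front
        if jitter == 0:
            cycles += 4                      # two extra nops
    if (jitter & 3) in (1, 2):
        cycles += 2                          # one extra nop
    return cycles

def page_offset(jitter):
    """Offset from NMI_base at which the macro places this variant."""
    return 0x100 * (8 - jitter)

for jitter in range(8, 0, -1):               # values the !do loop generates
    print(jitter, hex(page_offset(jitter)), padding_cycles(jitter))
```

In this model padding plus jitter always adds up to 11 cycles, so every variant reaches the first lda at the same point of the rasterline, and the variants with little padding to burn acknowledge the NMI after the VIC writes instead of before.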

Advantages

no preparation for the next interrupt, no static starting cycle

If we create a “second” badline in the middle of a charline to switch to a different screen-ram, a raster-IRQ would start at cycle #0 of the rasterline before the one we want to make bad, while this routine's interrupt starts “just in time”. The raster-method thus loses around 44 cycles (in this case of FLI), because we cannot tell VIC “do your raster-irq in cycle 17”. That time may be “abused” for jitter correction and preparing the next “condition”, like an “auxiliary timer method” routine does, but it's still wasted cycles.

Take the NMI instead: the CIA just reloads its timer - 0 cycles wasted; the only thing to do is create a starting and an ending condition. As we reach one of our NMI-routines directly, with jitter == 7 we can start playing with VIC right away and after that acknowledge the NMI and do whatever. With jitter == 3 we could first acknowledge the NMI and then play with VIC. So basically we can abuse the unwanted jitter to save cycles and do some housekeeping - we just minimize all the “waiting for the right moment” and “wasting inaccuracies with nop”.

Advantage brought up by JackAsser

We could actually care less about the CIA-differences by adding another Jitter-Routine. So with a new CIA we'd only use the “0-7 cycles of jitter”-routines, with an old CIA (remember: it fires the Interrupt one cycle later) we'd use “1-8 cycles of jitter”-routines. Now it's up to you if you can waste or abuse (whatever makes you happy) another page for another routine. After all the NMI-routines are shorter than the CIA-detection (less than $20 bytes vs. a little more than $20 bytes), so it would save some space altogether, but the code distribution would be worse…
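Counting the pages this variant would need is simple enough to sketch in Python. The names below are illustrative, and the jitter ranges are the ones quoted in the paragraph above.

```python
NEW_CIA_JITTER = range(0, 8)   # 6526A: 0-7 cycles of jitter
OLD_CIA_JITTER = range(1, 9)   # 6526: fires one cycle later, 1-8 cycles

def routines_needed(*jitter_ranges):
    """Number of distinct jitter routines covering all given ranges."""
    return len(set().union(*jitter_ranges))

print(routines_needed(NEW_CIA_JITTER))                   # prints 8
print(routines_needed(NEW_CIA_JITTER, OLD_CIA_JITTER))   # prints 9
```

So covering both chips without detection costs exactly one more routine, and with it one more page.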

Disadvantages

Any interrupt can jitter 0 to 7 cycles. This creates the need to compensate with 8 different routines, which indeed means we lose 8 pages to hold our NMI-routines. But don't be a prick: a simple FLI does not need that much RAM. One could e.g. mix the NMI-pages (let's call 'em that) with data, or even locate other code around them…
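To see where those 8 pages end up, here is a small Python sketch. The helper names are made up, and the second function assumes the case mentioned in the listing's header comment, where Timer B reference-counts one PAL rasterline so that its lo-byte stays within 0 to 62.

```python
ROUTINES = 8

def entry_addresses(nmi_base):
    """Start address of each of the 8 NMI routines, one per page."""
    return [nmi_base + 0x100 * page for page in range(ROUTINES)]

def highest_base_page(timer_b_max_lo):
    """Highest hi-byte NMI_base can have if the Timer B lo-byte
    never exceeds timer_b_max_lo."""
    return timer_b_max_lo - (ROUTINES - 1)

print([hex(a) for a in entry_addresses(0x3700)])
print(hex(highest_base_page(62)))   # prints 0x37
```

The result $37 agrees with the upper end of the $0100-$3700 range given for NMI_base in the listing. The bytes between the routine ends and the next page start stay free for data or other code.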

To make the code-part work, here you'll find a complete demo-part. Please assemble test4x4fli.a with ACME 0.93. If you're missing any of the library-routines I use or get errors while compiling: these are my slightly changed library routines.

St0fF / Neoplasia

Codebase64 wiki
