base:speeding_up_and_optimising_demo_routines
— | base:speeding_up_and_optimising_demo_routines [2015-04-17 04:33] (current) – created - external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== Speeding up & Optimising demo routines ====== | ||
+ | This article is aimed mainly at novice to intermediate coders of the Commodore 64, as professionals will already know this riff-raff. | ||
+ | |||
+ | When they begin to write their very nice looking demo effects, many coders fall into the black hole of slow speed. | ||
+ | |||
+ | ===== Problems with the Kernal ===== | ||
+ | Kernal routines are perhaps the worst routines to use when speed matters. | ||
+ | |||
+ | Here is a typical example: | ||
+ | < | ||
+ | SEI | ||
+ | loop LDA #$3B | ||
+ | CMP $D012 | ||
+ | BNE *-3 | ||
+ | |||
+ | DEC $D020 ; This is to check how much rastertime is used via scanlines. | ||
+ | |||
+ | ; And now the slow routine... | ||
+ | LDA #$01 | ||
+ | STA $0286 ; Set the cursor colour to white | ||
+ | ; ($01 = VIC colour white, $0286 = Current cursor colour). | ||
+ | JSR $E544 ; Kernal routine to clear the screen. | ||
+ | |||
+ | INC $D020 | ||
+ | |||
+ | JMP loop | ||
+ | </ | ||
+ | When you execute this routine, it needs more raster time than a single frame provides, which slows the whole effect down. | ||
+ | |||
+ | Below is a screenshot of what the result looks like on a C64. Look closely at where the light green scanline colour (controlled by DEC $D020) meets the scanline position that is compared against $D012, #$3B. Because the routine is still running when the raster passes scanline #$3B again, the compare misses it and has to wait for the raster to go through all the scanlines once more. The frame rate drops to 25 (PAL) or 30 (NTSC), half of what it should be. | ||
+ | |||
+ | {{base: | ||
+ | |||
+ | So how can we get around this problem? | ||
+ | |||
+ | We will now look at code which does EXACTLY the same job, but hand-coded with better knowledge: | ||
+ | < | ||
+ | SEI | ||
+ | loop1 LDA #$3B | ||
+ | CMP $D012 | ||
+ | BNE *-3 | ||
+ | DEC $D020 ; This is to check how much rastertime is used. | ||
+ | |||
+ | ;and now OUR version of the routine... | ||
+ | LDX #$00 | ||
+ | loop2 LDA #$01 ; Here is where we store our character colour | ||
+ | STA $D800,X | ||
+ | STA $D900,X | ||
+ | STA $DA00,X | ||
+ | STA $DB00,X | ||
+ | |||
+ | LDA #$20 ; Here we store our character to be placed on each part of the screen | ||
+ | STA $0400,X | ||
+ | STA $0500,X | ||
+ | STA $0600,X | ||
+ | STA $0700,X | ||
+ | | ||
+ | INX | ||
+ | BNE loop2 | ||
+ | |||
+ | INC $D020 | ||
+ | JMP loop1 | ||
+ | </ | ||
+ | This method obviously uses more bytes in memory, but technically it is much faster than the kernal method. | ||
+ | |||
+ | Here is a c64 screen shot of the better result. | ||
+ | |||
+ | {{base: | ||
+ | |||
+ | If you want to keep your code fairly short, then by all means use the kernal routines, but use them OUTSIDE real-time procedures. | ||
+ | |||
+ | If you use IRQ timing, you may still call the kernal outside the interrupt if you wish, but if you use kernal routines that modify the graphics in any way, you are more likely to see ugly side effects on your screen, depending on how much raster time your IRQ routines use. | ||
+ | |||
+ | You can also completely switch off the kernal through zero page address $01. Bit 1 (value #$02) of $01 controls whether the Kernal ROM, stored from $E000 to $FFFF, is visible to the CPU so that its built-in routines can be called. | ||
+ | By simply switching it off like this... | ||
+ | < | ||
+ | LDA $01 | ||
+ | AND #$FD | ||
+ | STA $01 | ||
+ | </ | ||
+ | ...the KERNAL ROM is then disabled, and you are also free to use the RAM from $E000 to $FFFF. It is a good thing to have the kernal switched off when you want to execute code fast. The only downside is that IRQ routines need to be set up differently. | ||
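A minimal sketch of how this switch is often wrapped, given as an assumption about typical usage rather than code from a particular demo: save the current value of $01, bank the ROM out, do the work, then restore it. The SEI/CLI pair is there because the standard Kernal interrupt handler is not reachable while the ROM is banked out.

```asm
        SEI             ; block IRQs: the Kernal IRQ handler is about to vanish
        LDA $01
        PHA             ; remember the current memory configuration
        AND #$FD        ; clear bit 1: RAM instead of ROM at $E000-$FFFF
        STA $01
        ; ... fast code that needs no Kernal calls goes here ...
        PLA
        STA $01         ; bring the previous configuration back
        CLI
```

Note that clearing bit 1 also replaces the BASIC ROM at $A000-$BFFF with RAM, so BASIC routines cannot be called in this state either.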
+ | |||
+ | Here is a code example of a typical IRQ routine when the KERNAL is switched ON: | ||
+ | < | ||
+ | SEI | ||
+ | LDA #$01 | ||
+ | STA $D01A | ||
+ | LDA #<irq | ||
+ | LDX #>irq | ||
+ | LDY #$32 | ||
+ | STA $0314 | ||
+ | STX $0315 | ||
+ | STY $D012 | ||
+ | LDA #$1B | ||
+ | STA $D011 | ||
+ | LDA #$7F | ||
+ | STA $DC0D | ||
+ | LDA $DC0D | ||
+ | CLI | ||
+ | JMP * | ||
+ | |||
+ | irq | ||
+ | (your code here) | ||
+ | INC $D019 | ||
+ | JMP $EA7E | ||
+ | </ | ||
+ | Of course there is theoretically nothing wrong with this method. | ||
+ | |||
+ | You can, however, still write an IRQ routine with the kernal switched OFF. Shown below is sample code of how to do this... | ||
+ | < | ||
+ | SEI | ||
+ | LDA #$35 | ||
+ | STA $01 ;Switch off the KERNAL ROM via value #$35 | ||
+ | LDA #$01 | ||
+ | STA $D01A | ||
+ | LDA #<irq | ||
+ | LDX #>irq | ||
+ | LDY #$32 | ||
+ | STA $FFFE | ||
+ | STX $FFFF | ||
+ | STY $D012 | ||
+ | LDA #$1B | ||
+ | STA $D011 | ||
+ | LDA #$7F | ||
+ | STA $DC0D | ||
+ | LDA $DC0D | ||
+ | CLI | ||
+ | JMP * | ||
+ | |||
+ | irq STA $02 | ||
+ | LDA $DC0D | ||
+ | STX $03 | ||
+ | STY $04 | ||
+ | (your code here) | ||
+ | |||
+ | LDA #$01 | ||
+ | STA $D019 | ||
+ | LDY $04 | ||
+ | LDX $03 | ||
+ | LDA $02 | ||
+ | RTI | ||
+ | </ | ||
+ | As you can see, there are quite a few changes compared to the previous example. | ||
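For comparison, here is a sketch of the same kind of handler saving the registers on the stack instead, which is how the Kernal's own handler does it. It is shown only to illustrate why saving to zero page is the faster choice. The cycle counts in the comments are the standard 6510 timings, and the label irq2 is a placeholder.

```asm
irq2    PHA             ; 3 cycles
        TXA             ; 2
        PHA             ; 3
        TYA             ; 2
        PHA             ; 3   -> 13 cycles to save A, X and Y
        ; (your code here)
        LDA #$01
        STA $D019       ; acknowledge the raster interrupt
        PLA             ; 4
        TAY             ; 2
        PLA             ; 4
        TAX             ; 2
        PLA             ; 4   -> 16 cycles to restore A, X and Y
        RTI
```

Saving with STA/STX/STY to zero page and restoring with LDY/LDX/LDA costs 3 cycles per instruction, so 9 + 9 = 18 cycles against 13 + 16 = 29 here. The price is that the zero page addresses used must not be touched by any other code.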
+ | |||
+ | Now take a look at the " | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | ===== Using efficient Opcodes ===== | ||
+ | Over the years I have always taken pleasure in watching some really great demos and then taking a peek at the code with an MC monitor. | ||
+ | |||
+ | Here is a particular example...a section of code to perform a 3x3 scroller: | ||
+ | < | ||
+ | ldx #$00 | ||
+ | loop lda $0401,x | ||
+ | sta $0400,x | ||
+ | lda $0429,x | ||
+ | sta $0428,x | ||
+ | lda $0451,x | ||
+ | sta $0450,x | ||
+ | inx | ||
+ | cpx #$27 | ||
+ | bne loop | ||
+ | |||
+ | (your scroll text code here) | ||
+ | |||
+ | rts | ||
+ | </ | ||
+ | Theoretically there is nothing wrong with this piece of code, as it simply performs a left scroller. It is, however, slower than it needs to be. | ||
+ | |||
+ | The best way to avoid this problem is to use opcodes that take up fewer cycles than the ones currently used. There are many C64 programming manuals and internet links available which display a table of the MOS Technology 6502/6510 opcodes, including the number of cycles each opcode uses. My advice is to read them and thoroughly understand which opcodes use fewer cycles, as this is a very handy way to write faster code. | ||
+ | |||
+ | Now, let's take a closer look at the example code above. | ||
+ | |||
+ | According to the opcodes table in many programming manuals, INX, CPX #$27 and BNE take at least 2 cycles each (a taken branch costs one cycle more), and the loop runs 39 ($27) times: | ||
+ | |||
+ | 39 * (2 + 2 + 2) = 234 cycles. | ||
+ | |||
+ | This only goes for those three opcodes in the loop. We still need to calculate the number of cycles for the rest of the routine. | ||
+ | |||
+ | lda $0000,x takes up to 5 cycles. | ||
+ | |||
+ | sta $0000,x takes 5 cycles. | ||
+ | |||
+ | Because this is a 3x3 scroller, these two opcodes are called 3 times in each pass of the loop: | ||
+ | |||
+ | ( (5 + 5) * 3 ) * 39 = 1170 cycles. | ||
+ | |||
+ | Now add this to the previous calculation: 1170 + 234 = 1404 cycles in TOTAL (plus however many cycles are used for the rest of the code). | ||
+ | |||
+ | Judging from all this, the routine is quite inefficient for speed. | ||
+ | |||
+ | So, instead we can take out that useless loop feature and then write the code like this: | ||
+ | < | ||
+ | lda $0401,x | ||
+ | sta $0400,x | ||
+ | lda $0429,x | ||
+ | sta $0428,x | ||
+ | lda $0451,x | ||
+ | sta $0450,x | ||
+ | ; ^^ the ,x index is still in place, but without the loop X never changes | ||
+ | | ||
+ | | ||
+ | |||
+ | (your scroll text code here) | ||
+ | |||
+ | rts | ||
+ | </ | ||
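+ | The syntax is correct, however logically it won't work: without the loop, X never changes, so only one column is ever copied. It would however use fewer cycles. Repeated for all 39 columns, only the loads and stores remain: | ||
+ | 1170 cycles. | ||
+ | |||
+ | Remember that LDA $0000,x and STA $0000,x both take up to 5 cycles. The absolute versions without the index take only 4 cycles each: | ||
+ | |||
+ | LDA $0000 | ||
+ | STA $0000 | ||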
+ | |||
+ | So now we can write the code like this: | ||
+ | |||
+ | < | ||
+ | lda $0401 | ||
+ | sta $0400 | ||
+ | lda $0429 | ||
+ | sta $0428 | ||
+ | lda $0451 | ||
+ | sta $0450 | ||
+ | | ||
+ | lda $0402 | ||
+ | sta $0401 | ||
+ | lda $042A | ||
+ | sta $0429 | ||
+ | lda $0452 | ||
+ | sta $0451 | ||
+ | ; ...and so on, until all 39 columns have been copied | ||
+ | | ||
+ | |||
+ | (your scroll text code here) | ||
+ | |||
+ | rts | ||
+ | </ | ||
+ | This version is correct in both syntax and logic, and uses the following number of cycles: | ||
+ | |||
+ | ( (4 + 4) * 3 ) * 39 = **936** cycles! | ||
+ | |||
+ | This is a 468 cycle difference between the original version (1404 cycles) and the optimised version, so the routine takes a third fewer cycles on your machine. | ||
+ | |||
+ | |||
+ | ===== Avoid repetitive use of Sub-routines ===== | ||
+ | |||
+ | There is always one popular advantage and one disadvantage to using the JSR and RTS opcodes: they keep the code short, but every call costs extra cycles (6 for the JSR and 6 for the RTS). | ||
+ | |||
+ | The best advice is to try and avoid them in speed-critical code. Sometimes however they are quite difficult to avoid, and you must make way for them in some places - for example, playing music. | ||
+ | |||
+ | Remember - think before you do. | ||
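As a small illustration, assuming a hypothetical helper called flash that only changes the border colour: JSR and RTS take 6 cycles each on the 6510, so calling the helper costs 12 cycles more than writing its two instructions in place. The two versions below are separate fragments, not one program.

```asm
; Version 1: subroutine call, 12 cycles of overhead per call
        jsr flash       ; 6 cycles
        ; ... rest of the effect ...

flash   lda #$01        ; 2 cycles
        sta $d020       ; 4 cycles
        rts             ; 6 cycles

; Version 2: the same work written inline, no call overhead
        lda #$01        ; 2 cycles
        sta $d020       ; 4 cycles
```

Inlining costs a few more bytes each time the code is used, which is the usual trade of memory for speed.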
+ | |||
+ | ===== Shrinking down optimized code for better crunching performance ===== | ||
+ | |||
+ | Now that we have our nicely optimized and faster piece of code, there is still one slight problem: code that is written out many times takes up a lot of memory, which makes the crunched file bigger. | ||
+ | |||
+ | There are however some very nice tricks for shortening your code before crunching. | ||
+ | |||
+ | < | ||
+ | *=$1000 | ||
+ | |||
+ | lda $d012 ;Compare raster | ||
+ | cmp $d012 ; | ||
+ | bne *-3 ; | ||
+ | lda #$02 ;Do some random effects... | ||
+ | sta $d020 ; | ||
+ | sta $d021 ; | ||
+ | lda #$03 ; | ||
+ | sta $d021 | ||
+ | lda #$04 ; | ||
+ | sta $d021 | ||
+ | lda #$05 | ||
+ | sta $d021 | ||
+ | lda #$06 | ||
+ | sta $d021 | ||
+ | sta $d020 | ||
+ | |||
+ | [ NOTE: a copy of the code above is repeated 64 ($40) times from the start location ] | ||
+ | </ | ||
+ | Looking at the example, this section of code above is stored 64 times in memory. | ||
+ | |||
+ | What you **could** write is a routine which copies and pastes the same code every time... | ||
+ | |||
+ | < | ||
+ | ;This code is used in set-up before you actually start the IRQ | ||
+ | *=$0f00 | ||
+ | |||
+ | ;Store the start memory location in zeropage pointers | ||
+ | lda #<$1000 | ||
+ | sta $02 | ||
+ | lda #>$1000 | ||
+ | sta $03 | ||
+ | | ||
+ | ldx #64 ;Number of copies to make | ||
+ | |||
+ | loop | ||
+ | ldy #$00 ;Always set Y at zero | ||
+ | | ||
+ | copy ;Inner loop: copies one byte per pass | ||
+ | lda CodeSource,y | ||
+ | sta ($02),y | ||
+ | |||
+ | iny | ||
+ | |||
+ | ;compare | ||
+ | cpy #$27 ;Number of bytes the raw code uses. | ||
+ | bne copy ;Branch to copy, not loop, so Y is not reset | ||
+ | |||
+ | tya ;Add offset | ||
+ | clc | ||
+ | adc $02 ;Sets carry if the low byte goes past $ff | ||
+ | sta $02 | ||
+ | lda $03 | ||
+ | adc #$00 ;Adds 1 if there is a carry from before. | ||
+ | sta $03 | ||
+ | sta $03 | ||
+ | | ||
+ | dex | ||
+ | bne loop | ||
+ | |||
+ | [....jump to irq from this point forward...] | ||
+ | | ||
+ | |||
+ | ;This is the raw code to be copied and pasted each time. | ||
+ | | ||
+ | *=$1000 | ||
+ | |||
+ | CodeSource ;Label read by the clone routine | ||
+ | lda $d012 | ||
+ | cmp $d012 | ||
+ | beq *+2 | ||
+ | lda #$02 | ||
+ | sta $d020 | ||
+ | sta $d021 | ||
+ | lda #$03 | ||
+ | sta $d021 | ||
+ | lda #$04 | ||
+ | sta $d021 | ||
+ | lda #$05 | ||
+ | sta $d021 | ||
+ | lda #$06 | ||
+ | sta $d021 | ||
+ | sta $d020 | ||
+ | </ | ||
+ | Let's go through this useful routine step by step. | ||
+ | |||
+ | First of all, you need to know how many bytes the main raw code uses. Simply do this by entering the raw code at $1000 (using an MC-monitor) and then counting how many bytes it uses via the memory position. | ||
+ | |||
+ | Next, we need to store the number of copies to make in the X register - ldx #64 | ||
+ | |||
+ | Zeropage addresses $02 and $03 are used to carry the memory pointer of where the code is written. | ||
+ | |||
+ | - It starts off with the start position of the code $1000 - or $03 = $10, $02 = $00. | ||
+ | |||
+ | - Each time round the loop, the size of the raw code - $27 - is added to these addresses. In other words: | ||
+ | |||
+ | [Value in $02] = [Value in $02] + #$27 | ||
+ | |||
+ | - If there is a carry [Value in $02 goes over #$ff and back to #$00] then $03 is incremented. | ||
+ | |||
+ | |||
+ | Now that all of this is coded in, the length of the whole code will be a sum of: | ||
+ | |||
+ | Length of the clone routine [$0f00-$0f26] + length of one copy of the raw code ($27 bytes) | ||
+ | |||
+ | (Simplified: $26 + $27 = $4d bytes!) | ||
+ | |||
+ | The 64 unrolled copies take 64 * $27 = $09C0 bytes, so this is a difference of: $09C0 - **$4d** = $0973 bytes. | ||
+ | |||
+ | Now that the whole code has shrunk down, the crunched output file will be smaller too. | ||
+ | Remember that the clone routine is only recommended for execution at set-up. | ||
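One loose end is how the real-time code gets back out of the cloned block. A possible approach, offered as a sketch and not as part of the original routine: once the clone loop has finished, $02/$03 point at the first byte after the last copy, so an RTS can be stored there and the whole block called as one subroutine.

```asm
        ldy #$00
        lda #$60        ; $60 is the opcode for RTS
        sta ($02),y     ; place it right after the last copy

        ; later, from the real-time code:
        jsr $1000       ; run all the copies, then return
```

This adds a single JSR/RTS pair per frame rather than one per copy, so the call overhead stays negligible.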
+ | |||
+ | ---- | ||
+ | ---- | ||
+ | |||
+ | Thanks for reading! |
base/speeding_up_and_optimising_demo_routines.txt · Last modified: 2015-04-17 04:33 by 127.0.0.1