base:advanced_optimizing

Differences

This shows you the differences between two versions of the page.

base:advanced_optimizing [2016-01-20 11:16] – [SAX/SHA] bitbreaker
base:advanced_optimizing [2024-03-03 11:06] (current) – [ASR] bitbreaker
Line 321: Line 321:
 A further advantage of this method is that an additional register is free, as it is no longer used as an index. But be aware: you have to store values top-down, as the stack pointer decreases on every push. The advantage is that if an interrupt occurs in between, it will not trash your values on the stack, as it pushes its 3 bytes (PC + status) below your current position. All you need to take care of is that you don't under-run the stack in case of an interrupt (which needs 3 bytes; if you do a JSR in the interrupt handler, another 2 bytes are needed per level), or trash still valid content in the upper part of the stack.
 For reading out your values from the stack you can either use pla or, much easier, e.g. lda $0100,x
 +
 +===== Counting with steps greater than 1 =====
 +
 +Later we will see how to do this with SBX, but there is another easy option that also lets us use LAX to fetch the index, or even the values of a function that we walk along:
 +
 +<code>
 +count = $20
 +           ldx #$00
 +           ldy #$00
 +-
 +           stx count,y
 +           iny
 +           txa
 +           sbx #-3
 +           cpx #$60
 +           bne -
 +
 +           ...
 +
 +.index     lax count
 +           ...
 +           do stuff with X and A
 +           ...
 +           inc .index + 1
 +</code>
 +
 +As you see, the inc .index + 1 makes the lax fetch its value from the next zeropage location on the next turn. Thus we have A and X increased by 3 on each round, all done in 9 cycles, and with the option of destroying X later on.
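The table-building loop can be modelled in a few lines of Python to check what ends up in zeropage. This is only a sketch: `sbx` and `build_step_table` are hypothetical helper names, and SBX is modelled as X = ((A AND X) - imm) AND $ff, which is how the opcode is commonly documented.

```python
def sbx(a, x, imm):
    """Model of the illegal opcode SBX: X = (A AND X) - imm, wrapped to 8 bits."""
    return ((a & x) - imm) & 0xFF


def build_step_table(step=3, limit=0x60):
    """Mirror the loop: stx count,y / iny / txa / sbx #-step / cpx #limit / bne."""
    table = []
    x = 0
    while True:
        table.append(x)                # stx count,y ; iny
        a = x                          # txa
        x = sbx(a, x, (-step) & 0xFF)  # sbx #-3 effectively adds 3 to X
        if x == limit:                 # cpx #$60 ; bne -
            break
    return table


table = build_step_table()
print(len(table), [hex(v) for v in table[:4]], hex(table[-1]))
```

With the values from the listing the table holds 32 entries, so the self-modified lax can be advanced 32 times before it runs past the end of the table.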
  
 ===== Counting bits =====
Line 691: Line 718:
  
 The advantage is that you can also move bits across registers and are not restricted to the accumulator.
 +
 +When shifting, we handle 9 bits, as the bit falling out at one edge of the byte will be the new carry, and the old carry will be shifted in. This will introduce a gap of one bit, when we wrap around bits:
 +
 +<code>
 +        lda #%11111111
 +        clc
 +        rol
 +        rol
 +        ;-> A = %11111101
 +        ;              ^
 +        ;             gap :-(
 +</code>
 +
 +To avoid this behavior there are several ways around it:
 +
 +<code>
 +        lda #%11111111
 +        asl
 +        adc #0
 +        
 +        ...
 +        
 +        lda #%11111111
 +        anc #$ff
 +        rol
 +        
 +        ...
 +        
 +        lda #%11111111
 +        cmp #$80
 +        rol
 +</code>
 +
 +This way bit 7 is copied to carry first and then shifted in on the right end again.
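A small Python model makes the gap and its fix easy to verify for every byte value. It is a sketch under simple assumptions: only A and the carry are modelled, and the function names are made up for illustration.

```python
def rol(a, carry):
    """ROL: shift left through carry; returns (new A, new carry)."""
    return ((a << 1) | carry) & 0xFF, a >> 7


def rotate_asl_adc(a):
    """asl / adc #0: bit 7 goes to carry and is added back in as bit 0."""
    carry = a >> 7              # asl moves bit 7 into carry
    a = (a << 1) & 0xFF         # bit 0 is now 0, so the add cannot overflow
    return (a + carry) & 0xFF   # adc #0


def rotate_carry_first(a):
    """anc #$ff / rol or cmp #$80 / rol: copy bit 7 to carry, then rotate."""
    carry = 1 if a >= 0x80 else 0
    return rol(a, carry)[0]


# The 9 bit rotation from the text: clc / rol / rol leaves a gap.
a, c = rol(0xFF, 0)
a, c = rol(a, c)
print(f"{a:08b}")                      # 11111101
print(f"{rotate_asl_adc(0xFF):08b}")   # 11111111
```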
 +
 +If you deal with chars, you often need numbers divided by 8. This also includes numbers bigger than 8 bits, as the screen is 320 pixels wide. If you include clipping you might even span a bigger range.
 +An easy way to shift 11 bits down to a final 8 bit result, without having to shift two bytes independently, is the following:
 +
 +<code>
 +        lda xhi        ;00000hhh
 +        asr #$0f       ;000000hh h - a plain lsr also works if no upper bits need to be masked off
 +        ora xlo        ;lllll0hh h
 +        ror            ;hlllll0h h
 +        ror            ;hhlllll0 h
 +        ror            ;hhhlllll 0
 +</code>
 +
 +As the least significant 3 bits are lost during the shift anyway, we place the bits of the highbyte there and rotate them back in on the left side, so all we need to shift is a single byte. To make the rotation work, the highbyte needs to be preshifted by one before the lowbyte is merged in. The only prerequisite of this method is that the lowbyte must have its least significant three bits cleared.
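The bit pattern comments can be checked against a plain division by 8. The following Python sketch models the six instructions; `div8_11bit` is a hypothetical name, and ASR is assumed to behave as AND with the immediate followed by LSR.

```python
def div8_11bit(xhi, xlo):
    """Model of: lda xhi / asr #$0f / ora xlo / ror / ror / ror."""
    a = xhi & 0x0F        # asr #$0f: AND with the immediate ...
    carry = a & 1         # ... then LSR, bit 0 falls into carry
    a >>= 1
    a |= xlo              # ora xlo (low three bits of xlo must be 0)
    for _ in range(3):    # three ROR steps pull the high bits back in
        a, carry = (a >> 1) | (carry << 7), a & 1
    return a


x = 319                                   # rightmost pixel column
print(div8_11bit(x >> 8, x & 0xF8))       # 39, the last char column
```

The caller clears the low three bits of the lowbyte here with `& 0xF8`; in the assembly version this has to be guaranteed before the ora.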
 ====== Jumpcode ======
  
Line 832: Line 907:
                  
 In the same way this method can also be used to set bits (e.g. with adc #$81) or to toggle bits.
 +
 +When masking out bits, SAX or SBX is often a good choice.
 + 
 +<code>
 +       lax value
 +       and #%11110000
 +       sta highnibble
 +</code>
 +
 +After this we need to restore the value from X to mask the lower bits. That is better than another lda value, but still an extra step.
 +
 +<code>      
 +       lda value
 +       ldx #%11110000
 +       sax highnibble
 +</code>
 +
 +This already looks better: the original value is still in A and we can do another mask operation.
 +
 +<code>
 +       lax value
 +       eor #%00001111
 +       sax highnibble       
 +</code>
 +
 +This looks even better: we can reuse X, and A still contains the original bits, with the low nibble inverted. This opens up more options for reusing the original value from more than one register, which gives potential for further savings.
 +This was spotted in Krill's loader when doing lookups on the GCR tables, so thanks to Krill here :-)
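Why the eor variant leaves only the high nibble can be seen from a short Python model: wherever A was inverted, A AND X is zero, and everywhere else it is the original bit. The helper names are illustrative only, and SAX is assumed to store A AND X without touching either register.

```python
def sax(a, x):
    """SAX stores A AND X and leaves both registers unchanged."""
    return a & x


def high_nibble_via_eor(value):
    """lax value / eor #%00001111 / sax highnibble"""
    a = x = value            # lax loads A and X with the same byte
    a ^= 0x0F                # flip the low nibble in A only
    return sax(a, x), a, x   # stored byte, A and X afterwards


stored, a, x = high_nibble_via_eor(0xA7)
print(hex(stored), hex(a), hex(x))   # 0xa0 0xa8 0xa7
```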
 ====== Illegal opcodes ======
  
Line 874: Line 976:
 Actually you can also use LAX with an immediate value, but it behaves a bit unstably depending on the given immediate value. However, when simply doing an LAX #$00 you are fine.
  
 +
 +lda $xxxx,y is not available as an 8 bit version, so an lda $xx,y is not possible. With lax $xx,y there is however a way to imitate an lda $xx,y at the cost of destroying X.
 ===== SAX/SHA =====
  
Line 903: Line 1007:
  
 As you see, this wastes just one byte more than an unrolled loop as in the upcoming example, but saves 2 cycles on every second byte written.
 +
 +This trick also helps when you need to switch 8 sprite pointers in a line. Usually one could just set up 2 different sets of pointers at two different screens and switch all 8 sprite pointers via $d018. But this is not applicable if your effect renders stuff into the screen, or if you are even double buffering with screens. Here you have to fall back to writing 8 new sprite pointers in less than 44 cycles (63-19), and also cope with possible jitter that is added. Preloading registers will then only help if you have a stable enough irq position, e.g. achieved by a double irq. Here this fast writing of 8 values helps.
 +
 +The only thing to take care of is that #sprites is an even number (for odd numbers the sax and sta statements need to be swapped and Y should be used for writing the last value). Now we are able to write 8 sprite pointers in 38 cycles.
  
 <code>
Line 925: Line 1033:
 A y-indexed version of //SAX// exists in the illegal opcode //SHA//. However, it also applies the highbyte+1 of the used address as a mask to the value written. So in most cases you are restricted to certain destination addresses.
  
-This trick also helps when you need to switch 8 sprite pointers in a line. Usually one could just set up 2 different pointers at two different screen and switch 8 sprite pointers via $d018. But this is not applicable if your effect renders stuff into the screen or if you are doing even double buffering with srceens. Here you have to fall back to writing 8 new sprite pointers in less then 44 cycles, but then also cope with possible jitter that is added. Preloading registers will then only help if you have a stable enough irq position, for e.g. achieved by a double irq. 
  
-However the following can work without any additional means: 
- 
-<code> 
-        clc 
-        lda #sprites+0+1 
-        ldx #sprites+0+$fe 
- 
-        sax screen  + $3f8 + 0 
-        sta screen  + $3f8 + 1 
-        adc #$02 
-        sax screen  + $3f8 + 2 
-        sta screen  + $3f8 + 3 
-        adc #$02 
-        sax screen  + $3f8 + 4 
-        sta screen  + $3f8 + 5 
-        adc #$02 
-        sax screen  + $3f8 + 6 
-        sta screen  + $3f8 + 7 
-</code> 
- 
-The only thing to take care is, that sprites is an even number. Now we are able to write 8 sprite pointers in 38 cycles. 
 ===== SHX/SHY =====
  
-When storing to zeropage you can also store the y- and x-register with an index in a fast and comfortable way. But often you will need the zeropage for other things. Sadly the instruction set of the 6510 is not orthogonal and thus this features are not available for 16 bit addresses. You can however workaround that nuisance by using SHX or SHY, but have to cope with the H component in it, as the stored values are anded with the highbyte of the destination address + 1. So most of the time you might want to store to $fexx to not run into any problems. In case you have to apply an additional static mask, or if you just need certain bits of teh stored values, you can of course choose a different address.
+When storing to zeropage you can also store the y- and x-register with an index in a fast and comfortable way. But often you will need the zeropage for other things. Sadly the instruction set of the 6510 is not orthogonal and thus these features are not available for 16 bit addresses. You can however work around that nuisance by using SHX or SHY, but have to cope with the H component in it, as the stored values are anded with the highbyte of the destination address + 1. So most of the time you might want to store to $fexx to not run into any problems. In case you have to apply an additional static mask, or if you just need certain bits of the stored values, you can of course choose a different address. If you start crossing a page with the index, the behaviour of this opcode changes radically. In those cases the Y-value becomes the highbyte of the address the value is stored at.
  
 Want an example?
Line 972: Line 1058:
  
 <code>
-        and #$ff
+        and #$fe
         lsr
-        clc
 </code>
 ===== ARR =====
Line 1170: Line 1255:
 </code>
  
-Another good use can be made if you want to do a dec ($xx),y what is actually not available. So here dcp ($xx),y will help you out, as it is also available for the indirect y adressing mode.
+Another good use can be made if you want to do an inc/dec ($xx),y, which is actually not available. So here isc/dcp ($xx),y will help you out, as they are also available for the indirect y addressing mode.
  
 +e.g.:
 +
 +<code>
 +ldy #..
 +lda (zp),y
 +clc
 +adc #..
 +sta (zp),y
 +bcc +
 +iny
 +isc (zp),y
 ++
 +</code>
 +
 +or
 +
 +<code>
 +ldy #..
 +lda (zp),y
 +sec
 +sbc #..
 +sta (zp),y
 +bcs +
 +iny
 +dcp (zp),y
 ++
 +</code>
 For decrementing a 16 bit pointer it is also of good use:
  
Line 1363: Line 1475:
 So always try to rearrange the term into something new and see if it performs better this way. Just remember the simple mathematical laws.
  
 +Now also think of that classical negation term:
 +
 +<code>
 +          lda num
 +          eor #$ff
 +          clc
 +          adc #$01
 +          sta neg
 +</code>
 +
 +Depending on what you have in register A, you can express it in many different ways:
 +
 +<code>
 +          ;a = $ff; carry set
 +          eor num
 +          adc #$00
 +          sta neg
 +          
 +          ;a = $00; carry set;
 +          sbc num
 +          sta neg
 +          
 +          ;a = $ff; carry clear
 +          adc num
 +          eor #$ff
 +          sta neg
 +          
 +          ;a = $00; carry clear;
 +          adc #$01
 +          sbc num
 +          sta neg
 +          
 +          ;num in a, carry set
 +          lda num
 +          sbc #$01
 +          eor #$ff
 +</code>
 +
 +There are of course other expressions possible, just ponder the term for a while. The carry flag after the negation can also be influenced, in most cases depending on whether you use sbc or adc ($00/$ff will cause an overflow).
 +
 +How about forming terms with logical operations? We notice that e.g. (a + b) xor $ff is the same as (a xor $ff) - b:
 +
 +<code>
 +          lda num1
 +          clc
 +          adc num2
 +          eor #$ff
 +
 +          ;can also be written as
 +          lda num1
 +          eor #$ff
 +          sec
 +          sbc num2
 +</code>
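The negation variants and the xor identity can be checked exhaustively with a small Python model of binary-mode ADC and SBC. This is a sketch: only A and the carry are modelled, decimal mode is ignored, and all function names are made up for illustration.

```python
def adc(a, m, c):
    """Binary-mode ADC: returns (result, carry out)."""
    s = a + m + c
    return s & 0xFF, s >> 8


def sbc(a, m, c):
    """Binary-mode SBC is ADC with the operand inverted."""
    return adc(a, m ^ 0xFF, c)


def neg_classic(num):
    """lda num / eor #$ff / clc / adc #$01"""
    return adc(num ^ 0xFF, 0x01, 0)[0]


def neg_variants(num):
    """The first four register/carry preconditions from the listing, in order."""
    v1 = adc(0xFF ^ num, 0x00, 1)[0]    # a=$ff, C=1: eor num / adc #$00
    v2 = sbc(0x00, num, 1)[0]           # a=$00, C=1: sbc num
    v3 = adc(0xFF, num, 0)[0] ^ 0xFF    # a=$ff, C=0: adc num / eor #$ff
    a, c = adc(0x00, 0x01, 0)           # a=$00, C=0: adc #$01 ...
    v4 = sbc(a, num, c)[0]              # ... then sbc num with carry still clear
    return v1, v2, v3, v4


def not_of_sum(a, b):
    """lda num1 / clc / adc num2 / eor #$ff"""
    return adc(a, b, 0)[0] ^ 0xFF


def not_minus(a, b):
    """lda num1 / eor #$ff / sec / sbc num2"""
    return sbc(a ^ 0xFF, b, 1)[0]


print(hex(neg_classic(0x05)), [hex(v) for v in neg_variants(0x05)])
```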
 ====== Running out of registers ======
  
Line 1378: Line 1544:
         tsx ;fetch value from table again
 </code>
 +
 +====== Limiting and masking ======
 +
 +Sometimes we want to extract the low nibble of a value and limit it to a given range.
 +
 +<code>
 +        bpl .positive
 +        cmp #$f0
 +        bcs +
 +        lda #$f0
 ++
 +        and #$0f
 +</code>
 +
 +As you can see, we limit the value to $f0..$ff first and then clamp off the highnibble to end up with values that range from $00..$0f.
 +
 +Observe how this can be done cheaper, by just shifting the range and making use of the 8 bit wrap-around and the carry:
 +
 +<code>
 +        bpl .positive
 +        ;clc
 +        adc #$10
 +        bcs +
 +        lda #$00
 ++
 +</code>
 +
 +We add $10, so whether the limit was reached can be read from the carry. As we have now wrapped the 8 bits by overflowing, the upper bits are already zero and we can forgo the and #$0f. The lownibble is not affected, as adding $10 only changes the upper 4 bits.
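Both versions can be compared for every negative input with a short Python model. It is a sketch with hypothetical function names, covering only the path after the bpl (A in $80..$ff) and assuming the carry is clear before the adc, as the commented-out clc suggests.

```python
def limit_with_and(a):
    """cmp #$f0 / bcs + / lda #$f0 / + and #$0f, for negative A ($80..$ff)."""
    if a < 0xF0:
        a = 0xF0
    return a & 0x0F


def limit_with_adc(a):
    """adc #$10 (carry clear) / bcs + / lda #$00, for negative A ($80..$ff)."""
    total = a + 0x10
    if total > 0xFF:          # carry set: the value was in $f0..$ff
        return total & 0xFF   # the wrap-around already cleared the high nibble
    return 0x00


for a in (0x80, 0xEF, 0xF0, 0xF7, 0xFF):
    print(hex(a), hex(limit_with_and(a)), hex(limit_with_adc(a)))
```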
  
 ====== Misc stuff ======
Line 1447: Line 1641:
  
 <code>
-        lda bmp
+        lda bmp       ;could also use lax bmp, sbx #$08, stx bmp to save more cycles
         sec
         sbc #$08
Line 1520: Line 1714:
 **HAPPY OPTIMIZING!**
  
-Bitbreaker/Oxyron^Nuance
+Bitbreaker/Performers^Nuance
base/advanced_optimizing.1453285012.txt.gz · Last modified: 2016-01-20 11:16 by bitbreaker