base:fast_sqrt
no way to compare when less than two revisions
Differences
This shows you the differences between two versions of the page.
Next revision | |||
— | base:fast_sqrt [2015-04-17 04:31] – external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== Square Root calculation ====== | ||
+ | |||
+ | Imagine you wanna have a square root: | ||
+ | |||
+ | < | ||
+ | R = sqrt(N) | ||
+ | </ | ||
+ | |||
+ | That obviously gives you: | ||
+ | |||
+ | < | ||
+ | R^2 = N | ||
+ | </ | ||
+ | |||
+ | Since we wanna establish a convergent algo, we simply use this during calculation: | ||
+ | |||
+ | < | ||
+ | R(n)^2 <= N | ||
+ | </ | ||
+ | |||
+ | R(0) will be 0. To get closer to N we add a third value D to R as long as that formula is true. Since we work with binary computers, this D will be of the kind 2^x (128, 64, 32 etc). | ||
+ | |||
+ | Ok assuming we wanna have the root from a 16 bit number and since the output of a sqrt of max 65535 (highest 16 bit input) is maximum 255.998 we start with a D of 128 and LSR it down to 1. On each iteration we try (R+D)^2 <= N and if that formula is true, we have R=R+D. If not, R stays unmodified. The resulting algo looks like this: | ||
+ | |||
+ | < | ||
+ | R = 0 | ||
+ | D = 128 | ||
+ | while (D >= 1) | ||
+ | { | ||
+ | temp = R+D | ||
+ | if (temp*temp <= N) then R=temp | ||
+ | D = D/2 | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | As you see this algo is very simple and apart from the temp*temp it does not involve any time consuming operation. The D/2 is ofcourse just a single LSR. | ||
+ | |||
+ | Ok but we are not satisfied. One single integer multiplication is too much for our little 6510 so we have to get rid of it. | ||
+ | |||
+ | Ok now we are not satisfied with the mul but how to get rid of it? First you have to see that (R+D)^2 is a simple binomical formula so you can do this: | ||
+ | |||
+ | < | ||
+ | (R+D)^2 = (R+D)*(R+D) = R*R + 2*R*D + D*D | ||
+ | </ | ||
+ | |||
+ | Still this doesn' | ||
+ | |||
+ | < | ||
+ | R*R + 2*R*D + D*D = R*R + D*(2*R + D) | ||
+ | </ | ||
+ | |||
+ | |||
+ | Ok what do we have now? 2*R is easy since it's just a simple ASL in assembler. A multiplication with D is also easy since D is always of the kind 2^x so D*something is equal to ASL something several times. But damnit, there' | ||
+ | |||
+ | < | ||
+ | M(n) <= N - R(n)^2 | ||
+ | </ | ||
+ | |||
+ | |||
+ | And since R(0) = 0: | ||
+ | |||
+ | < | ||
+ | M(0) = N | ||
+ | </ | ||
+ | |||
+ | |||
+ | As you see M(n) is just the difference from N and " | ||
+ | |||
+ | < | ||
+ | N(n) <= (R(n)+D(n))^2 | ||
+ | </ | ||
+ | |||
+ | |||
+ | Given the easy formula transformation from above we have: | ||
+ | |||
+ | < | ||
+ | N(n) <= R(n)^2 + D(n)*(2*R(n)+D(n)) | ||
+ | </ | ||
+ | |||
+ | |||
+ | Just substract R(n)^2 from both sides and get: | ||
+ | |||
+ | < | ||
+ | N(n) - R(n)^2 <= D(n)*(2*R(n)+D(n)) | ||
+ | </ | ||
+ | |||
+ | |||
+ | Now look a few lines above at the M(n) formula. Do you notice something? Exactly, there we have the front part of this formula too so it is proven that: | ||
+ | |||
+ | < | ||
+ | M(n) <= D(n)*(2*R(n)+D(n)) | ||
+ | </ | ||
+ | |||
+ | |||
+ | Time for a party, that R*R is gone! But how do we calculate those M(n)? Quite simple actually: | ||
+ | |||
+ | < | ||
+ | M(n+1) = M(n) - D(n)*(2*R(n)+D(n)) | ||
+ | </ | ||
+ | |||
+ | So we got a completely new algo: | ||
+ | |||
+ | < | ||
+ | M = N | ||
+ | D = 128 | ||
+ | while (D >= 1) | ||
+ | { | ||
+ | T = D*(2*R+D) | ||
+ | if (T <= M) then M=M-T : R = R+D | ||
+ | D = D/2 | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | Yeeehaw! R*R is gone, and 2*R is easy (ASL) and D*whatever is easy too (ASL multiple times). To make this clear I change the muls into ASL's: | ||
+ | |||
+ | < | ||
+ | M = N | ||
+ | D = 128 | ||
+ | for n = 7 to 0 | ||
+ | { | ||
+ | T = (R+R+D) ASL n | ||
+ | if (T <= M) then M=M-T : R = R+D | ||
+ | D = D LSR 1 | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | Wicked! BUT: We are not satisfied because of n-times ASL :) | ||
+ | |||
+ | Ok I stopped at the point where the multiplication was gone but a ASL with varying shift count appeared. That is no problem for big CPU's like 680x0 or x86, but our poor 6510 doesn' | ||
+ | |||
+ | To do that, we don't investigate the math any further but take a look at how the variables behave during iteration of this algo: | ||
+ | |||
+ | < | ||
+ | R = 0 | ||
+ | M = N | ||
+ | D = 128 | ||
+ | for n = 7 to 0 | ||
+ | { | ||
+ | T = (R+R+D) ASL n | ||
+ | if (T <= M) then M=M-T : R = R+D | ||
+ | D = D LSR 1 | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | D, T, M and R now look like this: | ||
+ | |||
+ | < | ||
+ | D T M R | ||
+ | 10000000 0100000000000000 xxxxxxxxxxxxxxxx x0000000 | ||
+ | 01000000 0x01000000000000 0xxxxxxxxxxxxxxx xx000000 | ||
+ | 00100000 00xx010000000000 00xxxxxxxxxxxxxx xxx00000 | ||
+ | 00010000 000xxx0100000000 000xxxxxxxxxxxxx xxxx0000 | ||
+ | 00001000 0000xxxx01000000 0000xxxxxxxxxxxx xxxxx000 | ||
+ | 00000100 00000xxxxx010000 00000xxxxxxxxxxx xxxxxx00 | ||
+ | 00000010 000000xxxxxx0100 000000xxxxxxxxxx xxxxxxx0 | ||
+ | 00000001 0000000xxxxxxx01 0000000xxxxxxxxx xxxxxxxx | ||
+ | </ | ||
+ | |||
+ | |||
+ | Removing the "ASL n" would completely change the behaviour of T. But also M has to change because otherwise the T <= M compare and M-T math will not work. So if T is not left-shifted, | ||
+ | |||
+ | < | ||
+ | D T M R | ||
+ | 10000000 0100000000000000 xxxxxxxxxxxxxxxx x0000000 | ||
+ | 01000000 x010000000000000 xxxxxxxxxxxxxxx0 xx000000 | ||
+ | 00100000 xx01000000000000 xxxxxxxxxxxxxx00 xxx00000 | ||
+ | 00010000 xxx0100000000000 xxxxxxxxxxxxx000 xxxx0000 | ||
+ | 00001000 xxxx010000000000 xxxxxxxxxxxx0000 xxxxx000 | ||
+ | 00000100 xxxxx01000000000 xxxxxxxxxxx00000 xxxxxx00 | ||
+ | 00000010 xxxxxx0100000000 xxxxxxxxxx000000 xxxxxxx0 | ||
+ | 00000001 xxxxxxx010000000 xxxxxxxxx0000000 xxxxxxxx | ||
+ | </ | ||
+ | |||
+ | So now instead of shifting T we shift M, but the advantage of shifting M is that it's only 1 bit per iteration. Look at the code now: | ||
+ | |||
+ | < | ||
+ | R = 0 | ||
+ | M = N | ||
+ | D = 128 | ||
+ | for n = 7 to 0 | ||
+ | { | ||
+ | T = (R+R+D) ASL 7 | ||
+ | if (T <= M) then M=M-T : R = R+D | ||
+ | M = M ASL 1 | ||
+ | D = D LSR 1 | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | Much much better since ASL 7 can be replaced by a single LSR in asm code later. This code is quite useful on 6510 already, BUT ofcourse this is not everything. | ||
+ | |||
+ | Taking a slightly modified version of the last pseudo code: | ||
+ | |||
+ | < | ||
+ | R = 0 | ||
+ | M = N | ||
+ | D = 128 | ||
+ | for n = 7 to 0 | ||
+ | { | ||
+ | T = (R ASL 8) OR (D ASL 7) | ||
+ | if (T <= M) then M=M-T : R = R OR D | ||
+ | M = M ASL 1 | ||
+ | D = D LSR 1 | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | T can now be calculated quite easily. The high byte is simply R OR <some stuff> and the low byte is always 0 except for the last iteration where it is 128. The shifting of D is done via a table and voila, here is some 6510 code: | ||
+ | |||
+ | < | ||
+ | LDY #$00 ; R = 0 | ||
+ | LDX #$07 | ||
+ | .loop | ||
+ | TYA | ||
+ | ORA stab-1,X | ||
+ | STA THI ; (R ASL 8) | (D ASL 7) | ||
+ | LDA MHI | ||
+ | CMP THI | ||
+ | BCC .skip1 | ||
+ | SBC THI | ||
+ | STA MHI ; M = M - T | ||
+ | TYA | ||
+ | ORA stab,x | ||
+ | TAY ; R = R OR D | ||
+ | .skip1 | ||
+ | ASL MLO | ||
+ | ROL MHI ; M = M ASL 1 | ||
+ | DEX | ||
+ | BNE .loop | ||
+ | |||
+ | ; last iteration | ||
+ | |||
+ | STY THI | ||
+ | LDA MLO | ||
+ | CMP #$80 | ||
+ | LDA MHI | ||
+ | SBC THI | ||
+ | BCC .skip2 | ||
+ | INY ; R = R OR D (D is 1 here) | ||
+ | .skip2 | ||
+ | RTS | ||
+ | stab: .BYTE $01, | ||
+ | </ | ||
+ | |||
+ | I hope I didn't make any mistakes in that routine :) | ||
+ | |||
+ | Input is MLO/MHI for N and output is Y-register for int(sqrt(N)). | ||
+ | |||
+ | Ok guys, I finally tested the routine I did up there. Like said before, it only allows 14 bit input ($0000 to $41FF to be more accurate). The reason for this is that M needs one more bit. Ok, some people might want full 16 bit so here is a fixed routine which only has 3 opcodes more: | ||
+ | |||
+ | < | ||
+ | LDY #$00 ; R = 0 | ||
+ | LDX #$07 | ||
+ | CLC ; clear bit 16 of M | ||
+ | .loop | ||
+ | TYA | ||
+ | ORA stab-1,X | ||
+ | STA THI ; (R ASL 8) | (D ASL 7) | ||
+ | LDA MHI | ||
+ | BCS .skip0 | ||
+ | CMP THI | ||
+ | BCC .skip1 | ||
+ | .skip0 | ||
+ | SBC THI | ||
+ | STA MHI ; M = M - T | ||
+ | TYA | ||
+ | ORA stab,x | ||
+ | TAY ; R = R OR D | ||
+ | .skip1 | ||
+ | ASL MLO | ||
+ | ROL MHI ; M = M ASL 1 | ||
+ | DEX | ||
+ | BNE .loop | ||
+ | |||
+ | ; last iteration | ||
+ | |||
+ | BCS .skip2 | ||
+ | STY THI | ||
+ | LDA MLO | ||
+ | CMP #$80 | ||
+ | LDA MHI | ||
+ | SBC THI | ||
+ | BCC .skip3 | ||
+ | .skip2 | ||
+ | INY ; R = R OR D (D is 1 here) | ||
+ | .skip3 | ||
+ | RTS | ||
+ | stab: .BYTE $01, | ||
+ | </ | ||
+ | |||
+ | This routine works perfectly for all values from $0000 to $FFFF. | ||
+ | |||
+ | Here's an extended incarnation of the integer sqrt routine. This one is extended so it does proper rounding. It is achieved by adding another iteration and calculate bit -1 which then can be used to determine if the fractional part is >= .5 or not. | ||
+ | |||
+ | The disadvantage of rounding is that the output has to be 9 bits now, because inputs $FF01 to $FFFF result in 256 now (and not 255). Alternatively you also might modify this routine and simply use bit -1 for further calculations. This gives you more accuracy than rounding. | ||
+ | |||
+ | < | ||
+ | LDY #$00 ; R = 0 | ||
+ | LDX #$07 | ||
+ | CLC ; clear bit 16 of M | ||
+ | .loop | ||
+ | TYA | ||
+ | ORA stab-1,X | ||
+ | STA THI ; (R ASL 8) | (D ASL 7) | ||
+ | LDA MHI | ||
+ | BCS .skip0 | ||
+ | CMP THI | ||
+ | BCC .skip1 | ||
+ | .skip0 | ||
+ | SBC THI | ||
+ | STA MHI ; M = M - T | ||
+ | TYA | ||
+ | ORA stab,x | ||
+ | TAY ; R = R OR D | ||
+ | .skip1 | ||
+ | ASL MLO | ||
+ | ROL MHI ; M = M ASL 1 | ||
+ | DEX | ||
+ | BNE .loop | ||
+ | |||
+ | ; bit 0 iteration | ||
+ | |||
+ | STY THI | ||
+ | LDX MLO | ||
+ | LDA MHI | ||
+ | BCS .skip2 | ||
+ | CPX #$80 | ||
+ | SBC THI | ||
+ | BCC .skip3 | ||
+ | .skip2 | ||
+ | TXA | ||
+ | SBC #$80 | ||
+ | TAX | ||
+ | LDA MHI | ||
+ | SBC THI | ||
+ | STA MHI | ||
+ | INY ; R = R OR D (D is 1 here) | ||
+ | .skip3 | ||
+ | CPX #$80 | ||
+ | ROL MHI | ||
+ | |||
+ | ; bit -1 iteration and rounding ( + 0.5) | ||
+ | |||
+ | LDX #$00 | ||
+ | BCS .skip4 | ||
+ | CPY MHI | ||
+ | BCS .skip5 | ||
+ | .skip4 | ||
+ | INY ; R = R + 0.5 | ||
+ | BNE .skip5 | ||
+ | INX | ||
+ | .skip5 | ||
+ | ; R in X and Y (Y is low-byte) | ||
+ | |||
+ | RTS | ||
+ | stab: .BYTE $01, | ||
+ | </ | ||
+ | |||
+ | |||
+ | If you don't wanna have the rounding and prefer having bit -1, then use this final iteration: | ||
+ | |||
+ | < | ||
+ | ; bit -1 iteration | ||
+ | |||
+ | BCS .skip4 | ||
+ | LDX #$00 | ||
+ | CPY MHI | ||
+ | BCS .skip5 | ||
+ | .skip4 | ||
+ | LDX #$80 | ||
+ | .skip5 | ||
+ | ; R in Y, X contains bit -1 | ||
+ | |||
+ | RTS | ||
+ | </ | ||
+ | |||
+ | |||
+ | the next one is from: http:// | ||
+ | |||
+ | Returns the 8-bit square root in $20 of the | ||
+ | 16-bit number in $20 (low) and $21 (high). The | ||
+ | remainder is in location $21. | ||
+ | |||
+ | < | ||
+ | sqrt16 | ||
+ | LDY #$01 ; lsby of first odd number = 1 | ||
+ | STY $22 | ||
+ | DEY | ||
+ | STY $23 ; msby of first odd number (sqrt = 0) | ||
+ | again | ||
+ | SEC | ||
+ | LDA $20 ; save remainder in X register | ||
+ | TAX ; subtract odd lo from integer lo | ||
+ | SBC $22 | ||
+ | STA $20 | ||
+ | LDA $21 ; subtract odd hi from integer hi | ||
+ | SBC $23 | ||
+ | STA $21 ; is subtract result negative? | ||
+ | BCC nomore ; no. increment square root | ||
+ | INY | ||
+ | LDA $22 ; calculate next odd number | ||
+ | ADC #$01 | ||
+ | STA $22 | ||
+ | BCC again | ||
+ | INC $23 | ||
+ | JMP again | ||
+ | nomore | ||
+ | STY $20 ; all done, store square root | ||
+ | STX $21 ; and remainder | ||
+ | RTS | ||
+ | </ | ||
base/fast_sqrt.txt · Last modified: 2019-08-18 20:28 by verz