base:fast_sqrt

# Differences

This shows you the differences between two versions of the page.

 — base:fast_sqrt [2015-04-17 04:31] (current) Line 1: Line 1: + ====== Square Root calculation ====== + + Imagine you wanna have a square root: + + <​code>​ + R = sqrt(N) + ​ + + That obviously gives you: + + <​code>​ + R^2 = N + ​ + + Since we wanna establish a convergent algo, we simply use this during calculation:​ + + <​code>​ + R(n)^2 <= N + ​ + + R(0) will be 0. To get closer to N we add a third value D to R as long as that formula is true. Since we work with binary computers, this D will be of the kind 2^x (128, 64, 32 etc). + + Ok assuming we wanna have the root from a 16 bit number and since the output of a sqrt of max 65535 (highest 16 bit input) is maximum 255.998 we start with a D of 128 and LSR it down to 1. On each iteration we try (R+D)^2 <= N and if that formula is true, we have R=R+D. If not, R stays unmodified. The resulting algo looks like this: + + <​code>​ + R = 0 + D = 128 + while (D >= 1) + { + temp = R+D + if (temp*temp <= N) then R=temp + D = D/2 + } + ​ + + As you see this algo is very simple and apart from the temp*temp it does not involve any time consuming operation. The D/2 is ofcourse just a single LSR. + + Ok but we are not satisfied. One single integer multiplication is too much for our little 6510 so we have to get rid of it. + + Ok now we are not satisfied with the mul but how to get rid of it? First you have to see that (R+D)^2 is a simple binomical formula so you can do this: + + <​code>​ + (R+D)^2 = (R+D)*(R+D) = R*R + 2*R*D + D*D + ​ + + Still this doesn'​t look THAT promising but let's try more formula changing: + + <​code>​ + R*R + 2*R*D + D*D = R*R + D*(2*R + D) + ​ + + + Ok what do we have now? 2*R is easy since it's just a simple ASL in assembler. A multiplication with D is also easy since D is always of the kind 2^x so D*something is equal to ASL something several times. But damnit, there'​s still that R*R biatch, let's send her to hell by introducing another variable called M: + + <​code>​ + M(n) <= N - R(n)^2 + ​ + + + And since R(0) = 0: + + <​code>​ + M(0) = N + ​ + + + As you see M(n) is just the difference from N and "​current N during algo" which gives us a new way to decide if D should be added to R or not. The only thing we need now is a way to calculate M(n) without calculating R(n)^2. This is a bit messy but it works with these given formulas: + + <​code>​ + N(n) <= (R(n)+D(n))^2 + ​ + + + Given the easy formula transformation from above we have: + + <​code>​ + N(n) <= R(n)^2 + D(n)*(2*R(n)+D(n)) + ​ + + + Just substract R(n)^2 from both sides and get: + + <​code>​ + N(n) - R(n)^2 <= D(n)*(2*R(n)+D(n)) + ​ + + + Now look a few lines above at the M(n) formula. Do you notice something? Exactly, there we have the front part of this formula too so it is proven that: + + <​code>​ + M(n) <= D(n)*(2*R(n)+D(n)) + ​ + + + Time for a party, that R*R is gone! But how do we calculate those M(n)? Quite simple actually: + + <​code>​ + M(n+1) = M(n) - D(n)*(2*R(n)+D(n)) + ​ + + So we got a completely new algo: + + <​code>​ + M = N + D = 128 + while (D >= 1) + { + T = D*(2*R+D) + if (T <= M) then M=M-T : R = R+D + D = D/2 + } + ​ + + Yeeehaw! R*R is gone, and 2*R is easy (ASL) and D*whatever is easy too (ASL multiple times). To make this clear I change the muls into ASL's: + + <​code>​ + M = N + D = 128 + for n = 7 to 0 + { + T = (R+R+D) ASL n + if (T <= M) then M=M-T : R = R+D + D = D LSR 1 + } + ​ + + Wicked! BUT: We are not satisfied because of n-times ASL :) + + Ok I stopped at the point where the multiplication was gone but a ASL with varying shift count appeared. That is no problem for big CPU's like 680x0 or x86, but our poor 6510 doesn'​t like this so that "ASL n" has to leave. + + To do that, we don't investigate the math any further but take a look at how the variables behave during iteration of this algo: + + <​code>​ + R = 0 + M = N + D = 128 + for n = 7 to 0 + { + T = (R+R+D) ASL n + if (T <= M) then M=M-T : R = R+D + D = D LSR 1 + } + ​ + + D, T, M and R now look like this: + + <​code>​ + D        T                M                R + 10000000 0100000000000000 xxxxxxxxxxxxxxxx x0000000 + 01000000 0x01000000000000 0xxxxxxxxxxxxxxx xx000000 + 00100000 00xx010000000000 00xxxxxxxxxxxxxx xxx00000 + 00010000 000xxx0100000000 000xxxxxxxxxxxxx xxxx0000 + 00001000 0000xxxx01000000 0000xxxxxxxxxxxx xxxxx000 + 00000100 00000xxxxx010000 00000xxxxxxxxxxx xxxxxx00 + 00000010 000000xxxxxx0100 000000xxxxxxxxxx xxxxxxx0 + 00000001 0000000xxxxxxx01 0000000xxxxxxxxx xxxxxxxx + ​ + + + Removing the "ASL n" would completely change the behaviour of T. But also M has to change because otherwise the T <= M compare and M-T math will not work. So if T is not left-shifted,​ M has to be right-shifted. The result looks like this: + + <​code>​ + D        T                M                R + 10000000 0100000000000000 xxxxxxxxxxxxxxxx x0000000 + 01000000 x010000000000000 xxxxxxxxxxxxxxx0 xx000000 + 00100000 xx01000000000000 xxxxxxxxxxxxxx00 xxx00000 + 00010000 xxx0100000000000 xxxxxxxxxxxxx000 xxxx0000 + 00001000 xxxx010000000000 xxxxxxxxxxxx0000 xxxxx000 + 00000100 xxxxx01000000000 xxxxxxxxxxx00000 xxxxxx00 + 00000010 xxxxxx0100000000 xxxxxxxxxx000000 xxxxxxx0 + 00000001 xxxxxxx010000000 xxxxxxxxx0000000 xxxxxxxx + ​ + + So now instead of shifting T we shift M, but the advantage of shifting M is that it's only 1 bit per iteration. Look at the code now: + + <​code>​ + R = 0 + M = N + D = 128 + for n = 7 to 0 + { + T = (R+R+D) ASL 7 + if (T <= M) then M=M-T : R = R+D + M = M ASL 1 + D = D LSR 1 + } + ​ + + Much much better since ASL 7 can be replaced by a single LSR in asm code later. This code is quite useful on 6510 already, BUT ofcourse this is not everything. ​ + + Taking a slightly modified version of the last pseudo code: + + <​code>​ + R = 0 + M = N + D = 128 + for n = 7 to 0 + { + T = (R ASL 8) OR (D ASL 7) + if (T <= M) then M=M-T : R = R OR D + M = M ASL 1 + D = D LSR 1 + } + ​ + + T can now be calculated quite easily. The high byte is simply R OR and the low byte is always 0 except for the last iteration where it is 128. The shifting of D is done via a table and voila, here is some 6510 code: + + <​code>​ + LDY #\$00    ; R = 0 + LDX #\$07 + .loop + TYA + ORA stab-1,X + STA THI     ; (R ASL 8) | (D ASL 7) + LDA MHI + CMP THI + BCC .skip1 ​ ; T <= M + SBC THI + STA MHI     ; M = M - T + TYA + ORA stab,x + TAY         ; R = R OR D + .skip1 + ASL MLO + ROL MHI     ; M = M ASL 1 + DEX + BNE .loop + + ; last iteration + + STY THI + LDA MLO + CMP #\$80 + LDA MHI + SBC THI + BCC .skip2 + INY         ; R = R OR D (D is 1 here) + .skip2 + RTS + stab:   .BYTE \$01,​\$02,​\$04,​\$08,​\$10,​\$20,​\$40,​\$80 + ​ + + I hope I didn't make any mistakes in that routine :) + + Input is MLO/MHI for N and output is Y-register for int(sqrt(N)). + + Ok guys, I finally tested the routine I did up there. Like said before, it only allows 14 bit input (\$0000 to \$41FF to be more accurate). The reason for this is that M needs one more bit. Ok, some people might want full 16 bit so here is a fixed routine which only has 3 opcodes more: + + <​code>​ + LDY #\$00    ; R = 0 + LDX #\$07 + CLC         ; clear bit 16 of M + .loop + TYA + ORA stab-1,X + STA THI     ; (R ASL 8) | (D ASL 7) + LDA MHI + BCS .skip0 ​ ; M >= 65536? then T <= M is always true + CMP THI + BCC .skip1 ​ ; T <= M + .skip0 + SBC THI + STA MHI     ; M = M - T + TYA + ORA stab,x + TAY         ; R = R OR D + .skip1 + ASL MLO + ROL MHI     ; M = M ASL 1 + DEX + BNE .loop + + ; last iteration + + BCS .skip2 + STY THI + LDA MLO + CMP #\$80 + LDA MHI + SBC THI + BCC .skip3 + .skip2 + INY         ; R = R OR D (D is 1 here) + .skip3 + RTS + stab:   .BYTE \$01,​\$02,​\$04,​\$08,​\$10,​\$20,​\$40,​\$80 + ​ + + This routine works perfectly for all values from \$0000 to \$FFFF. ​ + + Here's an extended incarnation of the integer sqrt routine. This one is extended so it does proper rounding. It is achieved by adding another iteration and calculate bit -1 which then can be used to determine if the fractional part is >= .5 or not. + + The disadvantage of rounding is that the output has to be 9 bits now, because inputs \$FF01 to \$FFFF result in 256 now (and not 255). Alternatively you also might modify this routine and simply use bit -1 for further calculations. This gives you more accuracy than rounding. + + <​code>​ + LDY #\$00    ; R = 0 + LDX #\$07 + CLC         ; clear bit 16 of M + .loop + TYA + ORA stab-1,X + STA THI     ; (R ASL 8) | (D ASL 7) + LDA MHI + BCS .skip0 ​ ; M >= 65536? then T <= M is always true + CMP THI + BCC .skip1 ​ ; T <= M + .skip0 + SBC THI + STA MHI     ; M = M - T + TYA + ORA stab,x + TAY         ; R = R OR D + .skip1 + ASL MLO + ROL MHI     ; M = M ASL 1 + DEX + BNE .loop + + ; bit 0 iteration + + STY THI + LDX MLO + LDA MHI + BCS .skip2 + CPX #\$80 + SBC THI + BCC .skip3 + .skip2 + TXA + SBC #\$80 + TAX + LDA MHI + SBC THI + STA MHI + INY         ; R = R OR D (D is 1 here) + .skip3 + CPX #\$80 + ROL MHI + + ; bit -1 iteration and rounding ( + 0.5) + + LDX #\$00 + BCS .skip4 + CPY MHI + BCS .skip5 + .skip4 + INY         ; R = R + 0.5 + BNE .skip5 + INX + .skip5 + ; R in X and Y (Y is low-byte) + + RTS + stab:   .BYTE \$01,​\$02,​\$04,​\$08,​\$10,​\$20,​\$40,​\$80 + ​ + + + If you don't wanna have the rounding and prefer having bit -1, then use this final iteration: + + <​code>​ + ; bit -1 iteration + + BCS .skip4 + LDX #\$00 + CPY MHI + BCS .skip5 + .skip4 + LDX #\$80 + .skip5 + ; R in Y, X contains bit -1 + + RTS + ​ + + + the next one is from: http://​www.geocities.com/​oneelkruns/​asm1step.html + + Returns the 8-bit square root in \$20 of the + 16-bit number in \$20 (low) and \$21 (high). The + remainder is in location \$21. + + <​code>​ + sqrt16 + LDY #\$01 ; lsby of first odd number = 1 + STY \$22 + DEY + STY \$23 ; msby of first odd number (sqrt = 0) + again + SEC + LDA \$20 ; save remainder in X register + TAX ; subtract odd lo from integer lo + SBC \$22 + STA \$20 + LDA \$21 ; subtract odd hi from integer hi + SBC \$23 + STA \$21 ; is subtract result negative? + BCC nomore ; no. increment square root + INY + LDA \$22 ; calculate next odd number + ADC #\$01 + STA \$22 + BCC again + INC \$23 + JMP again + nomore + STY \$20 ; all done, store square root + STX \$21 ; and remainder + RTS +