base:a_faster_radix_sort
no way to compare when less than two revisions
Differences
This shows you the differences between two versions of the page.
— | base:a_faster_radix_sort [2020-04-08 00:03] (current) – created lft | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== A Faster Radix Sort ====== | ||
+ | by lft | ||
+ | |||
+ | This article describes an implementation of //radix sort// optimized for sprite | ||
+ | multiplexers. The input is a set of actors numbered 0 to //N//-1, where each actor | ||
+ | has a Y-position in the range 0–223 stored in an array (called '' | ||
+ | zero-page. The routine will push the //N// actor numbers on the stack, in order of | ||
+ | increasing Y-coordinates. | ||
+ | |||
+ | For a real multiplexer, | ||
+ | |||
+ | * Push them in order of decreasing Y-coordinates, | ||
+ | |||
+ | * Not push them at all, but instead let the multiplexer traverse a linked list. | ||
+ | |||
+ | It is straightforward to modify the presented algorithm to meet the above | ||
+ | requirements, | ||
+ | understand if everything is kept in forward order during the sort. | ||
+ | |||
+ | The complete implementation is included at the end of the article, but the code | ||
+ | is probably difficult to follow unless you read the text first. | ||
+ | |||
+ | ===== Performance ===== | ||
+ | |||
+ | The total execution time for 32 actors is 1970 cycles, which is 61.6 cycles per | ||
+ | actor. Every additional actor adds 51 cycles to the execution time. The | ||
+ | execution time is predictable, | ||
+ | |||
+ | The size of the code is about 2 kB, and it needs 60 bytes of zero-page storage | ||
+ | in addition to the Y-positions. It is possible to reduce this to 32 bytes at | ||
+ | the cost of a few extra cycles; doing so is left as an exercise for the reader. | ||
+ | |||
+ | ===== Theory ===== | ||
+ | |||
+ | We will perform a two-pass radix sort based on hexadecimal digits. | ||
+ | |||
+ | In the first pass, we place each actor in one of sixteen bins, based on the low | ||
+ | nybble of their Y-position. We use linked lists to represent the bins, but the | ||
+ | order of elements in each list is unimportant at this stage. | ||
+ | |||
+ | In the second pass, we take the actors out of the bins, starting with bin 0, | ||
+ | then bin 1, etc., and place each actor at the end of one of fourteen lists, | ||
+ | based on the high nybble. In this way, we end up with fourteen sorted lists: | ||
+ | The first list contains every actor with a Y-position in the range $00–$0f, in | ||
+ | order. The second list has all the actors with positions $10–$1f, in order. The | ||
+ | fourteenth list has all the actors with positions $d0–$df, in order. | ||
+ | |||
+ | Finally, we traverse each of the fourteen lists in order, pushing the elements | ||
+ | on the stack as we go. This completes the radix sort. | ||
+ | |||
+ | ===== The linked lists ===== | ||
+ | |||
+ | A linked list is typically represented by a global variable, indicating the | ||
+ | first element of the list, and a next-pointer for each actor. End-of-list can | ||
+ | be indicated by an invalid actor number, such as $ff. | ||
+ | |||
+ | Now, a common way of adding an element to a linked list is the following (in | ||
+ | pseudo-code): | ||
+ | |||
+ | < | ||
+ | next[i] = first | ||
+ | first = i | ||
+ | </ | ||
+ | |||
+ | This inserts the new element at the beginning of the list. If //i// is already in | ||
+ | a register, this can be implemented with one load instruction and two store | ||
+ | instructions. | ||
+ | |||
+ | But we will do something else: We introduce a // | ||
+ | keep track of the address of the current end of the list. This variable might | ||
+ | point to the variable called //first//, or it might point to one of the cells | ||
+ | of the array called //next//. Now we can add an element to the end of the list like | ||
+ | this: | ||
+ | |||
+ | < | ||
+ | *tail = i | ||
+ | tail = & | ||
+ | </ | ||
+ | |||
+ | If we make sure that the //next// array and the //first// variable are | ||
+ | located on the same page in memory, then we only need to update the LSB of the | ||
+ | pointer. Furthermore, | ||
+ | page, the pointer to element //i// becomes equal to //i//. Thus: | ||
+ | |||
+ | < | ||
+ | *tail = i | ||
+ | tail.lsb = i | ||
+ | </ | ||
+ | |||
+ | If //i// is already in a register, this can be implemented with just two store | ||
+ | instructions—eliminating the need for a load. | ||
+ | |||
+ | In our algorithm, the actors are never located in more than one list at the | ||
+ | same time, so in principle we only need a single global array of next-pointers. But it turns out | ||
+ | that we can save a couple of cycles if we use two arrays, one for each pass of | ||
+ | the radix sort. | ||
+ | |||
+ | In the first pass, we need sixteen first-pointers and sixteen tail-pointers (one for each bin). In the | ||
+ | second pass, we need fourteen first-pointers and fourteen tail-pointers. | ||
+ | |||
+ | ===== Chaining lists together ===== | ||
+ | |||
+ | After the first stage, we have sixteen lists. In the second stage, we traverse | ||
+ | the elements of each of these lists in order. Observe that this is essentially | ||
+ | the same as if we first concatenate the sixteen lists into one, and then | ||
+ | traverse all the elements of that combined list. | ||
+ | |||
+ | Thanks to the tail-pointer representation, | ||
+ | |||
+ | < | ||
+ | *tail_A = first_B | ||
+ | </ | ||
+ | |||
+ | To concatenate several lists, we have to start at the end and move towards the | ||
+ | beginning of the final list: | ||
+ | |||
+ | < | ||
+ | *tail_C = first_D | ||
+ | *tail_B = first_C | ||
+ | *tail_A = first_B | ||
+ | </ | ||
+ | |||
+ | This is because some of the lists might be empty, in which case their //first// | ||
+ | values would be uninitialized. Concatenating in the correct order ensures that | ||
+ | each //first// value is valid just before we need it. | ||
+ | |||
+ | Afterwards, //first_A// indicates the first actor of the concatenated list, and | ||
+ | we know that the list must contain exactly //N// elements (since all actors are | ||
+ | present). We can traverse the list using an unrolled loop, stopping after the | ||
+ | //N//th element. Thus, we don't need to bother with an end-of-list marker at all. | ||
+ | The final next-pointer can contain garbage. | ||
+ | |||
+ | ===== The algorithm ===== | ||
+ | |||
+ | The complete algorithm, in pseude-code, | ||
+ | |||
+ | === Stage 1: Reset the tail-pointers === | ||
+ | |||
+ | < | ||
+ | for j = 0 to 15: | ||
+ | tail[j] = & | ||
+ | </ | ||
+ | |||
+ | === Stage 2: Collect actors by low-nybble === | ||
+ | |||
+ | < | ||
+ | for each actor i: | ||
+ | j = ypos[i] & $0f | ||
+ | *tail[j] = i | ||
+ | tail[j] = & | ||
+ | </ | ||
+ | |||
+ | === Stage 3: Concatenate the lists === | ||
+ | |||
+ | < | ||
+ | for j = 14 down to 0: | ||
+ | *tail[j] = first[j + 1] | ||
+ | |||
+ | i = first[0] | ||
+ | </ | ||
+ | |||
+ | === Stage 4: Reset the tail-pointers === | ||
+ | |||
+ | < | ||
+ | for j = 0 to 13: | ||
+ | tail[j] = & | ||
+ | </ | ||
+ | |||
+ | === Stage 5: Traverse intermediate list, collect actors by high-nybble === | ||
+ | |||
+ | < | ||
+ | repeat as many times as there are actors: | ||
+ | j = ypos[i] >> 4 | ||
+ | *tail[j] = i | ||
+ | tail[j] = & | ||
+ | i = next[i] | ||
+ | </ | ||
+ | |||
+ | === Stage 6: Concatenate the lists === | ||
+ | |||
+ | < | ||
+ | for j = 12 down to 0: | ||
+ | *tail[j] = first[j + 1] | ||
+ | |||
+ | i = first[0] | ||
+ | </ | ||
+ | |||
+ | === Stage 7: Push the elements in order === | ||
+ | |||
+ | < | ||
+ | repeat as many times as there are actors: | ||
+ | push i on stack | ||
+ | i = next[i] | ||
+ | </ | ||
+ | |||
+ | ===== Indexed indirect mode ===== | ||
+ | |||
+ | We need one more puzzle piece to see why this is efficient on the 6502. The | ||
+ | //indexed indirect addressing mode// is rarely useful, but here it suddenly | ||
+ | shines! | ||
+ | |||
+ | We keep the sixteen tail-pointers on the zero-page. Suppose we have the current | ||
+ | actor number (//i//) in the accumulator, | ||
+ | two in the X register. Then: | ||
+ | |||
+ | < | ||
+ | sta | ||
+ | sta | ||
+ | </ | ||
+ | |||
+ | As discussed previously, the next-pointers have to be located at the beginning | ||
+ | of a page, and the first-pointers have to be located somewhere on the same | ||
+ | page. But the first-pointers don't have to be consecutive. They will be | ||
+ | immediate-operands in the code of Stages 3 and 6 (list concatenation). | ||
+ | |||
+ | ===== 6502 implementation ===== | ||
+ | |||
+ | In order to truly minimize execution time, we will use separate memory areas | ||
+ | for the two passes. This allows us to initialize both areas at once. | ||
+ | |||
+ | < | ||
+ | .var N_ACTOR = 32 // Must be even. | ||
+ | | ||
+ | // Zero-page variables: | ||
+ | |||
+ | ypos: | ||
+ | |||
+ | loptr: | ||
+ | hiptr: | ||
+ | </ | ||
+ | |||
+ | < | ||
+ | one_time_init: | ||
+ | lda #> | ||
+ | .for(var i = 0; i < 16; i++) { | ||
+ | sta | ||
+ | } | ||
+ | lda #> | ||
+ | .for(var i = 0; i < 14; i++) { | ||
+ | sta | ||
+ | } | ||
+ | rts | ||
+ | </ | ||
+ | |||
+ | < | ||
+ | sort_actors: | ||
+ | // Stages 1 and 4: | ||
+ | |||
+ | // The clean version: | ||
+ | |||
+ | // .for(i = 0; i < 16; i++) { | ||
+ | // lda #< | ||
+ | // sta | ||
+ | // } | ||
+ | // .for(i = 0; i < 14; i++) { | ||
+ | // lda #< | ||
+ | // sta | ||
+ | // } | ||
+ | |||
+ | // But instead we do the following: | ||
+ | |||
+ | ldx #$fb | ||
+ | .for(var i = 0; i < 8; i++) { | ||
+ | lda #< | ||
+ | sax | ||
+ | sta | ||
+ | .if(i != 7) { | ||
+ | sax | ||
+ | sta | ||
+ | } | ||
+ | } | ||
+ | // 2 + 8 * 14 - 6 = 108 | ||
+ | |||
+ | // Stage 2: | ||
+ | |||
+ | .for(var i = 0; i < N_ACTOR; i++) { | ||
+ | ldy | ||
+ | ldx | ||
+ | lda #i | ||
+ | sta | ||
+ | sta | ||
+ | } | ||
+ | ldy #0 | ||
+ | jmp | ||
+ | |||
+ | // 32 * 19 + 5 = 613 | ||
+ | |||
+ | .align | ||
+ | lobits: | ||
+ | .fill $100, (i & 15) * 2 | ||
+ | lonext: | ||
+ | .fill | ||
+ | |||
+ | // Stage 3: | ||
+ | |||
+ | .align | ||
+ | join_lo: | ||
+ | .for(var i = 14; i >= 0; i--) { | ||
+ | lda # | ||
+ | sta | ||
+ | } | ||
+ | lda # | ||
+ | |||
+ | // 15 * 8 + 2 = 122 | ||
+ | |||
+ | // Stage 5: | ||
+ | |||
+ | tax | ||
+ | .for(var i = 0; i < N_ACTOR; i++) { | ||
+ | ldy | ||
+ | ldx | ||
+ | sta | ||
+ | sta | ||
+ | .if(i != N_ACTOR - 1) { | ||
+ | tay | ||
+ | lax | ||
+ | } | ||
+ | } | ||
+ | ldy #0 | ||
+ | jmp | ||
+ | |||
+ | // 2 + 32 * 24 - 6 + 5 = 769 | ||
+ | |||
+ | .align | ||
+ | hibits: | ||
+ | .fill $100, (i >> 4) * 2 | ||
+ | hinext: | ||
+ | .fill | ||
+ | |||
+ | // Stage 6: | ||
+ | |||
+ | .align | ||
+ | join_hi: | ||
+ | .for(var i = 12; i >= 0; i--) { | ||
+ | lda # | ||
+ | sta | ||
+ | } | ||
+ | lda # | ||
+ | |||
+ | // 13 * 8 + 2 = 106 | ||
+ | |||
+ | // Stage 7: | ||
+ | .for(var i = 0; i < N_ACTOR; i += 2) { | ||
+ | pha | ||
+ | tay | ||
+ | lax | ||
+ | pha | ||
+ | .if(i != N_ACTOR - 2) { | ||
+ | lda | ||
+ | } | ||
+ | } | ||
+ | // 16 * 16 - 4 = 252 | ||
+ | |||
+ | // Total cycle count: | ||
+ | // 108 + 613 + 122 + 769 + 106 + 252 = 1970 | ||
+ | |||
+ | // Note: Don't rts here, since there is data on the stack. | ||
+ | </ | ||
+ | |||
+ | ===== Variants ===== | ||
+ | |||
+ | Having the actor numbers on the stack isn't all that useful. If you'd rather | ||
+ | traverse the list during the visible portion of the display, as part of the | ||
+ | multiplexer, | ||
+ | |||
+ | < | ||
+ | sta | ||
+ | </ | ||
+ | |||
+ | And then, to obtain each successive actor: | ||
+ | |||
+ | < | ||
+ | ldx | ||
+ | lda | ||
+ | sta | ||
+ | |||
+ | // Do something with actor number X ... | ||
+ | </ | ||
+ | |||
+ | Sometimes you do want the actor numbers on the stack, but you want them pushed | ||
+ | in bottom-to-top order so the multiplexer can pick them up in top-to-bottom | ||
+ | order. This can be achieved by modifying the look-up tables. Remember that the | ||
+ | Y-coordinate is in the range $00–$df. Flip the range to reverse the sort: | ||
+ | |||
+ | < | ||
+ | lobits: | ||
+ | .fill $100, (($df - i) & 15) * 2 | ||
+ | </ | ||
+ | |||
+ | < | ||
+ | hibits: | ||
+ | .fill $100, (($df - i) >> 4) * 2 | ||
+ | </ |
base/a_faster_radix_sort.txt · Last modified: 2020-04-08 00:03 by lft