====== A Faster Radix Sort ======
by lft
This article describes an implementation of //radix sort// optimized for sprite
multiplexers. The input is a set of actors numbered 0 to //N//-1, where each actor
has a Y-position in the range 0–223 stored in an array (called ''ypos'') on the
zero-page. The routine will push the //N// actor numbers on the stack, in order of
increasing Y-coordinates.
For a real multiplexer, it is actually preferable to either:
* Push them in order of decreasing Y-coordinates, so they can be popped in top-to-bottom order, or
* Not push them at all, but instead let the multiplexer traverse a linked list.
It is straightforward to modify the presented algorithm to meet the above
requirements, and this will be discussed at the end. But the code is easier to
understand if everything is kept in forward order during the sort.
The complete implementation is included at the end of the article, but the code
is probably difficult to follow unless you read the text first.
===== Performance =====
The total execution time for 32 actors is 1970 cycles, which is 61.6 cycles per
actor. Every additional actor adds 51 cycles to the execution time. The
execution time is predictable, i.e. jitter-free.
The size of the code is about 2 kB, and it needs 60 bytes of zero-page storage
in addition to the Y-positions. It is possible to reduce this to 32 bytes at
the cost of a few extra cycles; doing so is left as an exercise for the reader.
===== Theory =====
We will perform a two-pass radix sort based on hexadecimal digits.
In the first pass, we place each actor in one of sixteen bins, based on the low
nybble of their Y-position. We use linked lists to represent the bins, but the
order of elements in each list is unimportant at this stage.
In the second pass, we take the actors out of the bins, starting with bin 0,
then bin 1, etc., and place each actor at the end of one of fourteen lists,
based on the high nybble. In this way, we end up with fourteen sorted lists:
The first list contains every actor with a Y-position in the range $00–$0f, in
order. The second list has all the actors with positions $10–$1f, in order. The
fourteenth list has all the actors with positions $d0–$df, in order.
Finally, we traverse each of the fourteen lists in order, pushing the elements
on the stack as we go. This completes the radix sort.
===== The linked lists =====
A linked list is typically represented by a global variable, indicating the
first element of the list, and a next-pointer for each actor. End-of-list can
be indicated by an invalid actor number, such as $ff.
Now, a common way of adding an element to a linked list is the following (in
pseudo-code):
next[i] = first
first = i
This inserts the new element at the beginning of the list. If //i// is already in
a register, this can be implemented with one load instruction and two store
instructions.
But we will do something else: We introduce a //tail-pointer//, a variable to
keep track of the address of the current end of the list. This variable might
point to the variable called //first//, or it might point to one of the cells
of the array called //next//. Now we can add an element to the end of the list like
this:
*tail = i
tail = &next[i]
If we make sure that the //next// array and the //first// variable are
located on the same page in memory, then we only need to update the LSB of the
pointer. Furthermore, if we place the //next// array at the very beginning of a
page, the pointer to element //i// becomes equal to //i//. Thus:
*tail = i
tail.lsb = i
If //i// is already in a register, this can be implemented with just two store
instructions—eliminating the need for a load.
In our algorithm, the actors are never located in more than one list at the
same time, so in principle we only need a single global array of next-pointers. But it turns out
that we can save a couple of cycles if we use two arrays, one for each pass of
the radix sort.
In the first pass, we need sixteen first-pointers and sixteen tail-pointers (one for each bin). In the
second pass, we need fourteen first-pointers and fourteen tail-pointers.
===== Chaining lists together =====
After the first stage, we have sixteen lists. In the second stage, we traverse
the elements of each of these lists in order. Observe that this is essentially
the same as if we first concatenate the sixteen lists into one, and then
traverse all the elements of that combined list.
Thanks to the tail-pointer representation, concatenating two lists is easy:
*tail_A = first_B
To concatenate several lists, we have to start at the end and move towards the
beginning of the final list:
*tail_C = first_D
*tail_B = first_C
*tail_A = first_B
This is because some of the lists might be empty, in which case their //first//
values would be uninitialized. Concatenating in the correct order ensures that
each //first// value is valid just before we need it.
Afterwards, //first_A// indicates the first actor of the concatenated list, and
we know that the list must contain exactly //N// elements (since all actors are
present). We can traverse the list using an unrolled loop, stopping after the
//N//th element. Thus, we don't need to bother with an end-of-list marker at all.
The final next-pointer can contain garbage.
===== The algorithm =====
The complete algorithm, in pseude-code, is as follows:
=== Stage 1: Reset the tail-pointers ===
for j = 0 to 15:
tail[j] = &first[j]
=== Stage 2: Collect actors by low-nybble ===
for each actor i:
j = ypos[i] & $0f
*tail[j] = i
tail[j] = &next[i] // i.e. tail[j].lsb = i
=== Stage 3: Concatenate the lists ===
for j = 14 down to 0:
*tail[j] = first[j + 1]
i = first[0] // Prepare a list iterator for Stage 5
=== Stage 4: Reset the tail-pointers ===
for j = 0 to 13:
tail[j] = &first[j]
=== Stage 5: Traverse intermediate list, collect actors by high-nybble ===
repeat as many times as there are actors:
j = ypos[i] >> 4
*tail[j] = i
tail[j] = &next[i] // i.e. tail[j].lsb = i
i = next[i]
=== Stage 6: Concatenate the lists ===
for j = 12 down to 0:
*tail[j] = first[j + 1]
i = first[0] // Prepare a list iterator for Stage 7
=== Stage 7: Push the elements in order ===
repeat as many times as there are actors:
push i on stack
i = next[i]
===== Indexed indirect mode =====
We need one more puzzle piece to see why this is efficient on the 6502. The
//indexed indirect addressing mode// is rarely useful, but here it suddenly
shines!
We keep the sixteen tail-pointers on the zero-page. Suppose we have the current
actor number (//i//) in the accumulator, and the desired list index (//j//) times
two in the X register. Then:
sta (tail_pointers,x) ; *tail[j] = i
sta tail_pointers,x ; tail[j].lsb = i
As discussed previously, the next-pointers have to be located at the beginning
of a page, and the first-pointers have to be located somewhere on the same
page. But the first-pointers don't have to be consecutive. They will be
immediate-operands in the code of Stages 3 and 6 (list concatenation).
===== 6502 implementation =====
In order to truly minimize execution time, we will use separate memory areas
for the two passes. This allows us to initialize both areas at once.
.var N_ACTOR = 32 // Must be even.
// Zero-page variables:
ypos: .fill N_ACTOR, 0 // External input.
loptr: .fill 32, 0
hiptr: .fill 28, 0
one_time_init:
lda #>lonext
.for(var i = 0; i < 16; i++) {
sta loptr+i*2+1
}
lda #>hinext
.for(var i = 0; i < 14; i++) {
sta hiptr+i*2+1
}
rts
sort_actors:
// Stages 1 and 4:
// The clean version:
// .for(i = 0; i < 16; i++) {
// lda #= 0; i--) {
lda #0 // operand is first[i+1]
sta (loptr+2*i),y
}
lda #0 // operand is first[0]
// 15 * 8 + 2 = 122
// Stage 5:
tax
.for(var i = 0; i < N_ACTOR; i++) {
ldy ypos,x
ldx hibits,y
sta (hiptr,x)
sta hiptr,x
.if(i != N_ACTOR - 1) {
tay
lax lonext,y
}
}
ldy #0
jmp join_hi
// 2 + 32 * 24 - 6 + 5 = 769
.align $100
hibits:
.fill $100, (i >> 4) * 2
hinext:
.fill N_ACTOR, 0
// Stage 6:
.align 8
join_hi:
.for(var i = 12; i >= 0; i--) {
lda #0 // operand is first[i+1]
sta (hiptr+2*i),y
}
lda #0 // operand is first[0]
// 13 * 8 + 2 = 106
// Stage 7:
.for(var i = 0; i < N_ACTOR; i += 2) {
pha
tay
lax hinext,y
pha
.if(i != N_ACTOR - 2) {
lda hinext,x
}
}
// 16 * 16 - 4 = 252
// Total cycle count:
// 108 + 613 + 122 + 769 + 106 + 252 = 1970
// Note: Don't rts here, since there is data on the stack.
===== Variants =====
Having the actor numbers on the stack isn't all that useful. If you'd rather
traverse the list during the visible portion of the display, as part of the
multiplexer, simply replace all of Stage 7 by:
sta next_actor
And then, to obtain each successive actor:
ldx next_actor
lda hinext,x
sta next_actor
// Do something with actor number X ...
Sometimes you do want the actor numbers on the stack, but you want them pushed
in bottom-to-top order so the multiplexer can pick them up in top-to-bottom
order. This can be achieved by modifying the look-up tables. Remember that the
Y-coordinate is in the range $00–$df. Flip the range to reverse the sort:
lobits:
.fill $100, (($df - i) & 15) * 2
hibits:
.fill $100, (($df - i) >> 4) * 2