====== Shell Sort (for 16-bit Elements) ======
by Fredrik Ramsberg, 29 Dec 2004
Here's my implementation of Shell Sort, a sort algorithm with rather amazing properties. All mathematicians that have tried to analyze this algorithm have failed, but there's lots of empirical data that suggests it's roughly O(n log n). I wish I could say I invented this algorithm, but I didn't. This is an equivalent Javascript routine that mimics the Shell Shot presented here:
function ShellSort(arr,length) {
var j, i, v, h, k;
for (h=1; h < length; h=3*h+1);
while (h=(h-1)/3)
for (i=h, j=i, v=arr[i]; i<=length; arr[j+h]=v, i++, j=i, v=arr[i])
while((j-=h) >= 0 && arr[j] > v)
arr[j+h]=arr[j];
}
This is source code for what is meant to be an efficient implementation of Shell Sort in 6502 assembler. This implementation can sort more than 32,000 16-bit values. The only reason it can't sort 32,767 values is that there still has to be room for the routine and a few bytes for temporary storage. There's presently no other sorting routine in the repository that can handle more than 256 values. The source code is in a format suitable for the excellent and free ACME cross-assembler by Marco Baye, but should be easy to convert for other assemblers.
While Shell Sort is very good for entirely unsorted arrays, it is also reasonably good for almost sorted arrays. However, if you happen to know that very few values are out of place OR that the values that are out of place are not very far from their right position, Insertion Sort is a better choice. Insertion Sort is also provided here, since Shell Sort is really just a clever extension of Insertion Sort.
Here are some examples of sort times @ 1MHz (10,000 values):
^ Operation ^ Insertion Sort ^ Shell Sort ^
| Array is entirely sorted from the start | 1.9s | 14.7s |
| 1 value is at the wrong end of the array | 2.9s | 15.7s |
| 10 values are at the wrong end of the array | 11.8s | 16.9s |
| 50 values are at the wrong end of the array | 51.3s | 17.6s |
| Array is entirely unsorted | 2464.2s | 30.5s |
To call the routine, create a word-array at address nnnn in memory. The first word should contain the number of bytes to be sorted (= 2 * the number of elements), then come all those elements. Next, sort the elements using Shell Short like this:
lda #nnnn
jsr shell_sort
or to perform an Insertion Sort:
lda #nnnn
jsr insertion_sort
In the code snippits above, < means the low-byte and > means the high-byte. Some assemblers use x & $FF for the low-byte and nnnn >> 8 for the high-byte.
Source Code for the Shell Sort (with Insertion Sort):
!to "shellsrt.o" ; An assembler directive to set out-file
!sl "shelllbl.a" ; Tells the assembler to write all label
; values to a file
*=$1000 ; Start address. Can safely be set to
; anything from $0100 to $fe00
j=$fb ; Uses two bytes. Has to be on zero-page
j_plus_h=$fd ; Uses two bytes. Has to be on zero-page
arr_length = j_plus_h ; Can safely use the same location as
; j_plus_h, but doesn't have to be on ZP
shell_sort ldy #h_high - h_low - 1
bne sort_main ; Always branch
insertion_sort ldy #0
sort_main sty h_start_index
cld
sta j
sta in_address
clc
adc #2
sta arr_start
stx j + 1
stx in_address + 1
txa
adc #0
sta arr_start + 1
ldy #0
lda (j),y
sta arr_length
clc
adc arr_start
sta arr_end
iny
lda (j),y
sta arr_length + 1
adc arr_start + 1
sta arr_end + 1
; for (h=1; h < length; h=3*h+1);
ldx h_start_index ; Start with highest value of h
chk_prev_h lda h_low,x
cmp arr_length
lda h_high,x
sbc arr_length + 1
bcc end_of_init ; If h < array_length, we've found the right h
dex
bpl chk_prev_h
rts ; array length is 0 or 1. No sorting needed.
end_of_init inx
stx h_index
; while (h=(h-1)/3)
h_loop dec h_index
bpl get_h
rts ; All done!
get_h ldy h_index
lda h_low,y
sta h
clc
adc in_address ; ( in_address is arr_start - 2)
sta i
lda h_high,y
sta h + 1
adc in_address + 1
sta i + 1
; for (i=h, j=i, v=arr[i]; i<=length; arr[j+h]=v, i++, j=i, v=arr[i])
i_loop lda i
clc
adc #2
sta i
sta j
lda i + 1
adc #0
sta i + 1
sta j + 1
ldx i
cpx arr_end
lda i + 1
sbc arr_end + 1
bcs h_loop
ldy #0
lda (j),y
sta v
clc
adc #1
sta v_plus_1
iny
lda (j),y
sta v + 1
adc #0
bcs i_loop ; v=$ffff, so no j-loop necessary
sta v_plus_1 + 1
dey ; Set y=0
; while((j-=h) >= 0 && arr[j] > v)
j_loop lda j
sta j_plus_h
sec
sbc h
sta j
tax
lda j + 1
sta j_plus_h + 1
sbc h + 1
sta j + 1
; Check if we've reached the bottom of the array
bcc exit_j_loop
cpx arr_start
sbc arr_start + 1
bcc exit_j_loop
; Do the actual comparison: arr[j] > v
lda (j),y
tax
iny ; Set y=1
lda (j),y
cpx v_plus_1
sbc v_plus_1 + 1
bcc exit_j_loop
; arr[j+h]=arr[j];
lda (j),y
sta (j_plus_h),y
dey ; Set y=0
txa
sta (j_plus_h),y
bcs j_loop ; Always branch
; for (i=h, j=i, v=arr[i]; i2, >8, >26, >80, >242, >728, >2186, >6560, >19682
h_start_index !byte 0
h_index !byte 0
h !word 0
in_address !word 0
arr_start !word 0
arr_end !word 0
i !word 0
v !word 0
v_plus_1 !word 0
To increase speed and reduce code size, you can optionally place one or more of these 2-byte fields on zero-page (the suggested values work on a Commodore 64):
v_plus_1 = $5
h = $7
arr_start = $A
Some simple tests using an array of 10,000 completely unsorted values showed a 5.6% shorter execution time if all three fields were placed on ZP, with v_plus_1 being a little more important than the others.
To go even further, placing these 2-byte fields on zero-page will provide a small improvement:
v
i
arr_end
(This paragraph is added by litwr.) It is possible to speed up this sort by 15-25%. This requires only to change //h_high// and //h_low// tables. For example,
h_low .byte <2, <8, <20, <46, <114, <264, <602, <1402, <3500, <9518, <25846
h_high .byte >2, >8, >20, >46, >114, >264, >602, >1402, >3500, >9518, >25846
will make the trick.