base:decoding_bitstreams
Differences
This shows you the differences between two versions of the page.
Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
base:decoding_bitstreams [2015-07-20 10:50] – lft | base:decoding_bitstreams [2015-07-20 11:28] (current) – [Conclusion] lft | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== Decoding bitstreams for fun and profit ====== | ||
+ | by lft | ||
+ | |||
+ | This article describes a technique for extracting bitfields from a long | ||
+ | sequence of bytes stored in RAM. | ||
+ | |||
+ | As an example application, | ||
+ | 5-bit character codes. The entire text could then be stored as a bitstream, | ||
+ | from which you read five bits at a time. But you might save some space if you | ||
+ | represent, say, the eight most common characters as 0xxx, and all other | ||
+ | characters as 1xxxxx. (This would also give you 40 different characters, rather | ||
+ | than 32.) In that case, you'd first want to read a 1-bit field, to | ||
+ | differentiate between the two cases. Then you'd read either a 3-bit field or a | ||
+ | 5-bit field. | ||
+ | |||
+ | We will discuss how to do this efficiently and elegantly on the 6502. In | ||
+ | particular, we will look at a technique that performs the two-stage procedure | ||
+ | described above, and even navigates arbitrary decision trees, as part of its | ||
+ | normal operation. | ||
+ | |||
+ | The schoolbook application for this kind of routine would be a Lempel-Ziv-Welch | ||
+ | decruncher or a Huffman decoder. But anything is possible! For instance, you | ||
+ | could use it to parse note events in a playroutine, | ||
+ | machine, or entropy encoded sound samples. | ||
+ | |||
+ | We will start with a simple design, and then add complexity step by step, also | ||
+ | optimising it to the point where the complete decoder is quite devilish to | ||
+ | follow. | ||
+ | |||
+ | ===== From bytes to bits ===== | ||
+ | |||
+ | At the heart of the bitfield decoder is the // | ||
+ | mini-buffer of pending bits, represented as a single byte in the zero-page. As | ||
+ | we shift out bits, the buffer occasionally becomes empty, at which time a new | ||
+ | byte is loaded into it. The clever part is how we represent the shifter. This | ||
+ | is an old established technique, but it can be rather baffling when you see it | ||
+ | for the first time. | ||
+ | |||
+ | The idea is that the shifter contains (from left to right) 0--7 bits of pending data, | ||
+ | followed by a single 1-bit that we'll refer to as the token, followed by zeros. | ||
+ | So, the following shifter contains three bits of data (1, 0, 1): | ||
+ | |||
+ | {{: | ||
+ | |||
+ | At program start, the shifter is initialised to $80. | ||
+ | |||
+ | Here is a first attempt at a getbit routine: | ||
+ | |||
+ | < | ||
+ | getbit | ||
+ | asl | ||
+ | bne done | ||
+ | |||
+ | jsr | ||
+ | sta | ||
+ | sec | ||
+ | rol | ||
+ | done | ||
+ | ; The bit is now in C. | ||
+ | rts | ||
+ | </ | ||
+ | |||
+ | In order to read a bit from the shifter, we first perform an ASL. Normally, | ||
+ | this puts one bit of data in the carry flag, while also preparing the shifter | ||
+ | for the next bit. But if the Z flag was set by the ASL, the buffer was in fact | ||
+ | empty, and we shifted out the token bit. In that case, we grab a new byte, | ||
+ | store it in the shifter, and then ROL to get the first data bit from the new | ||
+ | byte. The ROL will also shift in a new token bit. | ||
+ | |||
+ | In practice, it would be slow to call a subroutine in order to fetch new bytes. | ||
+ | After all, this will happen for every eighth bit, which is quite often. Instead | ||
+ | we'll use some self-modifying code, and keep a pointer to the next byte inside | ||
+ | an instruction operand, like this: | ||
+ | |||
+ | < | ||
+ | getbit | ||
+ | asl | ||
+ | bne done | ||
+ | mod_source | ||
+ | ldx | ||
+ | inc | ||
+ | bne | ||
+ | |||
+ | inc | ||
+ | no_new_page | ||
+ | stx | ||
+ | rol | ||
+ | done | ||
+ | ; The bit is now in C. | ||
+ | rts | ||
+ | </ | ||
+ | |||
+ | We're using the X register because we're going to need A for something else | ||
+ | soon. | ||
+ | |||
+ | Note that the SEC has now been removed, because carry is already set from the | ||
+ | previous token bit. If you want to get philosophical about it, you might say | ||
+ | that it's "the same" token bit that gets re-used over and over. | ||
+ | |||
+ | Next, we will rearrange the code to reduce the number of branch-taken penalty | ||
+ | cycles. From now on, we must make sure to CLC before calling getbit. | ||
+ | |||
+ | < | ||
+ | getbyte | ||
+ | mod_source | ||
+ | ldx | ||
+ | inc | ||
+ | bne | ||
+ | |||
+ | inc | ||
+ | no_new_page | ||
+ | stx | ||
+ | getbit | ||
+ | rol | ||
+ | beq | ||
+ | |||
+ | ; The bit is now in C. | ||
+ | rts | ||
+ | </ | ||
+ | |||
+ | ===== From bits to fields ===== | ||
+ | |||
+ | So now we can read individual bits from the stream. Let's pack them together | ||
+ | into bitfields! | ||
+ | |||
+ | We could of course call the getbit routine from a loop: | ||
+ | |||
+ | < | ||
+ | getfield | ||
+ | ; Y contains the requested number of bits | ||
+ | |||
+ | lda #0 | ||
+ | clc | ||
+ | field_loop | ||
+ | jsr | ||
+ | rol | ||
+ | dey | ||
+ | bne | ||
+ | |||
+ | ; The bitfield is now in A. | ||
+ | rts | ||
+ | </ | ||
+ | |||
+ | (This is why we had to preserve the A register during getbit/ | ||
+ | |||
+ | But again, subroutine calls are costly, so we'll merge getfield and getbit into | ||
+ | a single routine. However, getting a single bit is now slower, because we have | ||
+ | to treat it as a field of size one. | ||
+ | |||
+ | < | ||
+ | getbit | ||
+ | ldy #1 | ||
+ | getfield | ||
+ | ; Y contains the requested number of bits | ||
+ | |||
+ | lda #0 | ||
+ | clc | ||
+ | jmp | ||
+ | getbyte | ||
+ | mod_source | ||
+ | ldx | ||
+ | inc | ||
+ | bne | ||
+ | |||
+ | inc | ||
+ | no_new_page | ||
+ | stx | ||
+ | field_loop | ||
+ | rol | ||
+ | beq | ||
+ | |||
+ | rol | ||
+ | dey | ||
+ | bne | ||
+ | |||
+ | rts | ||
+ | </ | ||
+ | |||
+ | Note that, because we clear A at the beginning, we don't have to CLC before | ||
+ | looping back to '' | ||
+ | |||
+ | But we can do better than this! Instead of representing the requested number of | ||
+ | bits as an integer in the Y register, we can represent it as a single 1-bit in | ||
+ | the accumulator. As we shift new data into the accumulator, | ||
+ | closer and closer to the MSB, and when it finally falls off the edge, we | ||
+ | terminate the loop: | ||
+ | |||
+ | < | ||
+ | getbit | ||
+ | lda # | ||
+ | getfield | ||
+ | ; Position of 1-bit in A represents requested number of bits | ||
+ | |||
+ | clc | ||
+ | jmp | ||
+ | getbyte | ||
+ | mod_source | ||
+ | ldx | ||
+ | inc | ||
+ | bne | ||
+ | |||
+ | inc | ||
+ | no_new_page | ||
+ | stx | ||
+ | field_loop | ||
+ | rol | ||
+ | beq | ||
+ | |||
+ | rol | ||
+ | bcc | ||
+ | |||
+ | rts | ||
+ | </ | ||
+ | |||
+ | This preserves Y and saves two cycles per bit (DEY). | ||
+ | |||
+ | ===== Two-stage fields ===== | ||
+ | |||
+ | Given the above routine, we are now in a position to implement the scroller | ||
+ | scenario described in the introduction. Here is some code to fetch a new character from the | ||
+ | bitstream: | ||
+ | |||
+ | < | ||
+ | getchar | ||
+ | jsr | ||
+ | bne large | ||
+ | |||
+ | ; 3-bit character code | ||
+ | lda # | ||
+ | jmp | ||
+ | large | ||
+ | ; 5-bit character code | ||
+ | lda # | ||
+ | jsr | ||
+ | clc | ||
+ | adc #8 | ||
+ | rts | ||
+ | </ | ||
+ | |||
+ | Actually, we can shave off a byte and a pair of cycles by recognising that | ||
+ | getfield always returns with carry set: We can safely omit the CLC and do ADC | ||
+ | #7 instead. | ||
+ | |||
+ | In more complex scenarios, such as decrunchers, | ||
+ | between more than two cases. Perhaps we read two bits in order to select | ||
+ | between four differently-sized encodings: | ||
+ | |||
+ | ^Value range^Coded as ^Value offset (what to add to x)^ | ||
+ | |0-1 |00x |0 | | ||
+ | |2-5 |01xx | ||
+ | |6-21 | ||
+ | |22-149 | ||
+ | |||
+ | Rather than spelling out these four cases as different paths through the code, | ||
+ | we can use a table-based approach. This helps keep down the size of the | ||
+ | decruncher, which is often very important. It will also enable some more | ||
+ | optimisations further down the rabbit hole. | ||
+ | |||
+ | We will use one table for the field widths, and one table for the value | ||
+ | offsets. | ||
+ | |||
+ | < | ||
+ | getvalue | ||
+ | lda # | ||
+ | jsr | ||
+ | tay | ||
+ | lda | ||
+ | jsr | ||
+ | clc | ||
+ | adc | ||
+ | |||
+ | ; 9-bit value returned in A and C. | ||
+ | rts | ||
+ | |||
+ | fields | ||
+ | .byt %10000000 | ||
+ | .byt %01000000 | ||
+ | .byt %00010000 | ||
+ | .byt %00000010 | ||
+ | |||
+ | offsets | ||
+ | .byt 0 | ||
+ | .byt 2 | ||
+ | .byt 6 | ||
+ | .byt 22 | ||
+ | </ | ||
+ | |||
+ | Note that in the example, the maximum value returned is 149. Therefore, rather | ||
+ | than saying that the result is a 9-bit value, we could simply say that the | ||
+ | routine returns with carry undefined, and with an 8-bit result in A. We could | ||
+ | then eliminate the CLC, and compensate by subtracting one from each value in | ||
+ | the offset table. The reason why we can't do this for 9-bit values, is that the | ||
+ | first entry in the offset table would become $ff, and this would cause values 0 | ||
+ | and 1 to instead come out as 256 and 257. | ||
+ | |||
+ | ===== Decoding with arbitrary decision trees ===== | ||
+ | |||
+ | Consider again our scroller example. Suppose we wish to encode a particularly | ||
+ | common character (such as space) using a single bit. We might decide on the | ||
+ | following encoding scheme: | ||
+ | |||
+ | ^Value range^Coded as^Value offset (what to add to x)^ | ||
+ | |0 |0 | ||
+ | |1-8 |10xxx | ||
+ | |9-40 | ||
+ | |||
+ | To fetch a character now, we start by getting a single bit. Based on the value | ||
+ | of this bit, we're either done or we fetch one more bit. Based on this bit, we | ||
+ | then either fetch three or five bits. | ||
+ | |||
+ | This algorithm is essentially a tree of decisions, as illustrated by the | ||
+ | following flowchart: | ||
+ | |||
+ | {{: | ||
+ | |||
+ | We will refer to the rhombus-shaped nodes as //branch nodes// and the rounded-rectangle | ||
+ | nodes as //return nodes//. | ||
+ | |||
+ | Such decision trees are usually implemented explicitly, as code. But for large | ||
+ | trees, the decoder becomes unwieldy. Next up, we'll see how we can represent | ||
+ | decision trees more compactly using tables. | ||
+ | |||
+ | In each node of the flowchart above, we first fetch a bitfield (possibly of | ||
+ | size zero), and then either: | ||
+ | |||
+ | - Branch to a different node, or | ||
+ | - Add a constant and return. | ||
+ | |||
+ | It is time to introduce another decoding trick! So far, the field specifiers | ||
+ | (what we put in A prior to calling getfield) have consisted of a number of | ||
+ | zeros followed by a single set bit. But the remaining bits have no purpose yet, | ||
+ | and they will be available in A when getfield returns, shifted into a position | ||
+ | immediately to the left of the fetched bits. | ||
+ | |||
+ | So, if we call getfield with A set to '' | ||
+ | '' | ||
+ | |||
+ | The most significant bit of the tag will also be in the sign bit of the status | ||
+ | register. Some decrunchers, | ||
+ | value returned is complete, or whether it's just the high-byte of a larger | ||
+ | value. In the latter case, the low-byte can be grabbed very quickly straight | ||
+ | from the byte stream. Essentially, | ||
+ | two cases. | ||
+ | |||
+ | However, in the present technique, we wish to encode a generic decision tree, | ||
+ | and for this we'll have to use more tag bits. | ||
+ | |||
+ | (In the following, the word " | ||
+ | not 6502 branch instructions!) | ||
+ | |||
+ | Suppose we put a number on each node in the flowchart. The current node number | ||
+ | will be kept in the Y register. From this number, we can deduce (using a lookup | ||
+ | table) how many bits to fetch, whether we should branch or return after | ||
+ | fetching, and---in case of a branch node---what range of nodes we should branch | ||
+ | to. | ||
+ | |||
+ | All of this information can be encoded as a single byte, and placed in the | ||
+ | accumulator before calling getfield. As we have already seen, the number of | ||
+ | leading zeros determines the field width. They are followed by a single set bit | ||
+ | and a tag. We will use the most significant tag bit to keep track of what kind | ||
+ | of node we're in. If this bit is clear, we're in a branch node, in which case | ||
+ | the remaining tag bits will be used to encode the range of branch targets. | ||
+ | |||
+ | A separate lookup table, also indexed by the current node number in Y, will be | ||
+ | used to hold the constants that are added in return nodes. | ||
+ | |||
+ | < | ||
+ | decode | ||
+ | ldy # | ||
+ | |||
+ | decode_loop | ||
+ | ; Y represents the current node, and is an index into the field and | ||
+ | ; offset tables. | ||
+ | |||
+ | lda | ||
+ | |||
+ | ; In A, we now have: | ||
+ | ; a number of zero bits, controlling how many bits to fetch | ||
+ | ; a one bit | ||
+ | ; if we are in a return node: | ||
+ | ; a one bit (tag MSB) | ||
+ | ; fill up with zeros | ||
+ | ; if we are in a branch node: | ||
+ | ; a zero bit (tag MSB) | ||
+ | ; tag bits --> first target node (after shift) | ||
+ | |||
+ | ; Special exception to the above: | ||
+ | ; If we're going to fetch a zero-length field, A is zero. | ||
+ | ; Handle that now. | ||
+ | |||
+ | beq done | ||
+ | |||
+ | ; Otherwise, fetch the field. | ||
+ | |||
+ | jsr | ||
+ | |||
+ | ; In A, we now have: | ||
+ | ; a bit indicating whether we are in a branch or return node | ||
+ | ; more tag bits (all zero in case of a return node) | ||
+ | ; the field we just fetched | ||
+ | |||
+ | ; Are we in a return node? | ||
+ | |||
+ | bmi done | ||
+ | |||
+ | ; No, this was a branch node. The branch target is in A. | ||
+ | |||
+ | ; Note that the target has been constructed automatically by | ||
+ | ; concatenating the tag with the fetched bits. So if the tag was 0011 | ||
+ | ; and we fetched 101, we're going to branch to node 0011101. | ||
+ | |||
+ | tay | ||
+ | jmp | ||
+ | done | ||
+ | ; Add constant and return. | ||
+ | |||
+ | clc | ||
+ | adc | ||
+ | |||
+ | rts | ||
+ | fields | ||
+ | .byt %00000000 | ||
+ | .byt %10000001 | ||
+ | .byt %00110000 | ||
+ | .byt %00001100 | ||
+ | .byt %10000000 | ||
+ | |||
+ | offsets | ||
+ | .byt 0 ; Add constant to obtain range 0-0. | ||
+ | .byt 0 ; Unused (branch node) | ||
+ | .byt $80+1 ; Add constant to obtain range 1-8. | ||
+ | .byt $80+9 ; Add constant to obtain range 9-40. | ||
+ | .byt 0 ; Unused (branch node) | ||
+ | </ | ||
+ | |||
+ | A subtlety is that when we return without fetching anything (node 0), the | ||
+ | accumulator will be zero before adding the constant. Otherwise, the accumulator | ||
+ | will be $80, and we have to compensate accordingly in the offset table. | ||
+ | |||
+ | The above code was organised for clarity. However, we can rearrange the loop to | ||
+ | eliminate the JMP instruction. There' | ||
+ | constant Y, as we could just as well load A directly. Since the first node is | ||
+ | always a branch node, we won't be using Y after the fetch, so we can leave it | ||
+ | uninitialised. Hence: | ||
+ | |||
+ | < | ||
+ | decode | ||
+ | lda # | ||
+ | decode_loop | ||
+ | jsr | ||
+ | bmi done | ||
+ | |||
+ | tay | ||
+ | lda | ||
+ | bne | ||
+ | done | ||
+ | clc | ||
+ | adc | ||
+ | |||
+ | rts | ||
+ | |||
+ | fields | ||
+ | .byt %00000000 | ||
+ | .byt %10000001 | ||
+ | .byt %00110000 | ||
+ | .byt %00001100 | ||
+ | |||
+ | offsets | ||
+ | .byt 0 ; Add constant to obtain range 0-0. | ||
+ | .byt 0 ; Unused (branch node) | ||
+ | .byt $80+1 ; Add constant to obtain range 1-8. | ||
+ | .byt $80+9 ; Add constant to obtain range 9-40. | ||
+ | </ | ||
+ | |||
+ | The CLC at '' | ||
+ | from each table entry that corresponds to a return node where a non-zero-sized | ||
+ | field was fetched. | ||
+ | |||
+ | ===== Putting it all together ===== | ||
+ | |||
+ | Cramming an arbitrary decision tree into the field table is all very nifty, and | ||
+ | it keeps down the size of the decoder considerably. But what about performance? | ||
+ | Surely, putting a flowchart in a table can't be faster than simply coding it | ||
+ | with explicit branch instructions? | ||
+ | |||
+ | But as a consequence of the table-driven design, there is now a great | ||
+ | optimisation opportunity staring us in the face: We're down to a single call to | ||
+ | the getfield routine, and that means we can inline it! | ||
+ | |||
+ | < | ||
+ | decode | ||
+ | lda # | ||
+ | clc | ||
+ | jmp | ||
+ | getbyte | ||
+ | mod_source | ||
+ | ldx | ||
+ | inc | ||
+ | bne | ||
+ | |||
+ | inc | ||
+ | no_new_page | ||
+ | stx | ||
+ | decode_loop | ||
+ | rol | ||
+ | beq | ||
+ | |||
+ | rol | ||
+ | bcc | ||
+ | |||
+ | bmi done | ||
+ | |||
+ | tay | ||
+ | lda | ||
+ | clc | ||
+ | bne | ||
+ | done | ||
+ | ; Carry will be set if we got here via the BMI, i.e. after fetching a | ||
+ | ; non-zero-sized field. Compensate in the table. | ||
+ | adc | ||
+ | |||
+ | rts | ||
+ | |||
+ | fields | ||
+ | .byt %00000000 | ||
+ | .byt %10000001 | ||
+ | .byt %00110000 | ||
+ | .byt %00001100 | ||
+ | |||
+ | offsets | ||
+ | .byt 0 ; Add constant to obtain range 0-0 (Carry clear). | ||
+ | .byt 0 ; Unused (branch node) | ||
+ | .byt $7f+1 ; Add constant to obtain range 1-8 (Carry set). | ||
+ | .byt $7f+9 ; Add constant to obtain range 9-40 (Carry set). | ||
+ | </ | ||
+ | |||
+ | Indeed, with such a flexible routine, one might even be able to drive all | ||
+ | decoding from a single call site, and thus to inline the call to the decoder | ||
+ | itself. For a real-world example of this, please have a look at the decruncher | ||
+ | in [[http:// | ||
+ | |||
+ | ===== A final touch ===== | ||
+ | |||
+ | The code is already looking rather streamlined, | ||
+ | more optimisation: | ||
+ | decision tree, by eliminating the CLC right before branching back to | ||
+ | '' | ||
+ | |||
+ | The following trick is only possible if, for each node, the number in the field | ||
+ | table is either zero (for a zero-size fetch) or strictly larger than the node | ||
+ | number. Many decision trees have this property, because node numbers are small | ||
+ | integers, while numbers in the field table tend to be large. If not, it may be | ||
+ | possible to fix it by rearranging the node numbers. | ||
+ | |||
+ | The idea is to access the table a little differently: | ||
+ | from it, we perform an ADC. Naturally, we then have to compensate in the table, | ||
+ | by subtracting from each element the node number (which happens to be in A at | ||
+ | the time of the addition) and 1 (for the carry flag, which is set). | ||
+ | |||
+ | With that, we are ready for the final version of the decoder. It is listed | ||
+ | below in the form of a subroutine, but, as mentioned earlier, it should be | ||
+ | inlined for maximum performance. | ||
+ | |||
+ | < | ||
+ | decode | ||
+ | lda # | ||
+ | clc | ||
+ | jmp | ||
+ | getbyte | ||
+ | mod_source | ||
+ | ldx | ||
+ | inc | ||
+ | bne | ||
+ | |||
+ | inc | ||
+ | no_new_page | ||
+ | stx | ||
+ | decode_loop | ||
+ | rol | ||
+ | beq | ||
+ | |||
+ | rol | ||
+ | bcc | ||
+ | |||
+ | bmi done | ||
+ | |||
+ | tay | ||
+ | adc | ||
+ | bne | ||
+ | done | ||
+ | ; Carry is set. | ||
+ | adc | ||
+ | |||
+ | rts | ||
+ | |||
+ | fields | ||
+ | .byt %00000000-0-1 | ||
+ | .byt %10000001-1-1 | ||
+ | .byt %00110000-2-1 | ||
+ | .byt %00001100-3-1 | ||
+ | |||
+ | offsets | ||
+ | .byt $ff ; Add constant to obtain range 0-0. | ||
+ | .byt 0 ; Unused (branch node) | ||
+ | .byt $7f+1 ; Add constant to obtain range 1-8. | ||
+ | .byt $7f+9 ; Add constant to obtain range 9-40. | ||
+ | </ | ||
+ | |||
+ | ===== Conclusion ===== | ||
+ | |||
+ | We have seen how to extract bitfields from byte sequences stored in RAM, using a highly efficient technique that is capable of navigating arbitrary decision trees as part of the decoding process. |