base:decoding_bitstreams

This shows you the differences between two versions of the page.

Both sides previous revision Previous revision | |||

base:decoding_bitstreams [2015-07-20 11:28] lft |
base:decoding_bitstreams [2015-07-20 11:28] (current) lft [Conclusion] |
||
---|---|---|---|

Line 1: | Line 1: | ||

+ | ====== Decoding bitstreams for fun and profit ====== | ||

+ | by lft | ||

+ | |||

+ | This article describes a technique for extracting bitfields from a long | ||

+ | sequence of bytes stored in RAM. | ||

+ | |||

+ | As an example application, | ||

+ | 5-bit character codes. The entire text could then be stored as a bitstream, | ||

+ | from which you read five bits at a time. But you might save some space if you | ||

+ | represent, say, the eight most common characters as 0xxx, and all other | ||

+ | characters as 1xxxxx. (This would also give you 40 different characters, rather | ||

+ | than 32.) In that case, you'd first want to read a 1-bit field, to | ||

+ | differentiate between the two cases. Then you'd read either a 3-bit field or a | ||

+ | 5-bit field. | ||

+ | |||

+ | We will discuss how to do this efficiently and elegantly on the 6502. In | ||

+ | particular, we will look at a technique that performs the two-stage procedure | ||

+ | described above, and even navigates arbitrary decision trees, as part of its | ||

+ | normal operation. | ||

+ | |||

+ | The schoolbook application for this kind of routine would be a Lempel-Ziv-Welch | ||

+ | decruncher or a Huffman decoder. But anything is possible! For instance, you | ||

+ | could use it to parse note events in a playroutine, | ||

+ | machine, or entropy encoded sound samples. | ||

+ | |||

+ | We will start with a simple design, and then add complexity step by step, also | ||

+ | optimising it to the point where the complete decoder is quite devilish to | ||

+ | follow. | ||

+ | |||

+ | ===== From bytes to bits ===== | ||

+ | |||

+ | At the heart of the bitfield decoder is the // | ||

+ | mini-buffer of pending bits, represented as a single byte in the zero-page. As | ||

+ | we shift out bits, the buffer occasionally becomes empty, at which time a new | ||

+ | byte is loaded into it. The clever part is how we represent the shifter. This | ||

+ | is an old established technique, but it can be rather baffling when you see it | ||

+ | for the first time. | ||

+ | |||

+ | The idea is that the shifter contains (from left to right) 0--7 bits of pending data, | ||

+ | followed by a single 1-bit that we'll refer to as the token, followed by zeros. | ||

+ | So, the following shifter contains three bits of data (1, 0, 1): | ||

+ | |||

+ | {{: | ||

+ | |||

+ | At program start, the shifter is initialised to $80. | ||

+ | |||

+ | Here is a first attempt at a getbit routine: | ||

+ | |||

+ | < | ||

+ | getbit | ||

+ | asl | ||

+ | bne done | ||

+ | |||

+ | jsr | ||

+ | sta | ||

+ | sec | ||

+ | rol | ||

+ | done | ||

+ | ; The bit is now in C. | ||

+ | rts | ||

+ | </ | ||

+ | |||

+ | In order to read a bit from the shifter, we first perform an ASL. Normally, | ||

+ | this puts one bit of data in the carry flag, while also preparing the shifter | ||

+ | for the next bit. But if the Z flag was set by the ASL, the buffer was in fact | ||

+ | empty, and we shifted out the token bit. In that case, we grab a new byte, | ||

+ | store it in the shifter, and then ROL to get the first data bit from the new | ||

+ | byte. The ROL will also shift in a new token bit. | ||

+ | |||

+ | In practice, it would be slow to call a subroutine in order to fetch new bytes. | ||

+ | After all, this will happen for every eighth bit, which is quite often. Instead | ||

+ | we'll use some self-modifying code, and keep a pointer to the next byte inside | ||

+ | an instruction operand, like this: | ||

+ | |||

+ | < | ||

+ | getbit | ||

+ | asl | ||

+ | bne done | ||

+ | mod_source | ||

+ | ldx | ||

+ | inc | ||

+ | bne | ||

+ | |||

+ | inc | ||

+ | no_new_page | ||

+ | stx | ||

+ | rol | ||

+ | done | ||

+ | ; The bit is now in C. | ||

+ | rts | ||

+ | </ | ||

+ | |||

+ | We're using the X register because we're going to need A for something else | ||

+ | soon. | ||

+ | |||

+ | Note that the SEC has now been removed, because carry is already set from the | ||

+ | previous token bit. If you want to get philosophical about it, you might say | ||

+ | that it's "the same" token bit that gets re-used over and over. | ||

+ | |||

+ | Next, we will rearrange the code to reduce the number of branch-taken penalty | ||

+ | cycles. From now on, we must make sure to CLC before calling getbit. | ||

+ | |||

+ | < | ||

+ | getbyte | ||

+ | mod_source | ||

+ | ldx | ||

+ | inc | ||

+ | bne | ||

+ | |||

+ | inc | ||

+ | no_new_page | ||

+ | stx | ||

+ | getbit | ||

+ | rol | ||

+ | beq | ||

+ | |||

+ | ; The bit is now in C. | ||

+ | rts | ||

+ | </ | ||

+ | |||

+ | ===== From bits to fields ===== | ||

+ | |||

+ | So now we can read individual bits from the stream. Let's pack them together | ||

+ | into bitfields! | ||

+ | |||

+ | We could of course call the getbit routine from a loop: | ||

+ | |||

+ | < | ||

+ | getfield | ||

+ | ; Y contains the requested number of bits | ||

+ | |||

+ | lda #0 | ||

+ | clc | ||

+ | field_loop | ||

+ | jsr | ||

+ | rol | ||

+ | dey | ||

+ | bne | ||

+ | |||

+ | ; The bitfield is now in A. | ||

+ | rts | ||

+ | </ | ||

+ | |||

+ | (This is why we had to preserve the A register during getbit/ | ||

+ | |||

+ | But again, subroutine calls are costly, so we'll merge getfield and getbit into | ||

+ | a single routine. However, getting a single bit is now slower, because we have | ||

+ | to treat it as a field of size one. | ||

+ | |||

+ | < | ||

+ | getbit | ||

+ | ldy #1 | ||

+ | getfield | ||

+ | ; Y contains the requested number of bits | ||

+ | |||

+ | lda #0 | ||

+ | clc | ||

+ | jmp | ||

+ | getbyte | ||

+ | mod_source | ||

+ | ldx | ||

+ | inc | ||

+ | bne | ||

+ | |||

+ | inc | ||

+ | no_new_page | ||

+ | stx | ||

+ | field_loop | ||

+ | rol | ||

+ | beq | ||

+ | |||

+ | rol | ||

+ | dey | ||

+ | bne | ||

+ | |||

+ | rts | ||

+ | </ | ||

+ | |||

+ | Note that, because we clear A at the beginning, we don't have to CLC before | ||

+ | looping back to '' | ||

+ | |||

+ | But we can do better than this! Instead of representing the requested number of | ||

+ | bits as an integer in the Y register, we can represent it as a single 1-bit in | ||

+ | the accumulator. As we shift new data into the accumulator, | ||

+ | closer and closer to the MSB, and when it finally falls off the edge, we | ||

+ | terminate the loop: | ||

+ | |||

+ | < | ||

+ | getbit | ||

+ | lda # | ||

+ | getfield | ||

+ | ; Position of 1-bit in A represents requested number of bits | ||

+ | |||

+ | clc | ||

+ | jmp | ||

+ | getbyte | ||

+ | mod_source | ||

+ | ldx | ||

+ | inc | ||

+ | bne | ||

+ | |||

+ | inc | ||

+ | no_new_page | ||

+ | stx | ||

+ | field_loop | ||

+ | rol | ||

+ | beq | ||

+ | |||

+ | rol | ||

+ | bcc | ||

+ | |||

+ | rts | ||

+ | </ | ||

+ | |||

+ | This preserves Y and saves two cycles per bit (DEY). | ||

+ | |||

+ | ===== Two-stage fields ===== | ||

+ | |||

+ | Given the above routine, we are now in a position to implement the scroller | ||

+ | scenario described in the introduction. Here is some code to fetch a new character from the | ||

+ | bitstream: | ||

+ | |||

+ | < | ||

+ | getchar | ||

+ | jsr | ||

+ | bne large | ||

+ | |||

+ | ; 3-bit character code | ||

+ | lda # | ||

+ | jmp | ||

+ | large | ||

+ | ; 5-bit character code | ||

+ | lda # | ||

+ | jsr | ||

+ | clc | ||

+ | adc #8 | ||

+ | rts | ||

+ | </ | ||

+ | |||

+ | Actually, we can shave off a byte and a pair of cycles by recognising that | ||

+ | getfield always returns with carry set: We can safely omit the CLC and do ADC | ||

+ | #7 instead. | ||

+ | |||

+ | In more complex scenarios, such as decrunchers, | ||

+ | between more than two cases. Perhaps we read two bits in order to select | ||

+ | between four differently-sized encodings: | ||

+ | |||

+ | ^Value range^Coded as ^Value offset (what to add to x)^ | ||

+ | |0-1 |00x |0 | | ||

+ | |2-5 |01xx | ||

+ | |6-21 | ||

+ | |22-149 | ||

+ | |||

+ | Rather than spelling out these four cases as different paths through the code, | ||

+ | we can use a table-based approach. This helps keep down the size of the | ||

+ | decruncher, which is often very important. It will also enable some more | ||

+ | optimisations further down the rabbit hole. | ||

+ | |||

+ | We will use one table for the field widths, and one table for the value | ||

+ | offsets. | ||

+ | |||

+ | < | ||

+ | getvalue | ||

+ | lda # | ||

+ | jsr | ||

+ | tay | ||

+ | lda | ||

+ | jsr | ||

+ | clc | ||

+ | adc | ||

+ | |||

+ | ; 9-bit value returned in A and C. | ||

+ | rts | ||

+ | |||

+ | fields | ||

+ | .byt %10000000 | ||

+ | .byt %01000000 | ||

+ | .byt %00010000 | ||

+ | .byt %00000010 | ||

+ | |||

+ | offsets | ||

+ | .byt 0 | ||

+ | .byt 2 | ||

+ | .byt 6 | ||

+ | .byt 22 | ||

+ | </ | ||

+ | |||

+ | Note that in the example, the maximum value returned is 149. Therefore, rather | ||

+ | than saying that the result is a 9-bit value, we could simply say that the | ||

+ | routine returns with carry undefined, and with an 8-bit result in A. We could | ||

+ | then eliminate the CLC, and compensate by subtracting one from each value in | ||

+ | the offset table. The reason why we can't do this for 9-bit values, is that the | ||

+ | first entry in the offset table would become $ff, and this would cause values 0 | ||

+ | and 1 to instead come out as 256 and 257. | ||

+ | |||

+ | ===== Decoding with arbitrary decision trees ===== | ||

+ | |||

+ | Consider again our scroller example. Suppose we wish to encode a particularly | ||

+ | common character (such as space) using a single bit. We might decide on the | ||

+ | following encoding scheme: | ||

+ | |||

+ | ^Value range^Coded as^Value offset (what to add to x)^ | ||

+ | |0 |0 | ||

+ | |1-8 |10xxx | ||

+ | |9-40 | ||

+ | |||

+ | To fetch a character now, we start by getting a single bit. Based on the value | ||

+ | of this bit, we're either done or we fetch one more bit. Based on this bit, we | ||

+ | then either fetch three or five bits. | ||

+ | |||

+ | This algorithm is essentially a tree of decisions, as illustrated by the | ||

+ | following flowchart: | ||

+ | |||

+ | {{: | ||

+ | |||

+ | We will refer to the rhombus-shaped nodes as //branch nodes// and the rounded-rectangle | ||

+ | nodes as //return nodes//. | ||

+ | |||

+ | Such decision trees are usually implemented explicitly, as code. But for large | ||

+ | trees, the decoder becomes unwieldy. Next up, we'll see how we can represent | ||

+ | decision trees more compactly using tables. | ||

+ | |||

+ | In each node of the flowchart above, we first fetch a bitfield (possibly of | ||

+ | size zero), and then either: | ||

+ | |||

+ | - Branch to a different node, or | ||

+ | - Add a constant and return. | ||

+ | |||

+ | It is time to introduce another decoding trick! So far, the field specifiers | ||

+ | (what we put in A prior to calling getfield) have consisted of a number of | ||

+ | zeros followed by a single set bit. But the remaining bits have no purpose yet, | ||

+ | and they will be available in A when getfield returns, shifted into a position | ||

+ | immediately to the left of the fetched bits. | ||

+ | |||

+ | So, if we call getfield with A set to '' | ||

+ | '' | ||

+ | |||

+ | The most significant bit of the tag will also be in the sign bit of the status | ||

+ | register. Some decrunchers, | ||

+ | value returned is complete, or whether it's just the high-byte of a larger | ||

+ | value. In the latter case, the low-byte can be grabbed very quickly straight | ||

+ | from the byte stream. Essentially, | ||

+ | two cases. | ||

+ | |||

+ | However, in the present technique, we wish to encode a generic decision tree, | ||

+ | and for this we'll have to use more tag bits. | ||

+ | |||

+ | (In the following, the word " | ||

+ | not 6502 branch instructions!) | ||

+ | |||

+ | Suppose we put a number on each node in the flowchart. The current node number | ||

+ | will be kept in the Y register. From this number, we can deduce (using a lookup | ||

+ | table) how many bits to fetch, whether we should branch or return after | ||

+ | fetching, and---in case of a branch node---what range of nodes we should branch | ||

+ | to. | ||

+ | |||

+ | All of this information can be encoded as a single byte, and placed in the | ||

+ | accumulator before calling getfield. As we have already seen, the number of | ||

+ | leading zeros determines the field width. They are followed by a single set bit | ||

+ | and a tag. We will use the most significant tag bit to keep track of what kind | ||

+ | of node we're in. If this bit is clear, we're in a branch node, in which case | ||

+ | the remaining tag bits will be used to encode the range of branch targets. | ||

+ | |||

+ | A separate lookup table, also indexed by the current node number in Y, will be | ||

+ | used to hold the constants that are added in return nodes. | ||

+ | |||

+ | < | ||

+ | decode | ||

+ | ldy # | ||

+ | |||

+ | decode_loop | ||

+ | ; Y represents the current node, and is an index into the field and | ||

+ | ; offset tables. | ||

+ | |||

+ | lda | ||

+ | |||

+ | ; In A, we now have: | ||

+ | ; a number of zero bits, controlling how many bits to fetch | ||

+ | ; a one bit | ||

+ | ; if we are in a return node: | ||

+ | ; a one bit (tag MSB) | ||

+ | ; fill up with zeros | ||

+ | ; if we are in a branch node: | ||

+ | ; a zero bit (tag MSB) | ||

+ | ; tag bits --> first target node (after shift) | ||

+ | |||

+ | ; Special exception to the above: | ||

+ | ; If we're going to fetch a zero-length field, A is zero. | ||

+ | ; Handle that now. | ||

+ | |||

+ | beq done | ||

+ | |||

+ | ; Otherwise, fetch the field. | ||

+ | |||

+ | jsr | ||

+ | |||

+ | ; In A, we now have: | ||

+ | ; a bit indicating whether we are in a branch or return node | ||

+ | ; more tag bits (all zero in case of a return node) | ||

+ | ; the field we just fetched | ||

+ | |||

+ | ; Are we in a return node? | ||

+ | |||

+ | bmi done | ||

+ | |||

+ | ; No, this was a branch node. The branch target is in A. | ||

+ | |||

+ | ; Note that the target has been constructed automatically by | ||

+ | ; concatenating the tag with the fetched bits. So if the tag was 0011 | ||

+ | ; and we fetched 101, we're going to branch to node 0011101. | ||

+ | |||

+ | tay | ||

+ | jmp | ||

+ | done | ||

+ | ; Add constant and return. | ||

+ | |||

+ | clc | ||

+ | adc | ||

+ | |||

+ | rts | ||

+ | fields | ||

+ | .byt %00000000 | ||

+ | .byt %10000001 | ||

+ | .byt %00110000 | ||

+ | .byt %00001100 | ||

+ | .byt %10000000 | ||

+ | |||

+ | offsets | ||

+ | .byt 0 ; Add constant to obtain range 0-0. | ||

+ | .byt 0 ; Unused (branch node) | ||

+ | .byt $80+1 ; Add constant to obtain range 1-8. | ||

+ | .byt $80+9 ; Add constant to obtain range 9-40. | ||

+ | .byt 0 ; Unused (branch node) | ||

+ | </ | ||

+ | |||

+ | A subtlety is that when we return without fetching anything (node 0), the | ||

+ | accumulator will be zero before adding the constant. Otherwise, the accumulator | ||

+ | will be $80, and we have to compensate accordingly in the offset table. | ||

+ | |||

+ | The above code was organised for clarity. However, we can rearrange the loop to | ||

+ | eliminate the JMP instruction. There' | ||

+ | constant Y, as we could just as well load A directly. Since the first node is | ||

+ | always a branch node, we won't be using Y after the fetch, so we can leave it | ||

+ | uninitialised. Hence: | ||

+ | |||

+ | < | ||

+ | decode | ||

+ | lda # | ||

+ | decode_loop | ||

+ | jsr | ||

+ | bmi done | ||

+ | |||

+ | tay | ||

+ | lda | ||

+ | bne | ||

+ | done | ||

+ | clc | ||

+ | adc | ||

+ | |||

+ | rts | ||

+ | |||

+ | fields | ||

+ | .byt %00000000 | ||

+ | .byt %10000001 | ||

+ | .byt %00110000 | ||

+ | .byt %00001100 | ||

+ | |||

+ | offsets | ||

+ | .byt 0 ; Add constant to obtain range 0-0. | ||

+ | .byt 0 ; Unused (branch node) | ||

+ | .byt $80+1 ; Add constant to obtain range 1-8. | ||

+ | .byt $80+9 ; Add constant to obtain range 9-40. | ||

+ | </ | ||

+ | |||

+ | The CLC at '' | ||

+ | from each table entry that corresponds to a return node where a non-zero-sized | ||

+ | field was fetched. | ||

+ | |||

+ | ===== Putting it all together ===== | ||

+ | |||

+ | Cramming an arbitrary decision tree into the field table is all very nifty, and | ||

+ | it keeps down the size of the decoder considerably. But what about performance? | ||

+ | Surely, putting a flowchart in a table can't be faster than simply coding it | ||

+ | with explicit branch instructions? | ||

+ | |||

+ | But as a consequence of the table-driven design, there is now a great | ||

+ | optimisation opportunity staring us in the face: We're down to a single call to | ||

+ | the getfield routine, and that means we can inline it! | ||

+ | |||

+ | < | ||

+ | decode | ||

+ | lda # | ||

+ | clc | ||

+ | jmp | ||

+ | getbyte | ||

+ | mod_source | ||

+ | ldx | ||

+ | inc | ||

+ | bne | ||

+ | |||

+ | inc | ||

+ | no_new_page | ||

+ | stx | ||

+ | decode_loop | ||

+ | rol | ||

+ | beq | ||

+ | |||

+ | rol | ||

+ | bcc | ||

+ | |||

+ | bmi done | ||

+ | |||

+ | tay | ||

+ | lda | ||

+ | clc | ||

+ | bne | ||

+ | done | ||

+ | ; Carry will be set if we got here via the BMI, i.e. after fetching a | ||

+ | ; non-zero-sized field. Compensate in the table. | ||

+ | adc | ||

+ | |||

+ | rts | ||

+ | |||

+ | fields | ||

+ | .byt %00000000 | ||

+ | .byt %10000001 | ||

+ | .byt %00110000 | ||

+ | .byt %00001100 | ||

+ | |||

+ | offsets | ||

+ | .byt 0 ; Add constant to obtain range 0-0 (Carry clear). | ||

+ | .byt 0 ; Unused (branch node) | ||

+ | .byt $7f+1 ; Add constant to obtain range 1-8 (Carry set). | ||

+ | .byt $7f+9 ; Add constant to obtain range 9-40 (Carry set). | ||

+ | </ | ||

+ | |||

+ | Indeed, with such a flexible routine, one might even be able to drive all | ||

+ | decoding from a single call site, and thus to inline the call to the decoder | ||

+ | itself. For a real-world example of this, please have a look at the decruncher | ||

+ | in [[http:// | ||

+ | |||

+ | ===== A final touch ===== | ||

+ | |||

+ | The code is already looking rather streamlined, | ||

+ | more optimisation: | ||

+ | decision tree, by eliminating the CLC right before branching back to | ||

+ | '' | ||

+ | |||

+ | The following trick is only possible if, for each node, the number in the field | ||

+ | table is either zero (for a zero-size fetch) or strictly larger than the node | ||

+ | number. Many decision trees have this property, because node numbers are small | ||

+ | integers, while numbers in the field table tend to be large. If not, it may be | ||

+ | possible to fix it by rearranging the node numbers. | ||

+ | |||

+ | The idea is to access the table a little differently: | ||

+ | from it, we perform an ADC. Naturally, we then have to compensate in the table, | ||

+ | by subtracting from each element the node number (which happens to be in A at | ||

+ | the time of the addition) and 1 (for the carry flag, which is set). | ||

+ | |||

+ | With that, we are ready for the final version of the decoder. It is listed | ||

+ | below in the form of a subroutine, but, as mentioned earlier, it should be | ||

+ | inlined for maximum performance. | ||

+ | |||

+ | < | ||

+ | decode | ||

+ | lda # | ||

+ | clc | ||

+ | jmp | ||

+ | getbyte | ||

+ | mod_source | ||

+ | ldx | ||

+ | inc | ||

+ | bne | ||

+ | |||

+ | inc | ||

+ | no_new_page | ||

+ | stx | ||

+ | decode_loop | ||

+ | rol | ||

+ | beq | ||

+ | |||

+ | rol | ||

+ | bcc | ||

+ | |||

+ | bmi done | ||

+ | |||

+ | tay | ||

+ | adc | ||

+ | bne | ||

+ | done | ||

+ | ; Carry is set. | ||

+ | adc | ||

+ | |||

+ | rts | ||

+ | |||

+ | fields | ||

+ | .byt %00000000-0-1 | ||

+ | .byt %10000001-1-1 | ||

+ | .byt %00110000-2-1 | ||

+ | .byt %00001100-3-1 | ||

+ | |||

+ | offsets | ||

+ | .byt $ff ; Add constant to obtain range 0-0. | ||

+ | .byt 0 ; Unused (branch node) | ||

+ | .byt $7f+1 ; Add constant to obtain range 1-8. | ||

+ | .byt $7f+9 ; Add constant to obtain range 9-40. | ||

+ | </ | ||

+ | |||

+ | ===== Conclusion ===== | ||

+ | |||

+ | We have seen how to extract bitfields from byte sequences stored in RAM, using a highly efficient technique that is capable of navigating arbitrary decision trees as part of the decoding process. |

base/decoding_bitstreams.txt ยท Last modified: 2015-07-20 11:28 by lft