User Tools

Site Tools



This shows you the differences between two versions of the page.

Link to this comparison view

magazines:chacking17 [2015-04-17 04:34] (current)
Line 1: Line 1:
 +                   ########​
 +             ##################​
 +         ###### ​           ######
 +      #####
 +    #####  ####  ####      ##      #####   #### ​ ####  ####  ####  ####   #####
 +  #####    ##    ##      ####    ##   ## ​  ## ​ ###     ## ​   ####  ##   ## ​  ##
 + ##### ​   ######## ​    ## ​ ##   ## ​       #####       ## ​   ## ## ##   ##
 +#####    ##    ##    ######## ​ ##   ## ​  ## ​ ###     ## ​   ##  ####   ## ​  ##
 +#####  ####  ####  ####  ####  #####   #### ​ ####  ####  ####  ####   ######​
 +#####                                                                    ##
 + ###### ​           ###### ​          Issue #17
 +   ################## ​           November 15, 1998
 +       ########​
 + "The words of the wise are as goads, and as nails fastened by
 + the masters of assemblies." ​ Ecclesiastes 12
 + "​Before criticizing a man I try to walk a mile in his shoes. ​ That
 + way, if he gets mad he's a mile away and barefoot." ​ John Ianetta
 + For me, fall is a time for reflection. ​ The trees descend into their 
 +golden time and then seem to die.  And yet, under the surface they are quite 
 +alive, and teeming with activity at a smaller, less-visible scale, waiting to 
 +burst forwards again in full bloom. ​ I think there'​s a great metaphor for C= 
 +in this.  But I have no idea what it is.
 + In fact things are totally hectic around here, and I haven'​t given
 +more than a few moments thought towards the 64, so this will be a mighty
 +short editorial. ​ Between a PhD thesis and begging for jobs there hasn't been
 +much 64-time, but with a little here and a little there this issue is finally
 +squeaking out.  Everybody worked hard over the summer, and my goal was
 +to get it out in September. ​ Well, you know, these days, if something in
 +the 64 world is only two months late it's doing pretty good, so no big whoop.
 +C=Hacking ought to appear reguarly after December, though.
 + On the down side, some things, such as the challenge problem from
 +last time, will have to wait until the next issue. ​ I also stayed pretty
 +low-key while putting this issue together, but future issues will be more 
 +public in soliciting articles (e.g. on comp.sys.cbm).
 + In other news, a C64 C-compiler has finally appeared! ​ This outstanding
 +effort is the work of Ullrich von Bassewitz,,​ the force 
 +behind Elite128 among other things. ​ The cc65 webpage is at
 +        http://​​uz/​cc65/​
 +so have a look, and tell Ullrich what a stellar guy he is :).  Also, as
 +most of you know, the Chicago Expo took place on October 24, and was a real
 +hoot!  Check out 
 + http://​​
 +for some nice pictures from the Expo, taken by Mark Seelye.
 + Meanwhile, sit back, relax, and enjoy these latest musings from 
 +these, our masters of assembly.
 +.                                    C=H #17
 +:::::::::::::::::::::::::::::::::::​ Contents ::::::::::::::::::::::::::::::::::​
 + o Voluminous ruminations from your unfettered editor.
 + o Is it a bug or a feature?
 +The C=Hallenge
 + o To be continued...
 +Side Hacking
 + o "​SuperCPU Software Repair",​ by S. Judd <​>​.
 +    An amateur'​s excursion into correcting errant wares.
 +Main Articles
 + o "An Optimizing Hybrid LZ77 RLE Data Compression Program, aka
 +   Improving Compression Ratio for Low-Resource Decompression",​
 +   by Pasi '​Albert'​ Ojala <​>​
 +   Part two of a two-part article on data compression,​ giving a
 +   detailed description of the compression algorithms used in 
 +   pucrunch, not to mention the decompression code.
 + o "​VIC-20 Kernel ROM Disassembly Project",​ by Richard Cini
 +   <​>​
 +   This is the first in a series of articles which aims to
 +   present a complete, commented disassembly of the VIC-20
 +   ROMs.
 + o Masters Class: "​NTSC/​PAL fixing, part I", by Russell Reed 
 +   <​>,​ Robin Harbron <​>,​ and S. Judd.
 +   Sit up straight and pay attention. ​ In the Masters Class, a
 +   Commodore luminary attempts to instruct a couple of ignorant
 +   plebians in his art.  In this case, Robin and I set out to
 +   learn NTSC/PAL fixing from one of the greats, Decomp/​Style.
 +   Our first fix, a demo from the obscure Finnish group Pu-239,
 +   is included, along with detailed descriptions of our experiences.
 + o "The Herd Mentality",​ by Bil Herd <​>​
 +   This is a collection of entertaining musings on Commodore and the 
 +   development of the C128, as provided by Bil Herd (and that's no
 +   bull). ​ If you don't know who Bil Herd _is_, why not type
 +   SYS 32800,​123,​45,​6 on a 128 sometime...
 +   ​
 +.................................. Credits ...................................
 +Editor, The Big Kahuna, The Car'​a'​carn..... Stephen L. Judd
 +C=Hacking logo by.......................... Mark Lawrence
 +Special thanks to Marko Makela, Olaf Seibert, and the rest of the cbm-hackers
 +for their many otherwise unacknowledged contributions.
 +Legal disclaimer:
 + 1) If you screw it up it's your own damn fault!  ​
 + 2) If you use someone'​s stuff without permission you're a dork!
 +For information on the mailing list, ftp and web sites, send some email
 +About the authors:
 +Pasi '​Albert'​ Ojala is a 29 year old software engineer, currently
 +working at a VLSI design company on a RISC DSP core C compiler.
 +Around 1984 a friend introduced him to the VIC-20, and a couple
 +of years later he bought a 64+1541 to replace a broken Spectrum48K.
 +He began writing his own BBS, using ML routines for speed, and
 +later wrote a series of demos under the Pu-239 label. ​ In addition
 +to pucrunch and his many C=Hacking articles, Pasi was written an Amiga
 +1581 filesystem, a graphics conversion package, a C64 burstloader,​ and a 
 +number of demos. ​ He is currently uses his 64 for hobbyist pursuits, and
 +is contemplating multipart demos for the 64 and the VIC-20, in addition
 +to future C=Hacking articles. ​ Pasi is also a huge Babylon-5 fan, and
 +has a B5 quote page at http://​​~albert/​Quotes/​B5-quotes.html
 +Richard Cini is a 31 year old senior loan underwriter who first became
 +involved with Commodore 8-bits in 1981, when his parents bought him a
 +VIC-20 as a birthday present. ​ Mostly he used it for general BASIC
 +programming,​ with some ML later on, for projects such as controlling
 +the lawn sprinkler system, and for a text-to-speech synthesyzer. ​ All 
 +his CBM stuff is packed up right now, along with his other "​classic" ​
 +computers, including a PDP11/34 and a KIM-1. ​ In addition to collecting
 +old computers Richard enjoys gardening, golf, and recently has gotten
 +interested in robotics. ​ As to the C= community, he feels that it
 +is unique in being fiercely loyal without being evangelical,​ unlike
 +some other communities,​ while being extremely creative in making the 
 +best use out of the 64.
 +Robin Harbron is a 26 year old internet tech support at a local
 +independent phone company. ​ He first got involved with C= 8-bits
 +in 1980, playing with school PETs, and in 1983 his parents convinced
 +him to spend the extra money on a C64 instead of getting a VIC-20.
 +Like most of us he played a lot of games, typed in games out of
 +magazines, and tried to write his own games. ​ Now he writes demos,
 +dabbles with Internet stuff, writes C= magazine articles, and, yes,
 +plays games. ​ He is currently working on a few demos and a few games,
 +as well as the "​in-progress-but-sometimes-stalled-for-a-real-long-time-
 +until-inspiration-hits-again Internet stuff"​. ​ He is also working on
 +raising a family, and enjoys music (particularly playing bass and guitar), ​
 +church, ice hockey and cricket, and classic video games.
 +................................... Jiffies ..................................
 +by the cbm.hackers
 + Everybody knows the old REM shift-L trick in BASIC 2.0, which
 +generates a syntax error upon listing. ​ But why does it work?  The
 +answer turns out to be quite interesting.
 + Normally, when the BASIC interpreter tokenizes a line it strips ​
 +out characters with the high bit set.  One exception is characters within
 +quotes; the other is characters after REM.  In those cases, characters
 +are embedded literally into the program line.
 + Now, BASIC tokens all have the high bit set.  When LIST
 +encounters a character with the high bit set, it checks whether
 +it is in quote mode.  If it is, the character is outputted as normal.
 +If not, the character is treated as a token, which is expanded by
 +QPLOP (located at $A717). ​ The part of QPLOP which prints keywords looks 
 +like this:
 +:​LOOP1 ​  ​DEX ​             ;Traverse the keyword table
 +         BEQ :PLOOP
 +:​LOOP2 ​  ​INY ​             ;read a keyword
 +         LDA RESLST,Y
 +         BPL :LOOP2
 +         BMI :LOOP1
 +:​PLOOP ​  ​INY ​             ;Print out the keyword
 +         LDA RESLST,Y
 +         BMI LISTENT1 ​    ;Exit if on last char
 +         JSR $AB47        ;Print char, AND #$FF
 +         BNE :PLOOP
 +The keyword strings in RESLST all stored dextral character inverted
 +(the last char has the high bit set), and the above code just moves 
 +forward through the table until it has counted out .X keywords. ​ At
 +that point, :PLOOP prints out the next keyword to the screen, and
 +hops back into LIST.
 + Shift-L is character code 204, or $CC.  When LIST sees this
 +inside of a REM, it sees the high bit set and de-tokenizes it.
 +As it turns out, though, the last token is $CB, which is keyword GO
 +(so that GO TO works). ​ It also turns out that RESLST, the list of
 +BASIC keywords, uses 255 characters. ​ The 256th character is zero
 +(value zero, not character zero).
 + So, the above code goes through the list, and then prints
 +out token $CC, the first character of which is a null.  This zero
 +is sent to $AB47. ​ $AB47 sends it to JSR $FFD2 (which does nothing
 +with the character), performs an AND #$FF, and exits. ​ But that makes 
 +the BNE :PLOOP branch _not_ get taken, and the code erroneously moves 
 +forwards into... ​ the code which executes the FOR statement!
 + And the first thing FOR does is look for a valid variable.
 +When you LIST a program, the character immediately following the
 +LIST is a statement terminator (a colon : or else an end of line null).
 +LET (which is called by FOR) reads this character, decides it's an
 +invalid variable name, and generates a ?​SYNTAX ​ ERROR.
 + Because QPLOP uses absolute addressing (LDA RESLST,Y), .Y
 +can wrap around through 255 to traverse the list again. ​ This is
 +why shift-M shift-N etc. all generate valid keywords. ​ Only shift-L
 +will force an error, and it is all due to the zero in the keyword
 + Similar things can happen in other BASICs. ​ In BASIC 4.0,
 +token $DB does the trick. ​ In BASIC 1.0, $CB ought to do it.
 +The problem seems to have been fixed in BASIC 7.0; at least the
 +trick doesn'​t seem to work on a 128.
 + Finally, like most things on the 64, embedding tokens into
 +REM statements can naturally be put to some creative use.  For example,
 +RUN once ran a contest where readers submitted stories and riddles
 +and such, which were read by LISTing the program. ​ They were very clever
 +and entertaining,​ and I close this summary with the one I've remembered ​
 +since high school:
 +20 REM {C=-V}T A {INST CTRL-0}E
 +............................... The C=Hallenge ...............................
 +Wait until next time!
 +................................ Side Hacking ................................
 + SuperCPU software repair
 + --------------------------->​ by S. L. Judd
 + One of the feature articles in this issue deals with NTSC/PAL
 +fixing. ​ But have you ever thought about SCPU fixing? ​ You know how
 +it goes: you have that program that could really benefit from
 +the speed boost, but doesn'​t work, and usually because of some silly
 +little thing.
 + Well, it really bugs me to have programs not like my nice
 +hardware for dumb reasons, so I decided I would try my hand at fixing ​
 +up some programs. ​ The one that really did it for me was the game 
 +"Stunt Car Racer" -- I had never played it before, but after getting ​
 +ahold of it it was clear that here was a game that would be just great 
 +with a SuperCPU. ​ I had never done something like this before, but it 
 +seemed a doable problem and so I jumped in head first, and this article
 +sums up my inexpert experience to date.
 + By the way, stuntcar-scpu is totally cool :).
 + To date I have fixed up just three games: Stunt Car Racer, Rescue
 +on Fractalus, and Stellar 7.  My goal was really to "​CMD-fix"​ these programs,
 +to make them run off of my FD-2000 as well as my SCPU.  Although these are 
 +all games, the techniques should apply equally well to application programs
 +with a bad attitude. ​ Before discussing the fixes, it is probably worthwhile
 +to discuss a few generalities.
 + I also note that programmers who don't have a SuperCPU might find
 +some of this information helpful in designing their programs to work with
 + Finally, my fixes are available in the Fridge.
 +Tools and Process
 +The tools I used were:
 +  o Action Replay
 +  o Wits
 +  o Paper for taking notes (backs of receipts/​envelopes work)
 +I think this is all that is necessary, although a good sector editor
 +can come in handy for certain things.
 +After trying a number of different approaches to the problem, the process
 +I've settled on goes roughly like the following:
 + - Have an idea of what will need fixing
 + - Familiarize yourself with the program
 + - Track down the things that need fixing
 + - Figure out free areas of memory
 + - Apply patches, and test
 + Most programs work in more or less the same way: there are
 +some initialization routines, there'​s a main loop, and there'​s an 
 +interrupt routine or series of routines. ​ The interrupts are easy to 
 +find, via the vectors at either $FFFA or at $0314 and friends. ​ The 
 +initialization routine can be tougher, but can be deduced from 
 +the loader or decompressor;​ also, some programs point the NMI vector to 
 +the reset code, so that RESTORE restarts the program. ​ Finding the
 +things that need fixing usually involves freezing the program at the
 +appropriate time, and doing a little disassembly. ​ Sometimes a hunt for
 +things like LDA $DC01 is helpful, too.  Figuring out free areas of
 +memory is easy, by either getting a good feel for the program, or
 +filling some target memory with a fill byte and then checking it
 +later, to see if it was overwritten. ​ Once the patch works on the 64,
 +all that remains is to test it on the SCPU, and it's all done!
 + It seems to me that, at the fundamental level, the SCPU is different ​
 +from a stock machine in three basic ways: it is a 65816, it runs at 20MHz, ​
 +and it has hardware registers/​different configurations. ​ There are also 
 +some strange and mysterious problems that can arise.
 + All possible opcodes are defined on the 65816, which means that
 +"​illegal"​ or quasi-opcodes will not work correctly. ​ On the 65xx chips, the 
 +quasi-opcodes aren't real opcodes -- they are like little holes in the cpu, 
 +and things going through those holes fall through different parts of the 
 +normal opcode circuitry. ​ Although used by very few programs, a number of 
 +copy protection schemes make use of them, so sometimes the program works fine
 +with a SCPU but the copy protection makes it choke -- how very annoying ​
 +(example: EA's Lords of Conquest). ​ Naturally, disk-based protection methods ​
 +mean it won't work on an FD-2000, either.
 + Running at 20Mhz makes all sorts of problems. ​ Any kind of software
 +loop will run too fast -- delay loops, countdown loops, input busy-loops,
 +etc.  Also main program loops, so that the game runs unplayably fast
 +(most 3D games never had to worry about being too fast). ​ It can also
 +lead to flickering screens, as we shall see later, and the "​play"​ of some 
 +games is designed with 1Mhz in mind -- velocities, accelerations,​ etc.  ​
 +What looks smooth at the low frame rate might look poor at the high, as we 
 +shall also see later. ​ Finally, fastloaders invariably fail at 20Mhz,
 +like any other code using software-based timing.
 + The SuperCPU also has a series of configuration registers located
 +at $D07x and $D0Bx, which determine things like software speed and VIC
 +optimization modes (which areas of memory are mirrored/​copied to the C64'​s ​
 +RAM).  Note also that enabling hardware registers rearranges $E000 ROM 
 +routines. ​ Although it is possible for programs to accidentally reconfigure
 +the SCPU, it is awfully unlikely, since the enable register, which switches
 +the hardware registers in, is sandwiched between disable registers:
 + $D07D Hardware register disable
 + $D07E Enable hardware registers
 + $D07F Hardware register disable
 +Strangely enough, though, different hardware configurations can sometimes ​
 +cause problems. ​ For example, newer (v2) SCPUs allow selective mirroring of 
 +the stack and zero page, and by default have that mirroring turned OFF.  ​
 +For some totally unknown reason, this caused major problems with an early 
 +attempt of mine to fix Stunt Car Racer -- I am told that the old version ​
 +would slow down to just double-speed,​ flicker terribly, and more.  Turning ​
 +mirroring back on apparently fixes up the problem. ​ (I have an older SCPU, 
 +and hence did not have this problem). ​ So before going after a big fix, it 
 +is worthwhile to invest a few minutes in trying different configurations.
 + Finally, there are other strange problems that can arise. ​ For
 +example, I have two 128s: one is a flat 128, one a 128D.  With my 128D, 
 +if $D030 is set then the SCPU sometimes -- but not always -- freaks out 
 +and locks up.  The flat 128 does not have this problem. ​ One reason this 
 +is important is that many decompressors INC $D030 to enable 2MHz mode.
 +A simple BIT ($2C) fixes this problem up, but the point is that the SCPU has 
 +to interact with the computer, so perhaps that interaction can lead to 
 +problems in obscure cases.
 + Now, if the goal is to CMD-fix the program, there may be a few
 +disk-related things that may need fixing. ​ In addition to stripping out
 +(or possibly fixing up) any fastloaders,​ most programs annoyingly assume
 +drive #8 is the only drive in town.  Also, if the program uses a track-based ​
 +loader (instead of a file-based loader), then that will need to fixed up
 +as well, and any disk-based copy protection will have to be removed.
 + There'​s one other thing to consider, before you fix: is the
 +program really busted? ​ For example, if you've tried a chess program
 +with the SCPU, chances are that you saw no speed improvement. ​ Why
 +not?  It turns out that most chess programs use a timer-based search
 +algorithm -- changing the playing strength changes the amount of
 +time the program spends searching, and not the depth of the search.
 +(The reason is to make the gameplay flow a little better -- otherwise
 +you have very slow play at the beginning, when there are many more
 +moves to consider). ​ So although it might look like it isn't working
 +right with the SCPU, it is actually working quite well.
 + And that pretty much covers the basic ideas. ​ The first program
 +I fixed up was Stunt Car Racer.
 +Stunt Car Racer
 + Stunt Car Racer, in case you don't know, is a 3D driving game,
 +and quite fun.  It is also too fast, unplayably fast, at 20MHz. ​ Therefore, ​
 +it needs to be slowed down!  ​
 + My first plan... well, suffice to say that most of my original
 +plans were doomed to failure, either from being a bad idea, or from
 +poor implementation. ​ It is clear enough that some sort of delay is
 +needed, though, in the main loop, or perhaps by intercepting the joystick
 +reading routine.
 + The program has a main loop and an interrupt loop as well.
 +The interrupt handles the display and other things, and all of the
 +game calculations are done in the main loop, which flows like
 + Do some calculations
 + Draw buffer 1
 + Swap buffers
 + Do some calculations
 + Draw buffer 2
 + Swap buffers
 + JMP loop
 +One of my first thoughts was to intercept the joystick I/O, which is
 +easy to find by hunting for LDA $DC01 (or DC00, whichever joystick
 +is used). ​ The patch failed, and possibly because I didn't check that 
 +the memory was safe, and possibly because it was in the interrupt routine
 +(I simply don't remember).
 + Before patching, it is very important to make sure that the
 +patch will survive, and not interfere with the program, so it is
 +very important to find an area of memory that is not used by the
 +program. ​ It took me a little while to figure this out!  Finding
 +unused memory was pretty easy -- I just filled the suspect areas with
 +a fill byte, ran the program, and checked that memory. ​ Mapping out the 
 +memory areas also aids in saving the file, as un-needed areas don't
 +need to be saved, or can be cleared out to aid compression.
 + The first free area of memory I found was at $C000. ​ It turns
 +out that this is a sprite, though, and so put some garbage on the
 +screen. ​ The second I tried was $8000, which worked great in practice
 +mode but got overwritten in competition mode -- always test your
 +patches thoroughly! ​ (I had only tested in practice mode). ​ Finally,
 +I found a few little spots in low memory that survived, and placed the 
 +patch there. ​ The program does a whole lot of memory moving, and uses 
 +nearly all memory. ​ I also left some initialization code at $8000, since
 +it only needed to be run once, at the beginning (to turn on mirroring ​
 +in v2 SCPUs).
 + Recall that the main loop has two parts -- one for buffer 1, and 
 +one for buffer 2.  The trick is to find some code that is common to both 
 +sections, like a subroutine call:
 + JSR SomeRoutine
 + Draw buffer 1
 + JSR SomeRoutine
 + Draw buffer 2
 +The patch routine I used was a simple delay loop, redirected from those
 +two JSRs:
 + LDX #$FF
 + CPX $D012
 + BNE *-5
 + DEX
 + CPX #$FC
 + BNE *-10
 + JMP SomeRoutine
 +Of course, this will also slow the program down at 1Mhz; later on I became
 +smarter about my patches, but this one works pretty well.
 + To save the game and patches, I simply froze it from AR.  Just
 +saving from the monitor generally failed; the initialization routine
 +doesn'​t initialize all I/O settings. ​ Part of the freezing process
 +involves RLE compression,​ so if you freeze it is a good idea to
 +fill all unused portions of memory -- temporary areas, bitmaps, etc.
 +Another thing to do is to set a freeze point at the init routine,
 +and then JMP there from the monitor. ​ By clearing the screen, you
 +won't have to look at all the usual freezer ugliness, and at this
 +point freezing isn't any different than saving from the ML monitor
 +and RLE-packing the file.  Once saved, I tested a few times from the
 +64 side, to make sure things worked right.
 + Whether freezing or saving from the monitor, if the file size
 +is larger than 202 blocks or so, it can't be loaded on the SCPU without
 +a special loader -- unless you compress it first. ​ I naturally recommend
 +using pu-crunch for that purpose, but if you want to do it on the 64
 +then I recommend using ABCrunch, which works well with the SCPU and
 +gives about as good compression as you can get without an REU.
 + The result was stuntcar-scpu,​ which is *awfully* fun when fixed.
 +Rescue on Fractalus
 + Next on my list was Rescue on Fractalus, an older (and quite cool) 
 +Lucasfilm game that just didn't cut it in the 64 conversion, for a number
 +of reasons (that perhaps could have been avoided). ​ There are at least two
 +versions of the game, one of which doesn'​t even work on a 128 (good 'ol $D030),
 +but I have the older version, which does work.
 + With a SuperCPU, though, there are a number of problems. ​ The display
 +flickers terribly. ​ The gameplay is smooth and not at all too fast -- in fact,
 +it is too slow.  Specifically,​ the velocities and turning rates and such do
 +not give a convincing illusion of speed or excitement. ​ The game is copy-
 +protected and uses a track-based fastloader, loaded from disk via B-E, which
 +also saves the high scores to disk.  Clearly, this one is a bigger job: the 
 +display is too fast, the game constants need adjusting, and the highscore code
 +needs to be replaced by some kernal calls.
 + The structure of this code is a little different. ​ The main loop
 +handles the (double-buffered) display -- it does all the calculations and 
 +draws to the two buffers. ​ The multi-part interrupt loop does the rest -- 
 +it swaps buffers, changes the display in different parts of the screen,
 +reads the joystick, and performs the game updates which change your
 +position and direction. ​ It also handles enemies such as saucers, but
 +doesn'​t handle the bunkers which fire at you from the mountains (the main
 +loop takes care of those).
 + What does all this mean?  First, that the game can be a good ten
 +steps ahead of the screen, which makes things like targeting very
 +difficult. ​ Second, that the bunkers almost never fire at you at 1MHz
 +(they go crazy at 20).  Third, that things like velocity and turning
 +rate are rather low, because advancing or turning too quickly would
 +get the game way out of sync (unfortunately,​ they are still too fast
 +for 1MHz, making targeting difficult and movement clunky). ​ On the
 +other hand, having the movement in the interrupt is the reason that
 +the game does not become unplayably fast at 20MHz, and means that
 +something besides a delay loop is needed.
 + The interrupt swaps buffers, but the main loop draws them,
 +and because it draws so quickly it can start clearing and drawing to 
 +the visible buffer. ​ To make sure this was what I was seeing, I reversed
 +the buffer swap code in the interrupt, so that the drawing buffer was
 +always on-screen. ​ Sure enough, that's what the 20Mhz version looked
 + It turned out to be pretty easy to force the main loop to wait
 +on the interrupt. ​ Although I messed around (unsuccessfully) with
 +intercepting the interrupt loop, the buffer swap code actually
 +modifies a zero-page variable upon swapping. ​ So all the main loop
 +has to do is wait on that variable before charging ahead. ​ I may have
 +made it wait for two frames, because it made the game play a little
 + Now, how to find the velocity and turn code?  Well it takes
 +a keypress to change the velocity, so by hunting for LDA $DC01, and
 +tracing back, the routine can be found; at the very least the
 +affected variables may be found, and hunted for.  For example, if
 +the result is stored in $D0, then you can search for LDA $D0.  The
 +point is to locate the keypress processing code.  From there, a little
 +trial and error (setting freeze points and pressing the velocity key)
 +locates the piece of code which deals with changing the velocity, and 
 +in particular which variable corresponds to velocity. ​ Finally, from 
 +there it just takes another hunt for LDA velocity, ADC velocity, etc.
 +to figure out where the code for updating position and direction is.
 + In this case, I was pretty sure I had found it, as it went
 +something like
 + LDA velocity
 + LSR
 + LSR
 + ADC #$20
 +and this was added to the position. ​ To check that this was the code,
 +I just changed the ADC, or removed an LSR, to see that the speed changed.
 +The code for turning left and right and moving up and down was similar,
 +and again after a little trial and error it was clear what code did
 +what.  Again, it wasn't necessary to totally understand how these
 +routines worked exactly -- just the general idea of them, in this case
 +to see that a multiple of the velocity was used to change the position
 +and orientation of the player.
 + So, to fix it up, I just changed that multiple -- probably I
 +NOPed out an LSR above, to basically double the speed, and changed the 
 +turning rates similarly. ​ This took a little experimentation,​ as it
 +not only needed to be playably fast, but also couldn'​t overflow at
 +high speeds, etc.
 + But once that was working, all that remained was the highscore
 +table. ​ Finding the table location was pretty easy -- I just got a high
 +score, and while entering my name froze the program, and figured out
 +what got stored where. ​ From there it was pretty easy to figure out
 +what was saved to disk.  From the original loader, I also knew where
 +the highscores needed to be loaded to initially (the highscore table
 +gets copied around a lot -- it doesn'​t just stay at that one location).
 +Figuring out the exact number of bytes to save took a little bit of
 +effort (saving either too many or too few bytes screws it up), but
 +from there it was clear what memory needed to be saved.
 + So all that remained was to add the usual SETLFS etc. kernal
 +calls, right? ​ Wrong. ​ The program uses all the usual kernal variables
 +(from $90-$C0) for its own purposes. ​ Also recall that I wanted the
 +program to work with device 9, etc.  To get around this, I did two
 +things. ​ First, when the program first starts, I save some of the
 +relevant variables to an unused part of memory -- in particular, I
 +save the current drive number. ​ Second, before saving the highscore
 +file, I actually copy all zero page variables from $90-$C2 or so
 +to a temporary location, and then copy them back after saving.
 +That way there are no worries about altering important locations.
 + Finding memory for the load/save patch was easy -- I just used
 +the area which was previously used for the fastload load/save code.
 +There was enough for the necessary code as well as temporary space
 +for saving the zero page variables.
 + Finally, I changed some text from Rescue on Fractalus to
 +Behind Jaggi Lines, to distinguish it from the original, and that
 +was that.  Works great! ​ And is now more playable and challenging;​
 +in short, more the game it always should have been.
 +Stellar 7
 + Finally, I tried my hand at Stellar 7.  Stellar 7 had several
 +problems. ​ At the main screen, a delay loop tosses you to the mission
 +screen after a while, if no keys are pressed. ​ This is a software loop,
 +and so passes very quickly. ​ The game itself is too fast, so some sort
 +of delay is needed. ​ The mission display is also too fast, and has
 +software delay loops, so that needs fixing. ​ Finally, the game uses 
 +kernal calls for loading and saving, but is attached to drive #8;
 +also, my version was split into a bunch of files, and I wanted to
 +cut the number of files down.
 + Well, by this time it was all pretty straightforward. ​ From
 +the loader, it was easy to figure out which files went where. ​ The
 +mission and main displays were loaded in when needed, and swapped
 +into unused parts of memory when not, so I loaded them in and
 +adjusted the swap variable accordingly -- this left just the highscore
 +and seven level files.
 + Finding the delay loops was easy -- I just went to the relevant
 +sections of code, froze, and took a look at the loops. ​ There were your
 + LDA $D4 ;Check for keypress
 + BMI :key
 + DEX
 + DEY
 + DEC counter
 +:key LDX #$00
 + ...
 +Luckily, all routines were pretty much the same as the above. ​ The
 +interrupt routine is in the $0314 vector, and the same routine is
 +used during gameplay.  ​
 + So the patch is very easy at this point. ​ First, change the
 +IRQ code which does a JMP $EA7B to JMP $CE00
 +. CE00  $EE INC $CFFF
 +. CE03  $4C JMP $EA7B
 +To fix up the keypress routines, the idea is to change the LDA $D0
 +into a JSR patch. ​ How to substitute 3 bytes for 2 bytes? ​ The
 +trick is to place the LDX #$00 into the patch routine:
 +. CE06  $20 JSR $CE15 ;​Wait for $CFFF
 +. CE09  $A5 LDA $D4
 +. CE0B  $10 BPL $CE11
 +. CE0D  $A2 LDX #$00 ;If key pressed, then LDX #$00
 +. CE0F  $29 AND #$FF
 +. CE11  $60 RTS
 +The actual delay is accomplished by waiting on $CFFF:
 +. CE15  $AD LDA $CFFF
 +. CE18  $C9 CMP #$04
 +. CE1A  $90 BCC $CE15
 +. CE1C  $A9 LDA #$00
 +. CE1E  $8D STA $CFFF
 +. CE21  $60 RTS
 +As you can see, I waited a (default) of 4 frames. ​ The patch in the
 +game/​mission rendering routine works similarly -- I just patched
 +the rendering code to basically JSR $CE15. ​ I also decided to
 +try something new: let the user be able to change that CMP #$04
 +to make things faster or slower, to suit their tastes. ​ The keyscan
 +values were pretty easy to figure out, so this just required a little ​
 +patch to check for the "​+"​ and "​-"​ keys, and change location $CE19
 + Well, that about sums it up.  Perhaps if you do some fixing,
 +you might send me a little email describing your own experiences?​
 +.                                    C=H #17
 +An Optimizing Hybrid LZ77 RLE Data Compression Program, aka
 +Improving Compression Ratio for Low-Resource Decompression
 +by Pasi Ojala <​>​
 +Pucrunch is a Hybrid LZ77 and RLE compressor, uses an Elias Gamma Code for
 +lengths, mixture of Gamma Code and linear for LZ77 offset, and ranked RLE
 +bytes indexed by the same Gamma Code.  Uses no extra memory in decompression.
 +Since I started writing demos for the C64 in 1989 I have always wanted to
 +program a compression program. ​ I had a lot of ideas but never had the
 +time, urge or knowledge to create one.  In retrospect, most of the ideas I
 +had then were simply bogus ("​magic function theory"​ as Mark Nelson nicely
 +puts it).  But years passed, I gathered more knowledge and finally got an
 +irresistible urge to finally realize my dream.
 +The nice thing about the delay is that I don't need to write the actual
 +compression program to run on a C64 anymore. ​ I can write it in portable
 +ANSI-C code and just program it to create files that would uncompress
 +themselves when run on a C64.  Running the compression program outside of
 +the target system provides at least the following advantages.
 +  * I can use portable ANSI-C code.  The compression program can be
 +    compiled to run on a Unix box, Amiga, PC etc.  And I have all the
 +    tools to debug the program and gather profiling information to see
 +    why it is so slow :-)
 +  * The program runs much faster than on C64.  If it is still slow, there
 +    is always multitasking to allow me to do something else while I'm
 +    compressing a file.
 +  * There is '​enough'​ memory available. ​ I can use all the memory I
 +    possibly need and use every trick possible to increase the compression
 +    ratio as long as the decompression remains possible on a C64.
 +  * Large files can be compressed as easily as shorter files. ​ Most C64
 +    compressors can't handle files larger than around 52-54 kilobytes
 +    (210-220 disk blocks).
 +  * Cross-development is easier because you don't have to transfer a
 +    file into C64 just to compress it.
 +Memory Refresh and Terms for Compression
 +Statistical compression
 +       Uses the uneven probability distribution of the source symbols
 +       to shorten the average code length. ​ Huffman code and arithmetic
 +       code belong to this group. ​ By giving a short code to symbols
 +       ​occurring most often, the number of bits needed to represent
 +       the symbols decreases. ​ Think of the Morse code for example: ​ the
 +       ​characters you need more often have shorter codes and it takes
 +       less time to send the message.
 +Dictionary compression
 +       ​Replaces repeating strings in the source with shorter
 +       ​representations. ​ These may be indices to an actual dictionary
 +       ​(Lempel-Ziv 78) or pointers to previous occurrences (Lempel-Ziv
 +       ​77). ​ As long as it takes fewer bits to represent the reference
 +       than the string itself, we get compression. ​ LZ78 is a lot like
 +       the way BASIC substitutes tokens for keywords: ​ one-byte tokens
 +       ​expand to whole words like PRINT#​. ​ LZ77 replaces repeated
 +       ​strings with (length,​offset) pairs, thus the string VICIICI can
 +       be encoded as VICI(3,3) -- the repeated occurrence of the
 +       ​string ICI is replaced by a reference.
 +Run-length encoding
 +       ​Replaces repeating symbols with a single occurrence of the
 +       ​symbol and a repeat count. ​ For example assembly compilers have
 +       a .zero keyword or equivalent to fill a number of bytes with
 +       zero without needing to list them all in the source code.
 +Variable-length code
 +       Any code where the length of the code is not explicitly known
 +       but changes depending on the bit values. ​ Some kind of end
 +       ​marker or length count must be provided to make a code a prefix
 +       code (uniquely decodable). ​ Compare with ASCII (or Latin-1) text,
 +       where you know you get the next letter by reading a full byte from 
 +       the input. ​ A variable-length code requires you to read part of the 
 +       data to know how many bits to read next.
 +Universal codes
 +       ​Universal codes are used to encode integer numbers without the
 +       need to know the maximum value. ​ Smaller integer values usually
 +       get shorter codes. ​ Different universal codes are optimal for
 +       ​different distributions of the values. ​ Universal codes include
 +       Elias Gamma and Delta codes, Fibonacci code, and Golomb and
 +       Rice codes.
 +Lossless compression
 +       ​Lossless compression algorithms are able to exactly reproduce
 +       the original contents unlike lossy compression,​ which omits
 +       ​details that are not important or perceivable by human sensory
 +       ​system. ​ This article only talks about lossless compression.
 +My goal in the pucrunch project was to create a compression system in which
 +the decompressor would use minimal resources (both memory and processing
 +power) and still have the best possible compression ratio. ​ A nice bonus
 +would be if it outperformed every other compression program available.
 +These understandingly opposite requirements (minimal resources and good
 +compression ratio) rule out most of the state-of-the-art compression
 +algorithms and in effect only leave RLE and LZ77 to be considered. ​ Another
 +goal was to learn something about data compression and that goal at least
 +has been satisfied.
 +I started by developing a byte-aligned LZ77+RLE compressor/​decompressor and
 +then added a Huffman backend to it.  The Huffman tree took 384 bytes and
 +the code that decoded the tree into an internal representation took 100
 +bytes. ​ I found out that while the Huffman code gained about 8% in my
 +40-kilobyte test files, the gain was reduced to only about 3% after
 +accounting the extra code and the Huffman tree.
 +Then I started a more detailed analysis of the LZ77 offset and length
 +values and the RLE values and concluded that I would get better compression
 +by using a variable-length code.  I used a simple variable-length code and
 +scratched the Huffman backend code, as it didn't increase the compression
 +ratio anymore. ​ This version became pucrunch.
 +Pucrunch does not use byte-aligned data, and is a bit slower than the
 +byte-aligned version because of this, but is much faster than the original
 +version with the Huffman backend attached. ​ And pucrunch still does very
 +well compression-wise. ​ In fact, it does very well indeed, beating even
 +LhA, Zip, and GZip in some cases. ​ But let's not get too much ahead of
 +To get an improvement to the compression ratio for LZ77, we have only a few
 +options left.  We can improve on the encoding of literal bytes (bytes that
 +are not compressed),​ we can reduce the number of literal bytes we need to
 +encode, and shorten the encoding of RLE and LZ77.  In the algorithm
 +presented here all these improvement areas are addressed both collectively
 +(one change affects more than one area) and one at a time.
 + 1. By using a variable-length code we can gain compression for even
 +    2-byte LZ77 matches, which in turn reduces the number of literal
 +    bytes we need to encode. ​ Most LZ77-variants require 3-byte matches
 +    to get any compression because they use so many bits to identify
 +    the length and offset values, thus making the code longer than the
 +    original bytes would'​ve taken.
 + 2. By using a new literal byte tagging system which distinguishes
 +    uncompressed and compressed data efficiently we can reduce number
 +    of extra bits needed to make this distinction (the encoding
 +    overhead for literal bytes). ​ This is especially important for
 +    files that do not compress well.
 + 3. By using RLE in addition to LZ77 we can shorten the encoding for
 +    long byte run sequences and at the same time set a convenient
 +    upper limit to LZ77 match length. ​ The upper limit performs two
 +    functions:
 +       + we only need to encode integers in a specific range
 +       + we only need to search strings shorter than this limit (if we
 +         find a string long enough, we can stop there)
 +    Short byte runs are compressed either using RLE or LZ77, whichever
 +    gets the best results.
 + 4. By doing statistical compression (more frequent symbols get
 +    shorter representations) on the RLE bytes (in this case symbol
 +    ranking) we can gain compression for even 2-byte run lengths,
 +    which in turn reduces the number of literal bytes we need to
 +    encode.
 + 5. By carefully selecting which string matches and/or run lengths to
 +    use we can take advantage of the variable-length code.  It may be
 +    advantageous to compress a string as two shorter matches instead
 +    of one long match and a bunch of literal bytes, and it can be
 +    better to compress a string as a literal byte and a long match
 +    instead of two shorter matches.
 +This document consists of several parts, which are:
 +  * C64 Considerations - Some words about the target system
 +  * Escape codes - A new tagging system for literal bytes
 +  * File format - What are the primaries that are output
 +  * Graph search - How to squeeze every byte out of this method
 +  * String match - An evolution of how to speed up the LZ77 search
 +  * Some results on the target system files
 +  * Results on the Calgary Corpus Test Suite
 +  * The Decompression routine - 6510 code with commentary
 +Commodore 64 Considerations
 +Our target environment (Commodore 64) imposes some restrictions which we
 +have to take into consideration when designing the ideal compression
 +system. ​ A system with a 1-MHz 3-register 8-bit processor and 64 kilobytes
 +of memory certainly imposes a great challenge, and thus also a great sense
 +of achievement for good results.
 +First, we would like it to be able to decompress as big a program as
 +possible. ​ This in turn requires that the decompression code is located in
 +low memory (most programs that we want to compress start at address 2049)
 +and is as short as possible. ​ Also, the decompression code must not use any
 +extra memory or only very small amounts of it.  Extra care must be taken to
 +make certain that the compressed data is not overwritten during the
 +decompression before it has been read.
 +Secondly, my number one personal requirement is that the basic end address
 +must be correctly set by the decompressor so that the program can be
 +optionally saved in uncompressed form after decompression (although the
 +current decompression code requires that you say "​clr"​ before saving).
 +This also requires that the decompression code is system-friendly,​ i.e.
 +does not change KERNAL or BASIC variables or other structures. ​ Also, the
 +decompressor shouldn'​t rely on file size or load end address pointers,
 +because these may be corrupted by e.g.  X-modem file transfer protocol
 +(padding bytes may be added).
 +When these requirements are combined, there is not much selection in where
 +in the memory we can put the decompression code.  There are some locations
 +among the first 256 addresses (zeropage) that can be used, the (currently)
 +unused part of the processor stack (0x100..0x1ff),​ the system input buffer
 +(0x200..0x258) and the tape I/O buffer plus some unused bytes (0x334-0x3ff).
 +The screen memory (0x400..0x7ff) can also be used if necessary. ​ If we can 
 +do without the screen memory and the tape buffer, we can potentially ​
 +decompress files that are located from 0x258 to 0xffff.
 +The third major requirement is that the decompression should be relatively
 +fast.  After 10 seconds the user begins to wonder if the program has crashed ​
 +or if it is doing anything, even if there is some feedback like border color
 +flashing. ​ This means that the arithmetic used should be mostly 8- or 9-bit
 +(instead of full 16 bits) and there should be very little of it per each
 +decompressed byte.  Processor- and memory-intensive algorithms like
 +arithmetic coding and prediction by partial matching (PPM) are pretty much
 +out of the question, and that is saying it mildly. ​ LZ77 seems the only
 +practical alternative. ​ Still, run-length encoding handles long byte runs
 +better than LZ77 and can have a bigger length limit. ​ If we can easily
 +incorporate RLE and LZ77 into the same algorithm, we should get the best
 +features from both.
 +A part of the decompressor efficiency depends on the format of the
 +compressed data.  Byte-aligned codes, where everything is aligned into byte
 +boundaries, can be accessed very quickly; non-byte-aligned variable length
 +codes are much slower to handle, but provide better compression. ​ Note that
 +byte-aligned codes can still have other data sizes than 8.  For example you
 +can use 4 bits for LZ77 length and 12 bits for LZ77 offset, which preserves
 +the byte alignment.
 +The New Tagging System
 +I call the different types of information my compression algorithm outputs
 +primaries. ​ The primaries in this compression algorithm are:
 +  * literal (uncompressed) bytes and escape sequences,
 +  * LZ77 (length,​offset)-pairs,​
 +  * RLE (length,​byte)-pairs,​ and
 +  * EOF (end of file marker).
 +Literal bytes are those bytes that cannot be represented by shorter codes, ​
 +unlike a part of previously seen data (LZ77), or a part of a longer
 +sequence of the same byte (RLE).
 +Most compression programs handle the selection between compressed data and
 +literal bytes in a straightforward way by using a prefix bit.  If the bit
 +is 0, the following data is a literal byte (uncompressed). ​ If the bit is
 +1, the following data is compressed. ​ However, this presents the problem
 +that non-compressible data will be expanded from the original 8 bits to 9
 +bits per byte, i.e. by 12.5 %.  If the data isn't very compressible,​ this
 +overhead consumes all the little savings you may have had using LZ77 or
 +Some other data compression algorithms use a value (using variable-length
 +code) that indicates the number of literal bytes that follow, but this is
 +really analogous to a prefix bit, because 1-byte uncompressed data is very
 +common for modestly compressible files. ​ So, using a prefix bit may seem
 +like a good idea, but we may be able to do even better. ​ Let's see what we
 +can come up with.  My idea was to somehow use the data itself to mark
 +compressed and uncompressed data and thus not need any prefix bits.
 +Let's assume that 75% of the symbols generated are literal bytes. ​ In this
 +case it seems viable to allocate shorter codes for literal bytes, because
 +they are more common than compressed data.  This distribution (75% are
 +literal bytes) suggests that we should use 2 bits to determine whether the
 +data is compressed or a literal byte.  One of the combinations indicates
 +compressed data, and three of the combinations indicate a literal byte.  At
 +the same time those three values divide the literal bytes into three
 +distinct groups. ​ But how do we make the connection between which of the
 +three bit patters we have and what are the literal byte values?
 +The simplest way is to use a direct mapping. ​ We use two bits (let them be
 +the two most-significant bits) _from the literal bytes themselves_ to
 +indicate compressed data.  This way no actual prefix bits are needed. ​ We
 +maintain an escape code (which doesn'​t need to be static), which is
 +compared to the bits, and if they match, compressed data follows. ​ If the
 +bits do not match, the rest of the literal byte follows. ​ In this way the
 +literal bytes do not expand at all if their most significant bits do not
 +match the escape code, and fewer bits are needed to represent the literal
 +Whenever those bits in a literal byte would match the escape code, an
 +escape sequence is generated. ​ Otherwise we could not represent those
 +literal bytes which actually start like the escape code (the top bits
 +match). ​ This escape sequence contains the offending data and a new escape
 +code.  This escape sequence looks like
 +     # of escape bits    (escape code)
 +     ​3 ​                  ​(escape mode select)
 +     # of escape bits    (new escape bits)
 +     8-# of escape bits  (rest of the byte)
 +     = 8 + 3 + # of escape bits
 +     = 13 for 2-bit escapes, i.e. expands the literal byte by 5 bits.
 +Read further to see how we can take advantage of the changing escape code.
 +You may also remember that in the run-length encoding presented in the
 +previous article two successive equal bytes are used to indicate compressed
 +data (escape condition) and all other bytes are literal bytes. ​ A similar
 +technique is used in some C64 packers (RLE) and crunchers (LZ77), the only
 +difference is that the escape condition is indicated by a fixed byte value.
 +My tag system is in fact an extension to this.  Instead of a full byte, I
 +use only a few bits.
 +We assumed an even distribution of the values and two escape bits, so 1/4
 +of the values have the same two most significant bits as the escape code.
 +I call this probability that a literal byte has to be escaped the "hit rate".
 +Thus, literal bytes expand in average 25% of the time by 5 bits, making the
 +average length 25% * 13 + 75% * 8 = 9.25.  Not surprising, this is longer
 +than using one bit to tag the literal bytes. ​ However, there is one thing
 +we haven'​t considered yet.  The escape sequence has the possibility to
 +change the escape code.  Using this feature to its optimum (escape
 +optimization),​ the average 25% hit rate becomes the -maximum- hit rate.
 +Also, because the distribution of the literal byte values is seldom flat 
 +(some values are more common than others) and there is locality (different ​
 +parts of the file only contain some of the possible values), from which we 
 +can also benefit, the actual hit rate is always much smaller than that.  ​
 +Empirical studies on some test files show that for 2-bit escape codes the 
 +actual realized hit rate is only 1.8-6.4%, while the theoretical maximum ​
 +is the already mentioned 25%.
 +Previously we assumed the distribution of 75% of literal bytes and 25% of
 +compressed data (other primaries). ​ This prompted us to select 2 escape
 +bits.  For other distributions (differently compressible files, not
 +necessarily better or worse) some other number of escape bits may be more
 +suitable. ​ The compressor tries different number of escape bits and select
 +the value which gives the best overall results. ​ The following table
 +summarizes the hit rates on the test files for different number of escape
 +   ​1-bit ​  ​2-bit ​  ​3-bit ​  ​4-bit ​  File
 +   ​50.0% ​  ​25.0% ​  ​12.5% ​  ​6.250% ​ Maximum
 +   ​25.3% ​   2.5%    0.3%   ​0.090% ​ ivanova.bin
 +   ​26.5% ​   2.4%    0.8%   ​0.063% ​ sheridan.bin
 +   ​20.7% ​   1.8%    0.2%   ​0.041% ​ delenn.bin
 +   ​26.5% ​   6.4%    2.5%   ​0.712% ​ bs.bin
 +    9.06    8.32    8.15   ​8.050 ​  ​bits/​Byte for bs.bin
 +As can be seen from the table, the realized hit rates are dramatically
 +smaller than the theoretical maximum values. ​ A thought might occur that we
 +should always select 4-bit (or longer) escapes, because it reduces the hit
 +rate and presents the minimum overhead for literal bytes. ​ Unfortunately
 +increasing the number of escape bits also increases the code length of the
 +compressed data.  So, it is a matter of finding the optimum setting.
 +If there are very few literal bytes compared to other primaries, 1-bit
 +escape or no escape at all gives very short codes to compressed data, but
 +causes more literal bytes to be escaped, which means 4 bits extra for each
 +escaped byte (with 1-bit escapes). ​ If the majority of primaries are
 +literal bytes, for example a 6-bit escape code causes most of the literal
 +bytes to be output as 8-bit codes (no expansion), but makes the other
 +primaries 6 bits longer. ​ Currently the compressor automatically selects
 +the best number of escape bits, but this can be overridden by the user with
 +the -e option.
 +The cases in the example with 1-bit escape code validates the original
 +suggestion: ​ use a prefix bit.  A simple prefix bit would produce better
 +results on three of the previous test files (although only slightly). ​ For
 +delenn.bin (1 vs 0.828) the escape system works better. ​ On the other hand,
 +1-bit escape code is not selected for any of the files, because 2-bit
 +escape gives better overall results.
 +-Note:- for 7-bit ASCII text files, where the top bit is always 0 (like
 +most of the Calgary Corpus files), the hit rate is 0% for even 1-bit
 +escapes. ​ Thus, literal bytes do not expand at all.  This is equivalent to
 +using a prefix bit and 7-bit literals, but does not need separate algorithm
 +to detect and handle 7-bit literals.
 +For Calgary Corpus files the number of tag bits per primary (counting the
 +escape sequences and other overhead) ranges from as low as 0.46 (book1) to
 +1.07 (geo) and 1.09 (pic). ​ These two files (geo and pic) are the only ones
 +in the suite where a simple prefix bit would be better than the escape
 +system. ​ The average is 0.74 tag bits per primary.
 +In Canterbury Corpus the tag bits per primary ranges from 0.44
 +(plrabn12.txt) to 1.09 (ptt5), which is the only one above 0.85 (sum).
 +The average is 0.61 tag bits per primary.
 +Primaries Used for Compression
 +The compressor uses the previously described escape-bit system while
 +generating its output. ​ I call the different groups of bits that are
 +generated primaries, whether it is the correct term or not.  You are
 +welcome to suggest a better term for them.  The primaries in this
 +compression algorithm are:  literal byte (and escape sequence), LZ77
 +(length,​offset)-pair,​ RLE (length, byte)-pair, and EOF (end of file
 +If the top bits of a literal byte do not match the escape code, the byte is
 +output as-is. ​ If the bits match, an escape sequence is generated, with the
 +new escape code.  Other primaries start with the escape code.
 +The Elias Gamma Code is used extensively. ​ This code consists of two parts:
 +a unary code (a one-bit preceded by zero-bits) and a binary code part.  The
 +first part tells the decoder how many bits are used for the binary code
 +part.  Being a universal code, it produces shorter codes for small integers
 +and longer codes for larger integers. ​ Because we expect we need to encode
 +a lot of small integers (there are more short string matches and shorter
 +equal byte runs than long ones), this reduces the total number of bits
 +needed. ​ See the previous article for a more in-depth delve into
 +statistical compression and universal codes. ​ To understand this article,
 +you only need to keep in mind that small integer value equals short code.
 +The following discusses the encoding of the primaries.
 +The most frequent compressed data is LZ77.  The length of the match is
 +output in Elias Gamma code, with "​0"​ meaning the length of 2, "​100"​ length
 +of 3, "​101"​ length of 4 and so on.  If the length is not 2, a LZ77 offset
 +value follows. ​ This offset takes 9 to 22 bits.  If the length is 2, the
 +next bit defines whether this is LZ77 or RLE/​Escape. ​ If the bit is 0, an
 +8-bit LZ77 offset value follows. ​ (Note that this restricts the offset for
 +2-byte matches to 1..256.) If the bit is 1, the next bit decides between
 +escape (0) and RLE (1).
 +The code for an escape sequence is thus,​ where E is the
 +byte, and N is the new escape code.  Example:
 +  * We are using 2-bit escapes
 +  * The current escape code is "​11"​
 +  * We need to encode a byte 0xca == 0b11001010
 +  * The escape code and the byte high bits match (both are "​11"​)
 +  * We output the current escape code "​11"​
 +  * We output the escaped identification "​010"​
 +  * We output the new escape bits, for example "​10"​ (depends on escape
 +    optimization)
 +  * We output the rest of the escaped byte "​001010"​
 +  * So, we have output the string "​1101010001010"​
 +When the decompressor receives this string of bits, it finds that the first
 +two bits match with the escape code, it finds the escape identification
 +("​010"​) and then gets the new escape, the rest of the original byte and
 +combines it with the old escape code to get a whole byte.
 +The end of file condition is encoded to the LZ77 offset and the RLE is
 +subdivided into long and short versions. ​ Read further, and you get a
 +better idea about why this kind of encoding is selected.
 +When I studied the distribution of the length values (LZ77 and short RLE
 +lengths), I noticed that the smaller the value, the more occurrences.
 +The following table shows an example of length value distribution.
 +         ​LZLEN ​  S-RLE
 +  2       ​1975 ​    477
 +  3-4     ​1480 ​    330
 +  5-8      492     166
 +  9-16     ​125 ​     57
 +  17-32     ​31 ​     33
 +  33-64      8      15
 +The first column gives a range of values. ​ The first entry has a single
 +value (2), the second two values (3 and 4), and so on.  The second column
 +shows how many times the different LZ77 match lengths are used, the last
 +column shows how many times short RLE lengths are used.  The distribution
 +of the values gives a hint of how to most efficiently encode the values.
 +We can see from the table for example that values 2-4 are used 3455 times,
 +while values 5-64 are used only 656 times. ​ The more common values need to
 +get shorter codes, while the less-used ones can be longer.
 +Because in each "​magnitude"​ there are approximately half as many values
 +than in the preceding one, it almost immediately occurred to me that the
 +optimal way to encode the length values (decremented by one) is:
 +  Value      Encoding ​      ​Range ​    ​Gained
 +  0000000 ​   not possible
 +  0000001 ​   0              1         -6 bits
 +  000001x ​   10x            2-3       -4 bits
 +  00001xx ​   110xx          4-7       -2 bits
 +  0001xxx ​   1110xxx ​       8-15      +0 bits
 +  001xxxx ​   11110xxxx ​     16-31     +2 bits
 +  01xxxxx ​   111110xxxxx ​   32-63     +4 bits
 +  1xxxxxx ​   111111xxxxxx ​  ​64-127 ​   +5 bits
 +The first column gives the binary code of the original value (with x
 +denoting 0 or 1, xx 0..3, xxx 0..7 and so on), the second column gives the
 +encoding of the value(s). ​ The third column lists the original value range
 +in decimal notation.
 +The last column summarizes the difference between this code and a 7-bit
 +binary code.  Using the previous encoding for the length distribution
 +presented reduces the number of bits used compared to a direct binary
 +representation considerably. ​ Later I found out that this encoding in fact
 +is Elias Gamma Code, only the assignment of 0- and 1-bits in the prefix is
 +reversed, and in this version the length is limited. ​ Currently the maximum
 +value is selectable between 64 and 256.
 +So, to recap, this version of the Gamma code can encode numbers from 1 to
 +255 (1 to 127 in the example). ​ LZ77 and RLE lengths that are used start
 +from 2, because that is the shortest length that gives us any compression.
 +These length values are first decremented by one, thus length 2 becomes
 +"​0",​ and for example length 64 becomes "​11111011111"​.
 +The distribution of the LZ77 offset values (pointer to a previous
 +occurrence of a string) is not at all similar to the length distribution.
 +Admittedly, the distribution isn't exactly flat, but it also isn't as
 +radical as the length value distribution either. ​ I decided to encode the
 +lower 8 bits (automatically selected or user-selectable between 8 and 12
 +bits in the current version) of the offset as-is (i.e.  binary code) and
 +the upper part with my version of the Elias Gamma Code.  However, 2-byte
 +matches always have an 8-bit offset value. ​ The reason for this is
 +discussed shortly.
 +Because the upper part can contain the value 0 (so that we can represent
 +offsets from 0 to 255 with a 8-bit lower part), and the code can't directly
 +represent zero, the upper part of the LZ77 offset is incremented by one
 +before encoding (unlike the length values which are decremented by one).
 +Also, one code is reserved for an end of file (EOF) symbol. ​ This restricts
 +the offset value somewhat, but the loss in compression is negligible.
 +With the previous encoding 2-byte LZ77 matches would only gain 4 bits (with
 +2-bit escapes) for each offset from 1 to 256, and 2 bits for each offset
 +from 257 to 768.  In the first case 9 bits would be used to represent the
 +offset (one bit for gamma code representing the high part 0, and 8 bits for
 +the low part of the offset), in the latter case 11 bits are used, because
 +each "​magnitude"​ of values in the Gamma code consumes two more bits than
 +the previous one.
 +The first case (offset 1..256) is much more frequent than the second case,
 +because it saves more bits, and also because the symbol source statistics
 +(whatever they are) guarantee 2-byte matches in recent history (much better
 +chance than for 3-byte matches, for example). ​ If we restrict the offset
 +for a 2-byte LZ77 match to 8 bits (1..256), we don't lose so much
 +compression at all, but instead we could shorten the code by one bit.  This
 +one bit comes from the fact that before we had to use one bit to make the
 +selection "8-bit or longer"​. ​ Because we only have "​8-bit"​ now, we don't
 +need that select bit anymore.
 +Or, we can use that select bit to a new purpose to select whether this code
 +really is LZ77 or something else.  Compared to the older encoding (which
 +I'm not detailing here, for clarity'​s sake.  This is already much too
 +complicated to follow, and only slightly easier to describe) the codes for
 +escape sequence, RLE and End of File are still the same length, but the
 +code for LZ77 has been shortened by one bit.  Because LZ77 is the most
 +frequently used primary, this presents a saving that more than compensates
 +for the loss of 2-byte LZ77 matches with offsets 257..768 (which we can no
 +longer represent, because we fixed the offset for 2-byte matches to use
 +exactly 8 bits).
 +Run length encoding is also a bit revised. ​ I found out that a lot of bits
 +can be gained by using the same length encoding for RLE as for LZ77.  On
 +the other hand, we still should be able to represent long repeat counts as
 +that's where RLE is most efficient. ​ I decided to split RLE into two modes:
 +  * short RLE for short (e.g. 2..128) equal byte strings
 +  * long RLE for long equal byte strings
 +The Long RLE selection is encoded into the Short RLE code.  Short RLE only
 +uses half of its coding space, i.e.  if the maximum value for the gamma
 +code is 127, short RLE uses only values 1..63. ​ Larger values switches the
 +decoder into Long RLE mode and more bits are read to complete the run
 +length value.
 +For further compression in RLE we rank all the used RLE bytes (the values
 +that are repeated in RLE) in the decreasing probability order. ​ The values
 +are put into a table, and only the table indices are output. ​ The indices
 +are also encoded using a variable length code (the same gamma code,
 +surprise..),​ which uses less bits for smaller integer values. ​ As there are
 +more RLE's with smaller indices, the average code length decreases. ​ In
 +decompression we simply get the gamma code value and then use the value as
 +an index into the table to get the value to repeat.
 +Instead of reserving full 256 bytes for the table we only put the top 31
 +RLE bytes into the table. ​ Normally this is enough. ​ If there happens to be
 +a byte run with a value not in the table we use a similar technique as for
 +the short/long RLE selection. ​ If the table index is larger than 31, it
 +means we don't have the value in the table. ​ We use the values 32..63 to
 +select the '​escaped'​ mode and simultaneously send the 5 most significant
 +bits of the value (there are 32 distinct values in the range 32..63). ​ The
 +rest 3 bits of the byte are sent separately.
 +If you are more than confused, forget everything I said in this chapter and
 +look at the decompression pseudo-code later in this article.
 +Graph Search - Selecting Primaries
 +In free-parse methods there are several ways to divide the file into parts,
 +each of which is equally valid but not necessary equally efficient in terms
 +of compression ratio.
 +     "​i just saw justin adjusting his sting"
 +     "​i just saw", (-9,4), "in ad", (-9,6), "g his", (-25,2), (-10,4)
 +     "​i just saw", (-9,4), "in ad", (-9,6), "g his ", (-10,5)
 +The latter two lines show how the sentence could be encoded using literal
 +bytes and (offset, length) pairs. ​ As you can see, we have two different
 +encodings for a single string and they are both valid, i.e.  they will
 +produce the same string after decompression. ​ This is what free-parse is:
 +there are several possible ways to divide the input into parts. ​ If we are
 +clever, we will of course select the encoding that produces the shortest
 +compressed version. ​ But how do we find this shortest version? ​ How does
 +the data compressor decide which primary to generate in each step?
 +The most efficient way the file can be divided is determined by a sort of a
 +graph-search algorithm, which finds the shortest possible route from the
 +start of the file to the end of the file.  Well, actually the algorithm
 +proceeds from the end of the file to the beginning for efficiency reasons,
 +but the result is the same anyway: ​ the path that minimizes the bits
 +emitted is determined and remembered. ​ If the parameters (number of escape
 +bits or the variable length codes or their parameters) are changed, the
 +graph search must be re-executed.
 +     "​i just saw justin adjusting his sting"
 +                 ​\___/ ​   \_____/ ​   \_|___/
 +                  13       ​15 ​       11 13
 +                                      \____/
 +                                       15
 +Think of the string as separate characters. ​ You can jump to the next
 +character by paying 8 bits to do so (not shown in the figure), unless the
 +top bits of the character match with the escape code (in which case you
 +need more bits to send the character "​escaped"​). ​ If the history buffer
 +contains a string that matches a string starting at the current character
 +you can jump over the string by paying as many bits as representing the
 +LZ77 (offset,​length)-pair takes (including escape bits), in this example
 +from 11 to 15 bits.  And the same applies if you have RLE starting at the
 +character. ​ Then you just find the least-expensive way to get from the
 +start of the file to the end and you have found the optimal encoding. ​ In
 +this case the last characters " sting" would be optimally encoded with
 +8(literal " ") + 15("​sting"​) = 23 instead of 11(" s") + 13("​ting"​) = 24
 +The algorithm can be written either cleverly or not-so. ​ We can take a real
 +short-cut compared to a full-blown graph search because we can/need to only
 +go forwards in the file:  we can simply start from the end!  Our accounting
 +information which is updated when we pass each location in the data
 +consists of three values:
 + 1. the minimum bits from this location to the end of file.
 + 2. the mode (literal, LZ77 or RLE) to use to get that minimum
 + 3. the "​jump"​ length for LZ77 and RLE
 +For each location we try to jump forward (to a location we already
 +processed) one location, LZ77 match length locations (if a match exists),
 +or RLE length locations (if equal bytes follow) and select the shortest
 +route, update the tables accordingly. ​ In addition, if we have a LZ77 or
 +RLE length of for example 18, we also check jumps 17, 16, 15, ...  This
 +gives a little extra compression. ​ Because we are doing the "tree traverse"​
 +starting from the "​leaves",​ we only need to visit/​process each location
 +once.  Nothing located after the current location can't change, so there is
 +never any need to update a location.
 +To be able to find the minimal path, the algorithm needs the length of the
 +RLE (the number of the identical bytes following) and the maximum LZ77
 +length/​offset (an identical string appearing earlier in the file) for each
 +byte/​location in the file.  This is the most time-consuming -and-
 +memory-consuming part of the compression. ​ I have used several methods to
 +make the search faster. ​ See String Match Speedup later in this
 +article. ​ Fortunately these searches can be done first, and the actual
 +optimization can use the cached values.
 +Then what is the rationale behind this optimization? ​ It works because you
 +are not forced to take every compression opportunity,​ but select the best
 +ones.  The compression community calls this "lazy coding"​ or "​non-greedy"​
 +selection. ​ You may want to emit a literal byte even if there is a 2-byte
 +LZ77 match, because in the next position in the file you may have a longer
 +match. ​ This is actually more complicated than that, but just take my word
 +for it that there is a difference. ​ Not a very big difference, and only
 +significant for variable-length code, but it is there and I was after every
 +last bit of compression,​ remember.
 +Note that the decision-making between primaries is quite simple if a
 +fixed-length code is used.  A one-step lookahead is enough to guarantee
 +optimal parsing. ​ If there is a more advantageous match in the next
 +location, we output a literal byte and that longer match instead of the
 +shorter match. ​ I don't have time or space here to go very deeply on that,
 +but the main reason is that in fixed-length code it doesn'​t matter whether
 +you represent a part of data as two matches of lengths 2 and 8 or as
 +matches of lengths 3 and 7 or as any other possible combination (if matches
 +of those lengths exist). ​ This is not true for a variable-length code
 +and/or a statistical compression backend. ​ Different match lengths and
 +offsets no longer generate equal-length codes.
 +Note also that most LZ77 compression algorithms need at least 3-byte match
 +to break even, i.e.  not expanding the data.  This is not surprising when
 +you stop to think about it.  To gain something from 2-byte matches you need
 +to encode the LZ77 match into 15 bits.  This is very little. ​ A generic
 +LZ77 compressor would use one bit to select between a literal and LZ77, 12
 +bits for moderate offset, and you have 2 bits left for match length. ​ I
 +imagine the rationale to exclude 2-byte matches also include "the potential
 +savings percentage for 2-byte matches is insignificant"​. ​ Pucrunch gets
 +around this by using the tag system and Elias Gamma Code, and does indeed
 +gain bits from even 2-byte matches.
 +After we have decided on what primaries to output, we still have to make
 +sure we get the best results from the literal tag system. ​ Escape
 +optimization handles this.  In this stage we know which parts of the data
 +are emitted as literal bytes and we can select the minimal path from the
 +first literal byte to the last in the same way we optimized the primaries.
 +Literal bytes that match the escape code generate an escape sequence, thus
 +using more bits than unescaped literal bytes, and we need to minimize these
 +For each literal byte there is a corresponding new escape code which
 +minimizes the path to the end of the file.  If the literal byte's high bits
 +match the current escape code, this new escape code is used next.  The
 +escape optimization routine proceeds from the end of the file to the
 +beginning like the graph search, but it proceeds linearly and is thus much
 +I already noted that the new literal byte tagging system exploits the
 +locality in the literal byte values. ​ If there is no correlation between
 +the bytes, the tagging system does not do well at all.  Most of the time,
 +however, the system works very well, performing 50% better than the
 +prefix-bit approach.
 +The escape optimization routine is currently very fast.  A little
 +algorithmic magic removed a lot of code from the original version. ​ A fast
 +escape optimization routine is quite advantageous,​ because the number of
 +escape bits can now vary from 0 (uncompressed bytes always escaped) to 8
 +and we need to run the routine again if we change the number of escape bits
 +used to select the optimal escape code changes.
 +Because escaped literal bytes actually expand the data, we need a safety
 +area, or otherwise the compressed data may get overwritten by the
 +decompressed data before we have used it.  Some extra bytes need to be
 +reserved for the end of file marker. ​ The compression routine finds out how
 +many bytes we need for safety buffer by keeping track of the difference
 +between input and output sizes while creating the compressed file.
 +            $1000 .. $2000
 +            |OOOOOOOO| ​          ​O=original file
 +        $801 ..
 +        |D|CCCCC| ​               C=compressed data (D=decompressor)
 + ​$f7.. ​     $1000     $2010
 + ​|D| ​           |CCCCC| ​         Before decompression starts
 +            ^    ^
 +            W    R               ​W=write pointer, R=read pointer
 +If the original file is located at $1000-$1fff,​ and the calculated safety
 +area is 16 bytes, the compressed version will be copied by the
 +decompression routine higher in memory so that the last byte is at $200f.
 +In this way, the minimum amount of other memory is overwritten by the
 +decompression. ​ If the safety are would exceed the top of memory, we need a
 +wrap buffer. ​ This is handled automatically by the compressor. ​ The read
 +pointer wraps from the end of memory to the wrap buffer, allowing the
 +original file to extend up to the end of the memory, all the way to $ffff.
 +You can get the compression program to tell you which memory areas it uses
 +by specifying the "​-s"​ option. ​ Normally the safety buffer needed is less
 +than a dozen bytes.
 +To sum things up, Pucrunch operates in several steps:
 + 1. Find RLE and LZ77 data, pre-select RLE byte table
 + 2. Graph search, i.e.  which primaries to use
 + 3. Primaries/​Literal bytes ratio decides how many escape bits to use
 + 4. Escape optimization,​ which escape codes to use
 + 5. Update RLE ranks and the RLE byte table
 + 6. Determine the safety area size and output the file.
 +String Match Speedup
 +To be able to select the most efficient combination of primaries we of
 +course first need to find out what kind of primaries are available for
 +selection. ​ If the file doesn'​t have repeated bytes, we can't use RLE.  If
 +the file doesn'​t have repeating byte strings, we can't use LZ77.  This
 +string matching is the most time-consuming operation in LZ77 compression
 +simply because of the amount of the comparison operations needed. ​ Any
 +improvement in the match algorithm can decrease the compression time
 +considerably. ​ Pucrunch is a living proof on that.
 +The RLE search is straightforward and fast:  loop from the current position
 +(P) forwards in the file counting each step until a different-valued byte
 +is found or the end of the file is reached. ​ This count can then be used as
 +the RLE byte count (if the graph search decides to use RLE).  The code can
 +also be optimized to initialize counts for all locations that belonged to
 +the RLE, because by definition there are only one-valued bytes in each one.
 +Let us mark the current file position by P.
 + ​unsigned char *a = indata + P, val = *a++;
 + int top = inlen - P;
 + int rlelen = 1;
 + /* Loop for the whole RLE */
 + ​while(rlelen<​top && *a++ == val)
 +     ​rlelen++;​
 + ​for(i=0;​i<​rlelen-1;​i++)
 +     ​rle[P+i] = rlelen-i;
 +With LZ77 we can't use the same technique as for RLE (i.e.  using the
 +information about current match to skip subsequent file locations to speed
 +up the search). ​ For LZ77 we need to find the longest possible, and
 +-nearest- possible, string that matches the bytes starting from the current
 +location. ​ The nearer the match, the less bits are needed to represent the
 +offset from the current position.
 +Naively, we could start comparing the strings starting from P-1 and P,
 +remembering the length of the matching part and then doing the same at P-2
 +and P, P-3 and P, ..  P-j and P (j is the maximum search offset). ​ The
 +longest match and its location (offset from the current position) are then
 +remembered and initialized. ​ If we find a match longer or equal than the
 +maximum length we can actually use, we can stop the search there. ​ (The
 +code used to represent the length values may have an upper limit.)
 +This may be the first implementation that comes to your (and my) mind, and
 +might not seem so bad at first. ​ In reality, it is a very slow way to do
 +the search: the -Brute Force- method. ​ It could take somewhere about (n^3)
 +byte compares to process a file of the length n (a mathematically inclined
 +person would probably give a better estimate). ​ However, using the already
 +determined RLE value to our advantage permits us to rule out the worst-case
 +projection, which happens when all bytes are the same value. ​ We only
 +search LZ77 matches if the current file position has shorter RLE sequence
 +than the maximum LZ77 copy length.
 +The first thing I did to improve the speed is to remember the position
 +where each byte has last been seen.  A simple 256-entry table handles that.
 +Using this table, the search can directly start from the first potential
 +match, and we don't need to search for it byte-by-byte anymore. ​ The table
 +is continually updated when we move toward to the end of the file.
 +That didn't give much of an improvement,​ but then I increased the table to
 +256*256 entries, making it possible to locate the latest occurrence of any
 +byte -pair- instead. ​ The table indexed with the byte values and the table
 +contents directly gives the position in file where these two bytes were
 +last seen.  Because the shortest possible string that would offer any
 +compression (for my encoding of LZ77) is two bytes long, this byte-pair
 +history is very suitable indeed. ​ Also, the first (shortest possible, i.e.
 +2-byte) match is found directly from the byte-pair history. ​ This gave a
 +moderate 30% decrease in compression time for one of my test files (from 28
 +minutes to 17 minutes on a 25 MHz 68030).
 +The second idea was to quickly discard the strings that had no chance of
 +being longer matches than the one already found. ​ A one-byte hash value
 +(sort of a checksum here, it is never used to index a hash table in this
 +algorithm, but I rather use "hash value" than "​checksum"​) is calculated
 +from each three bytes of data.  The values are calculated once and put into
 +a table, so we only need two memory fetches to know if two 3-byte strings
 +are different. ​ If the hash values are different, at least one of the data
 +bytes differ. ​ If the hash values are equal, we have to compare the
 +original bytes. ​ The hash values of the strategic positions of the strings
 +to compare are then ..  compared. ​ This strategic position is the location
 +two bytes earlier than the longest match so far.  If the hash values
 +differ, there is no chance that the match is longer than the current one.
 +It may be not even be as long, because one of the two earlier bytes may be
 +different. ​ If the hash values are equal, the brute-force byte-by-byte
 +compare has to be done.  However, the hash value check already discards a
 +huge number of candidates and more than generously pays back its own memory
 +references. ​ Using the hash values the compression time shortens by 50%
 +(from 17 minutes to 8 minutes).
 +Okay, the byte-pair table tells us where the latest occurrence of any byte
 +pair is located. ​ Still, for the latest occurrence before -that- one we
 +have to do a brute force search. ​ The next improvement was to use the
 +byte-pair table to generate a linked list of the byte pairs with the same
 +value. ​ In fact, this linked list can be trivially represented as a table,
 +using the same indexing as the file positions. ​ To locate the previous
 +occurrence of a 2-byte string starting at location P, look at backSkip[P].
 + /* Update the two-byte history & backSkip */
 + ​if(P+1<​inlen)
 + {
 +     int index = (indata[P]<<​8) | indata[P+1];​
 +     ​backSkip[P] = lastPair[index];​
 +     ​lastPair[index] = P+1;
 + }
 +Actually the values in the table are one bigger than the real table
 +indices. ​ This is because the values are of type unsigned short (can only
 +represent non-negative values), and I wanted zero to mean "not occurred"​.
 +This table makes the search of the next (previous) location to consider
 +much faster, because it is a single table reference. ​ The compression time
 +was reduced from 6 minutes to 1 minute 10 seconds. ​ Quite an improvement
 +from the original 28 minutes!
 +     ​backSkip[] ​  ​lastPair[]
 + ​___ ​ _______ ​ ____
 +    \/       ​\/ ​   \
 + ​...JOVE.....JOKA..JOKER
 +    ^        ^     ^
 +    |        |     |
 +    next     ​| ​    ​position
 +             ​current match (3)
 +    C        B     A
 +In this example we are looking at the string "​JOKER"​ at location A.  Using
 +the lastPair[] table (with the index "​JO",​ the byte values at the current
 +location A) we can jump directly to the latest match at B, which is "​JO",​ 2
 +bytes long.  The hash values for the string at B ("​JOK"​) and at A ("​JOK"​)
 +are compared. ​ Because they are equal, we have a potential longer match (3
 +bytes), and the strings "​JOKE.."​ and "​JOKA.."​ are compared. ​ A match of
 +length 3 is found (the 4th byte differs). ​ The backSkip[] table with the
 +index B gives the previous location where the 2-byte string "​JO"​ can be
 +found, i.e.  C.  The hash value for the strategic position of the string in
 +the current position A ("​OKE"​) is then compared to the hash value of the
 +corresponding position in the next potential match starting at C ("​OVE"​).
 +They don't match, so the string starting at C ("​JOVE.."​) can't include a
 +longer match than the current longest match at B.
 +There is also another trick that takes advantage of the already determined
 +RLE lengths. ​ If the RLE lengths for the positions to compare don't match,
 +we can directly skip to the next potential match. ​ Note that the RLE bytes
 +(the data bytes) are the same, and need not be compared, because the first
 +byte (two bytes) are always equal on both positions (our backSkip[] table
 +guarantees that). ​ The RLE length value can also be used to skip the start
 +of the strings when comparing them.
 +Another improvement to the search code made it dramatically faster than
 +before on highly redundant files (such as pic from the Calgary Corpus
 +Suite, which was the Achilles'​ heel until then). ​ Basically the new search
 +method just skips over the RLE part (if any) in the search position and then
 +checks if the located position has equal number (and value) of RLE bytes
 +before it.
 +backSkip[] ​     lastPair[]
 +     ​_____ ​ ________
 +          \/        \
 +       ​...AB.....A..ABCD ​   rle[p] # of A's, B is something else
 +          ^      ^  ^
 +          |      |  |
 +          i      p  p+rle[p]-1
 +The algorithm searches for a two-byte string which starts at p + rle[p]-1,
 +i.e.  the last rle byte ('​A'​) and the non-matching one ('​B'​). ​ When it
 +finds such location (simple lastPair[] or backSkip[] lookup), it checks if
 +the rle in the compare position (i-(rle[p]-1)) is long enough (i.e.  the
 +same number of A's before the B in both places). ​ If there are, the normal
 +hash value check is performed on the strings and if it succeeds, the
 +brute-force byte-compare is done.
 +The rationale behind this idea is quite simple. ​ There are dramatically
 +less matches for "​AB"​ than for "​AA",​ so we get a huge speedup with this
 +approach. ​ We are still guaranteed to find the most recent longest match
 +there is.
 +Note that a compression method similar to RLE can be realized using just
 +LZ77.  You just emit the first byte as a literal byte, and output a LZ77
 +code with offset 1 and the original RLE length minus 1.  You can thus
 +consider RLE as a special case, which offers tighter encoding of the
 +necessary information. ​ Also, as my LZ77 limits the copy size to 64/128/256
 +bytes, a RLE version providing lengths up to 32 kilobytes is a big
 +improvement,​ even if the code for it is somewhat longer.
 +The Decompression Routine
 +Any lossless compression program is totally useless unless there exists a
 +decompression program which takes in the compressed file and -- using only
 +that information -- generates the original file exactly. ​ In this case the
 +decompression program must run on C64's 6510 microprocessor,​ which had its
 +impact on the algorithm development also.  Regardless of the algorithm,
 +there are several requirements that the decompression code must satisfy:
 + 1. Correctness - the decompression must behave accurately to
 +    guarantee lossless decompression
 + 2. Memory usage - the less memory is used the better
 + 3. Speed - fast decompression is preferred to slower one
 +The latter two requirements can be and are complementary. ​ A somewhat
 +faster decompression for the same algorithm is possible if more memory can
 +be used (although in this case the difference is quite small). ​ In any case
 +the correctness of the result is the most important thing.
 +A short pseudo-code of the decompression algorithm follows before I go to
 +the actual C64 decompression code.
 +     copy the decompression code to low memory
 +     copy the compressed data forward in memory so that it isn't
 +       ​overwritten before we have read it (setup safety & wrap buffers)
 +     setup source and destination pointers
 +     ​initialize RLE byte code table, the number of escape bits etc.
 +     set initial escape code
 +     do forever
 +         get the number of escape bits "​bits"​
 +         if "​bits"​ do not match with the escape code
 +             read more bits to complete a byte and output it
 +         else
 +             get Elias Gamma Code "​value"​ and add 1 to it
 +             if "​value"​ is 2
 +                 get 1 bit
 +                 if bit is 0
 +                     it is 2-byte LZ77
 +                     get 8 bits for "​offset"​
 +                     copy 2 bytes from "​offset"​ bytes before current
 +                         ​output position into current output position
 +                 else
 +                     get 1 bit
 +                     if bit is 0
 +                         it is an escaped literal byte
 +                         get new escape code
 +                         get more bits to complete a byte with the
 +                             ​current escape code and output it
 +                         use the new escape code
 +                     else
 +                         it is RLE
 +                         get Elias Gamma Code "​length"​
 +                         if "​length"​ larger or equal than half the maximum
 +                             it is long RLE
 +                             get more bits to complete a byte "​lo"​
 +                             get Elias Gamma Code "​hi",​ subtract 1
 +                             ​combine "​lo"​ and "​hi"​ into "​length"​
 +                         endif
 +                         get Elias Gamma Code "​index"​
 +                         if "​index"​ is larger than 31
 +                             get 3 more bits to complete "​byte"​
 +                         else
 +                             get "​byte"​ from RLE byte code table from
 +                                 index "​index"​
 +                         endif
 +                         copy "​byte"​ to the output "​length"​ times
 +                     endif
 +                 endif
 +             else
 +                 it is LZ77
 +                 get Elias Gamma Code "​hi"​ and subtract 1 from it
 +                 if "​hi"​ is the maximum value - 1
 +                     end decompression and start program
 +                 endif
 +                 get 8..12 bits "​lo"​ (depending on settings)
 +                 ​combine "​hi"​ and "​lo"​ into "​offset"​
 +                 copy "​value"​ number of bytes from "​offset"​ bytes before
 +                     ​current output position into current output position
 +             endif
 +         endif
 +     end do
 +The following routine is the pucrunch decompression code.  The code runs on
 +the C64 or C128's C64-mode and a modified version is used for Vic20 and
 +C16/​Plus4. ​ It can be compiled by at least DASM V2.12.04. ​ Note that the
 +compressor automatically attaches this code to the packet and sets the
 +different parameters to match the compressed data.  I will insert
 +additional commentary between strategic code points in addition to the
 +comments that are already in the code.
 +Note that at this point it is only possible to make the decompression code
 +shorter by removing features. ​ At least I think that it is now so.  If I'm
 +wrong, feel free to point it out to me.  Tim Rogers <​>​
 +did manage to snip off 2 bytes, thanks! ​ However, there are some features
 +you may consider unnecessary. ​ The code can be shortened by:
 +  * No basic end address set: 8 bytes
 +  * No 2 MHz mode set/reset: 6 bytes
 +  * No wrap option: 12 bytes
 +Actually, if the wrap option is not used, the compressor automatically
 +selects the shorter decompression code (only for the C64 version).
 +        processor 6502
 +BASEND ​ EQU $2d         ; start of basic variables (updated at EOF)
 +LZPOS   EQU $2d         ; temporary, BASEND *MUST* *BE* *UPDATED* at EOF
 +bitstr ​ EQU $f7         ; Hint the value beforehand
 +xstore ​ EQU $c3         ; tape load temp
 +WRAPBUF EQU $004b       ; '​wrap'​ buffer, 22 bytes ($02a7 for 89 bytes)
 +        ORG $0801
 +        DC.B $0b,​8,​$ef,​0 ​       ; '239 SYS2061'​
 +        DC.B $9e,​$32,​$30,​$36
 +        DC.B $31,0,0,0
 +        sei             ; disable interrupts
 +        inc $d030       ; or "bit $d030" if 2MHz mode is not enabled
 +        inc 1           ; Select ALL-RAM configuration
 +        ldx #0          ;** parameter - # of overlap bytes-1 off $ffff
 +overlap lda $aaaa,​x ​    ;** parameter start of off-end bytes
 +        sta WRAPBUF,​x ​  ; Copy to wrap/safety buffer
 +        dex
 +        bpl overlap
 +        ldx #​block200-end-block200+1 ​   ; $54   ($59 max)
 +packlp ​ lda block200-1,​x
 +        sta block200--1,​x
 +        dex
 +        bne packlp
 +        ldx #​block-stack-end-block-stack+1 ​     ; $b3   ​(stack! ~$e8 max)
 +packlp2 lda block-stack-1,​x
 +        dc.b $9d                ; sta $nnnn,x
 +        dc.w block-stack--1 ​    ; (ZP addressing only addresses ZP!)
 +        dex
 +        bne packlp2
 +        ldy #$aa        ;** parameter SIZE high + 1 (max 255 extra bytes)
 +cploop ​ dex             ; ldx #$ff on the first round
 +        lda $aaaa,​x ​    ;** parameter DATAEND-0x100
 +        sta $ff00,​x ​    ;** parameter ORIG LEN-0x100+ reserved bytes
 +        txa             ;cpx #0
 +        bne cploop
 +        dec cploop+6
 +        dec cploop+3
 +        dey
 +        bne cploop
 +        jmp main
 +The first part of the code contains a sys command for the basic
 +interpreter,​ two loops that copy the decompression code to zeropage/​stack
 +($f7-$1aa) and to the system input buffer ($200-$253). ​ The latter code
 +segment contains byte, bit and Gamma Code input routines and the RLE byte
 +code table, the former code segment contains the rest.
 +This code also copies the compressed data forward in memory so that it
 +won't be overwritten by the decompressed data before we have had a change
 +to read it.  The decompression starts at the beginning and proceeds upwards
 +in both the compressed and decompressed data.  A safety area is calculated
 +by the compression routine. ​ It finds out how many bytes we need for
 +temporary data expansion, i.e.  for escaped bytes. ​ The wrap buffer is used
 +for files that extend upto the end of memory, and would otherwise overwrite
 +the compressed data with decompressed data before it has been read.
 +This code fragment is not used during the decompression itself. ​ In fact
 +the code will normally be overwritten when the actual decompression starts.
 +The very start of the next code block is located inside the zero page and
 +the rest fills the lowest portion of the microprocessor stack. ​ The zero
 +page is used to make the references to different variables shorter and
 +faster. ​ Also, the variables don't take extra code to initialize, because
 +they are copied with the same copy loop as the rest of the code.
 +#rorg $f7       ; $f7 - ~$1e0
 +bitstr ​ dc.b $80        ; ZP    $80 == Empty
 +esc     dc.b $00        ; ** parameter (saves a byte when here)
 +OUTPOS = *+1            ; ZP
 +putch   sta $aaaa       ; ** parameter
 +        inc OUTPOS ​     ; ZP
 +        bne 0$          ; Note: beq 0$; rts; 0$: inc OUTPOS+1;​rts would be
 +; $0100                 ; ​      ​faster,​ but 1 byte longer
 +        inc OUTPOS+1 ​   ; ZP
 +0$      rts
 +putch is the subroutine that is used to output the decompressed bytes. ​ In
 +this case the bytes are written to memory. ​ Because the subroutine call
 +itself takes 12 cycles (6 for jsr and another 6 for rts), and the routine
 +is called a lot of times during the decompression,​ the routine itself
 +should be as fast as possible. ​ This is achieved by removing the need to
 +save any registers. ​ This is done by using an absolute addressing mode
 +instead of indirect indexed or absolute indexed addressing (sta $aaaa
 +instead of sta ($zz),y or sta $aa00,​y). ​ With indexed addressing you would
 +need to save+clear+restore the index register value in the routine.
 +Further improvement in code size and execution speed is done by storing the
 +instruction that does the absolute addressing to zero page.  When the
 +memory address is incremented we can use zero-page addressing for it too.
 +On the other hand, the most time is spent in the bit input routine so
 +further optimization of this routine is not feasible.
 +newesc ​ ldy esc         ; remember the old code (top bits for escaped byte)
 +        ldx #2          ; ** PARAMETER
 +        jsr getchkf ​    ; get & save the new escape code
 +        sta esc
 +        tya             ; pre-set the bits
 +        ; Fall through and get the rest of the bits.
 +noesc   ldx #6          ; ** PARAMETER
 +        jsr getchkf
 +        jsr putch       ; output the escaped/​normal byte
 +        ; Fall through and check the escape bits again
 +main    ldy #0          ; Reset to a defined state
 +        tya             ; A = 0
 +        ldx #2          ; ** PARAMETER
 +        jsr getchkf ​    ; X=2 -> X=0
 +        cmp esc
 +        bne noesc       ; Not the escape code -> get the rest of the byte
 +        ; Fall through to packed code
 +The decompression code is first entered in main.  It first clears the
 +accumulator and the Y register and then gets the escape bits (if any are
 +used) from the input stream. ​ If they don't match with the current escape
 +code, we get more bits to complete a byte and then output the result. ​ If
 +the escape bits match, we have to do further checks to see what to do.
 +        jsr getval ​     ; X = 0
 +        sta xstore ​     ; save the length for a later time
 +        cmp #1          ; LEN == 2 ?
 +        bne lz77        ; LEN != 2      -> LZ77
 +        tya             ; A = 0
 +        jsr get1bit ​    ; X = 0
 +        lsr             ; bit -> C, A = 0
 +        bcc lz77-2 ​     ; A=0 -> LZPOS+1
 +        ;***FALL THRU***
 +We first get the Elias Gamma Code value (or actually my independently
 +developed version). ​ If it says the LZ77 match length is greater than 2, it
 +means a LZ77 code and we jump to the proper routine. ​ Remember that the
 +lengths are decremented before encoding, so the code value 1 means the
 +length is 2.  If the length is two, we get a bit to decide if we have LZ77
 +or something else.  We have to clear the accumulator,​ because get1bit does
 +not do that automatically.
 +If the bit we got (shifted to carry to clear the accumulator) was zero, it
 +is LZ77 with an 8-bit offset. ​ If the bit was one, we get another bit which
 +decides between RLE and an escaped byte.  A zero-bit means an escaped byte
 +and the routine that is called also changes the escape bits to a new value.
 +A one-bit means either a short or long RLE.
 +        ; e..e01
 +        jsr get1bit ​    ; X = 0
 +        lsr             ; bit -> C, A = 0
 +        bcc newesc ​     ; e..e010
 +        ;***FALL THRU***
 +        ; e..e011
 +srle    iny             ; Y is 1 bigger than MSB loops
 +        jsr getval ​     ; Y is 1, get len, X = 0
 +        sta xstore ​     ; Save length LSB
 +        cmp #64         ; ** PARAMETER 63-64 -> C clear, 64-64 -> C set..
 +        bcc chrcode ​    ; short RLE, get bytecode
 +longrle ldx #2          ; ** PARAMETER ​ 111111xxxxxx
 +        jsr getbits ​    ; get 3/2/1 more bits to get a full byte, X = 0
 +        sta xstore ​     ; Save length LSB
 +        jsr getval ​     ; length MSB, X = 0
 +        tay             ; Y is 1 bigger than MSB loops
 +The short RLE only uses half (or actually 1 value less than a half) of the
 +gamma code range. ​ Larger values switches us into long RLE mode.  Because
 +there are several values, we already know some bits of the length value.
 +Depending on the gamma code maximum value we need to get from one to three
 +bits more to assemble a full byte, which is then used as the less
 +significant part for the run length count. ​ The upper part is encoded using
 +the same gamma code we are using everywhere. ​ This limits the run length to
 +16 kilobytes for the smallest maximum value (-m5) and to the full 64
 +kilobytes for the largest value (-m7).
 +Additional compression for RLE is gained using a table for the 31
 +top-ranking RLE bytes. ​ We get an index from the input. ​ If it is from 1 to
 +31, we use it to index the table. ​ If the value is larger, the lower 5 bits
 +of the value gives us the 5 most significant bits of the byte to repeat.
 +In this case we read 3 additional bits to complete the byte.
 +chrcode jsr getval ​     ; Byte Code, X = 0
 +        tax             ; this is executed most of the time anyway
 +        lda table-1,​x ​  ; Saves one jump if done here (loses one txa)
 +        cpx #32         ; 31-32 -> C clear, 32-32 -> C set..
 +        bcc 1$          ; 1..31 -> the byte to repeat is in A
 +        ; Not ranks 1..31, -> 111110xxxxx (32..64), get byte..
 +        txa             ; get back the value (5 valid bits)
 +        jsr get3bit ​    ; get 3 more bits to get a full byte, X = 0
 +1$      ldx xstore ​     ; get length LSB
 +        inx             ; adjust for cpx#$ff;bne -> bne
 +dorle   jsr putch
 +        dex
 +        bne dorle       ; xstore 0..255 -> 1..256
 +        deym
 +        bne dorle       ; Y was 1 bigger than wanted originally
 +mainbeq beq main        ; reverse condition -> jump always
 +After deciding the repeat count and decoding the value to repeat we simply
 +have to output the value enough times. ​ The X register holds the lower part
 +and the Y register holds the upper part of the count. ​ The X register value
 +is first incremented by one to change the code sequence dex ; cpx #$ff ;
 +bne dorle into simply dex ; bne dorle. ​ This may seem strange, but it saves
 +one byte in the decompression code and two clock cycles for each byte that
 +is outputted. ​ It's almost a ten percent improvement. ​ :-)
 +The next code fragment is the LZ77 decode routine and it is used in the
 +file parts that do not have equal byte runs (and even in some that have).
 +The routine simply gets an offset value and copies a sequence of bytes from
 +the already decompressed portion to the current output position.
 +lz77    jsr getval ​     ; X=0 -> X=0
 +        cmp #127        ; ** PARAMETER ​ Clears carry (is maximum value)
 +        beq eof         ; EOF
 +        sbc #0          ; C is clear -> subtract 1  (1..126 -> 0..125)
 +        ldx #0          ; ** PARAMETER (more bits to get)
 +        jsr getchkf ​    ; clears Carry, X=0 -> X=0
 +lz77-2 ​ sta LZPOS+1 ​    ; offset MSB
 +        ldx #8
 +        jsr getbits ​    ; clears Carry, X=8 -> X=0
 +                        ; Note: Already eor:ed in the compressor..
 +        ;eor #255       ; offset LSB 2's complement -1 (i.e. -X = ~X+1)
 +        adc OUTPOS ​     ; -offset -1 + curpos (C is clear)
 +        sta LZPOS
 +        lda OUTPOS+1
 +        sbc LZPOS+1 ​    ; takes C into account
 +        sta LZPOS+1 ​    ; copy X+1 number of chars from LZPOS to OUTPOS
 +        ;ldy #0         ; Y was 0 originally, we don't change it
 +        ldx xstore ​     ; LZLEN
 +        inx             ; adjust for cpx#$ff;bne -> bne
 +lzloop ​ lda (LZPOS),y
 +        jsr putch
 +        iny             ; Y does not wrap because X=0..255 and Y initially 0
 +        dex
 +        bne lzloop ​     ; X loops, (256,​1..255)
 +        beq mainbeq ​    ; jump through another beq (-1 byte, +3 cycles)
 +There are two entry-points to the LZ77 decode routine. ​ The first one
 +(lz77) is for copy lengths bigger than 2.  The second entry point (lz77-2)
 +is for the length of 2 (8-bit offset value).
 +        ; EOF
 +eof     lda #$37        ; ** could be a PARAMETER
 +        sta 1
 +        dec $d030       ; or "bit $d030" if 2MHz mode is not enabled
 +        lda OUTPOS ​     ; Set the basic prg end address
 +        sta BASEND
 +        lda OUTPOS+1
 +        sta BASEND+1
 +        cli             ; ** could be a PARAMETER
 +        jmp $aaaa       ; ** PARAMETER
 +Some kind of a end of file marker is necessary for all variable-length
 +codes. ​ Otherwise we could not be certain when to stop decoding. ​ Sometimes
 +the byte count of the original file is used instead, but here a special EOF
 +condition is more convenient. ​ If the high part of a LZ77 offset is the
 +maximum gamma code value, we have reached the end of file and must stop
 +decoding. ​ The end of file code turns on BASIC and KERNEL, turns off 2 MHz
 +mode (for C128) and updates the basic end addresses before allowing
 +interrupts and jumping to the program start address.
 +The next code fragment is put into the system input buffer. ​ The routines
 +are for getting bits from the encoded message (getbits) and decoding the
 +Elias Gamma Code (getval). ​ The table at the end contains the ranked RLE
 +bytes. ​ The compressor automatically decreases the table size if not all of
 +the values are used.
 +#rorg   ​$200 ​   ; $200-$258
 +getnew ​ pha             ; 1 Byte/3 cycles
 +INPOS = *+1
 +        lda $aaaa       ;** parameter
 +        rol             ; Shift in C=1 (last bit marker)
 +        sta bitstr ​     ; bitstr initial value = $80 == empty
 +        inc INPOS       ; Does not change C!
 +        bne 0$
 +        inc INPOS+1 ​    ; Does not change C!
 +        bne 0$
 +        ; This code does not change C!
 +        lda #​WRAPBUF ​   ; Wrap from $ffff->​$0000 -> WRAPBUF
 +        sta INPOS
 +0$      pla             ; 1 Byte/4 cycles
 +        rts
 +; getval : Gets a '​static huffman coded' value
 +; ** Scratches X, returns the value in A **
 +getval ​ inx             ; X must be 0 when called!
 +        txa             ; set the top bit (value is 1..255)
 +0$      asl bitstr
 +        bne 1$
 +        jsr getnew
 +1$      bcc getchk ​     ; got 0-bit
 +        inx
 +        cpx #7          ; ** parameter
 +        bne 0$
 +        beq getchk ​     ; inverse condition -> jump always
 +; getbits: Gets X bits from the stream
 +; ** Scratches X, returns the value in A **
 +get1bit inx             ;2
 +getbits asl bitstr
 +        bne 1$
 +        jsr getnew
 +1$      rol             ;2
 +getchk ​ dex             ;​2 ​             more bits to get ?
 +getchkf bne getbits ​    ;2/3
 +        clc             ;​2 ​             return carry cleared
 +        rts             ;6+6
 +table   dc.b 0,​0,​0,​0,​0,​0,​0
 +        dc.b 0,​0,​0,​0,​0,​0,​0,​0
 +        dc.b 0,​0,​0,​0,​0,​0,​0,​0
 +        dc.b 0,​0,​0,​0,​0,​0,​0,​0
 +Target Application Compression Tests
 +The following data compression tests are made on my four C64 test files:
 +   ​bs.bin is a demo part, about 50% code and 50% graphics data
 +   ​delenn.bin is a BFLI picture with a viewer, a lot of dithering
 +   ​sheridan.bin is a BFLI picture with a viewer, dithering, black areas
 +   ​ivanova.bin is a BFLI picture with a viewer, dithering, larger black areas
 +   ​Packer ​                ​Size ​    ​Left ​   Comment
 +   ​===============================================
 +   ​bs.bin ​                41537
 +   ​-----------------------------------------------
 +   ​ByteBonker 1.5         ​27326 ​  ​65.8% ​   Mode 4
 +   ​Cruelcrunch 2.2        27136   ​65.3% ​   Mode 1
 +   The AB Cruncher ​       27020   65.1%
 +   ​ByteBoiler (REU)       ​26745 ​  64.4%
 +   RLE + ByteBoiler (REU) 26654   64.2%
 +   ​PuCrunch ​              ​26415 ​  ​63.6% ​   -m5
 +   ​===============================================
 +   ​delenn.bin ​            47105
 +   ​-----------------------------------------------
 +   The AB Cruncher ​       N/A       ​N/​A ​   Crashes
 +   ​ByteBonker 1.5         ​21029 ​  ​44.6% ​   Mode 3
 +   ​Cruelcrunch 2.2        20672   ​43.9% ​   Mode 1
 +   ​ByteBoiler (REU)       ​20371 ​  43.2%
 +   RLE + ByteBoiler (REU) 19838   42.1%
 +   ​PuCrunch ​              ​19734 ​  ​41.9% ​   -p2
 +   ​===============================================
 +   ​sheridan.bin ​          47105
 +   ​-----------------------------------------------
 +   ​ByteBonker 1.5         ​13661 ​  ​29.0% ​   Mode 3
 +   ​Cruelcrunch 2.2        13595   ​28.9% ​   Mode H
 +   The AB Cruncher ​       13534   28.7%
 +   ​ByteBoiler (REU)       ​13308 ​  28.3%
 +   ​PuCrunch ​              ​12526 ​  ​26.6% ​   -p2
 +   RLE + ByteBoiler (REU) 12478   26.5%
 +   ​===============================================
 +   ​ivanova.bin ​           47105
 +   ​-----------------------------------------------
 +   ​ByteBonker 1.5         ​11016 ​  ​23.4% ​   Mode 1
 +   ​Cruelcrunch 2.2        10883   ​23.1% ​   Mode H
 +   The AB Cruncher ​       10743   22.8%
 +   ​ByteBoiler (REU)       ​10550 ​  22.4%
 +   ​PuCrunch ​               9844   ​20.9% ​   -p2
 +   RLE + ByteBoiler (REU)  9813   20.8%
 +   ​LhA ​                    ​9543 ​  ​20.3% ​   Decompressor not included
 +   gzip -9                 ​9474 ​  ​20.1% ​   Decompressor not included
 +   ​-----------------------------------------------
 +Calgary Corpus Suite
 +The original compressor only allows files upto 63 kB.  To be able to
 +compare my algorithm to others I modified the compressor to allow bigger
 +files. ​ I then got some reference results using the Calgary Corpus test
 +Note that the decompression code is included in the compressed files,
 +although it is not valid for files over 63k (compressed or uncompressed
 +size). ​ About 34 bytes are decompression parameters, the rest (approx. ​ 300
 +bytes) is 6510 machine language. ​ Kolmogorov complexity, anyone ?:-)
 +To tell you the truth, the results surprised me, because the compression
 +algorithm -IS- developed for a very special case in mind.  It only has a
 +fixed code for LZ77/RLE lengths, not even a static one (fixed != static !=
 +adaptive)! ​ Also, it does not use arithmetic code (or Huffman) to compress
 +the literal bytes. ​ Because most of the big files are ASCII text, this
 +somewhat handicaps my compressor, although the new tagging system is very
 +happy with 7-bit ASCII input. ​ Also, decompression is relatively fast, and
 +uses no extra memory.
 +I'm getting relatively near LhA, and shorter than LhA for 8 files (300-byte
 +decompressor included!), and relatively near or shorter than LhA in other
 +cases if the decompressor is removed.
 +The table contains the file name (file), compression options (options), the
 +original file size (in) and the compressed file size (out) in bytes,
 +average number of bits used to encode one byte (b/B), remaining size
 +(ratio) and the reduction (gained), and the time used for compression. ​ For
 +comparison, the last three columns show the compressed sizes for LhA, Zip
 +and GZip (with the -9 option), respectively.
 +FreeBSD PentiumPro® 200MHz
 +Estimated decompression on a C64 (1MHz 6510) 6:47       ​LhA ​   Zip GZip-9
 +file   ​options ​   in     ​out ​ b/B  ratio gained time    out    out    out
 +bib    -p4    111261 ​  35457 2.55 31.87% 68.13% ​ 8.3  40740  35041  34900
 +book1  -p4    768771 ​ 318919 3.32 41.49% 58.51% 65.1 339074 313352 312281
 +book2  -p4    610856 ​ 208627 2.74 34.16% 65.84% 43.5 228442 206663 206158
 +geo    -p2    102400 ​  72812 5.69 71.11% 28.89% 11.4  68574  68471  68414
 +news   ​-p3 ​   377109 ​ 144566 3.07 38.34% 61.66% 15.2 155084 144817 144400
 +obj1   ​-m6 ​    ​21504 ​  10750 4.00 50.00% 50.00% ​ 0.1  10310  10300  10320
 +obj2          246814 ​  83046 2.70 33.65% 66.35% 13.5  84981  81608  81087
 +paper1 -p2     ​53161 ​  19536 2.94 36.75% 63.25% ​ 1.5  19676  18552  18543
 +paper2 -p3     ​82199 ​  30676 2.99 37.32% 62.68% ​ 4.3  32096  29728  29667
 +paper3 -p2     ​46526 ​  19234 3.31 41.35% 58.65% ​ 1.4  18949  18072  18074
 +paper4 -p1 -m5 13286    6095 3.68 45.88% 54.12% ​ 0.2   ​5558 ​  ​5511 ​  5534
 +paper5 -p1 -m5 11954    5494 3.68 45.96% 54.04% ​ 0.1   ​4990 ​  ​4970 ​  4995
 +paper6 -p2     ​38105 ​  14159 2.98 37.16% 62.84% ​ 0.8  13814  13207  13213
 +pic    -p1    513216 ​  57835 0.91 11.27% 88.73% 23.2  52221  56420  52381
 +progc  -p1     ​39611 ​  14221 2.88 35.91% 64.09% ​ 0.7  13941  13251  13261
 +progl  -p1     ​71646 ​  17038 1.91 23.79% 76.21% ​ 3.8  16914  16249  16164
 +progp          49379   11820 1.92 23.94% 76.06% ​ 1.3  11507  11222  11186
 +trans  -p2     ​93695 ​  19511 1.67 20.83% 79.17% ​ 3.7  22578  18961  18862
 +total        3251493 1089796 2.68 33.52% 66.48% 3:18
 +Canterbury Corpus Suite
 +The following shows the results on the Canterbury corpus. ​ Again, I am
 +quite pleased with the results. ​ For example, pucrunch beats GZip -9 for
 +lcet10.txt if you remove the decompression code.
 +FreeBSD PentiumPro® 200MHz
 +Estimated decompression on a C64 (1MHz 6510) 6:00        LhA    Zip GZip-9
 +file        opt     ​in ​   out  b/B  ratio gained time    out    out    out
 +alice29.txt -p4 152089 ​ 55103 2.90 36.24% 63.76% 11.3  59160  54525  54191
 +ptt5        -p1 513216 ​ 57835 0.91 11.27% 88.73% 23.2  52272  56526  52382
 +fields.c ​        ​11150 ​  3505 2.52 31.44% 68.56% ​ 0.1   ​3180 ​  ​3230 ​  3136
 +kennedy.xls ​   1029744 265887 2.07 25.83% 74.17% ​ 571 198354 206869 209733
 +sum              38240  13334 2.79 34.87% 65.13% ​ 0.6  14016  13006  12772
 +lcet10.txt ​ -p4 426754 144585 2.72 33.89% 66.11% 30.8 159689 144974 144429
 +plrabn12.txt-p4 481861 199134 3.31 41.33% 58.67% 43.6 210132 195299 194277
 +cp.html ​    ​-p1 ​ 24603   8679 2.83 35.28% 64.72% ​ 0.4   ​8402 ​  ​8085 ​  7981
 +grammar.lsp -m5   ​3721 ​  1591 3.43 42.76% 57.24% ​ 0.0   ​1280 ​  ​1336 ​  1246
 +xargs.1 ​    ​-m5 ​  ​4227 ​  2117 4.01 50.09% 49.91% ​ 0.0   ​1790 ​  ​1842 ​  1756
 +asyoulik.txt-p4 125179 ​ 50594 3.24 40.42% 59.58% ​ 7.5  52377  49042  48829
 +total          2810784 802364 2.28 28.55% 71.45% 11:29
 +In this article I have presented a compression program which creates
 +compressed executable files for C64, VIC20 and Plus4/​C16. ​ The compression
 +can be performed on Amiga, MS-DOS/Win machine or any other machine with a
 +C-compiler. ​ A powerful machine allows asymmetric compression: ​ a lot of
 +resources can be used to compress the data while needing minimal resources
 +for decompression. This was one of the design requirements.
 +Two original ideas were presented: ​ a new literal byte tagging system and
 +an algorithm using hybrid RLE and LZ77.  Also, a detailed explanation of
 +the LZ77 string match routine and the optima parsing scheme was presented.
 +The compression ratio and decompression speed is comparable to other
 +compression programs for Commodore 8-bit computers.
 +But what are then the real advantages of pucrunch compared to traditional
 +C64 compression programs in addition to that you can now compress VIC20 and
 +Plus4/C16 programs? ​ Because I'm lousy at praising my own work, I let you
 +see some actual user comments. ​ I have edited the correspondence a little,
 +but I hope he doesn'​t mind.  My comments are marked with an asterisk.
 +Maybe Steve has something to add also?
 +A big advantage is that pucrunch does RLE and LZ in one pass.  For demos I
 +only used a cruncher and did my own RLE routines as it is somewhat annoying
 +to use an external program for this.  These programs require some memory
 +and ZP-addresses like the cruncher does.  So it can easily happen that the
 +decruncher or depacker interfere with your demo-part, if you didn't know
 +what memory is used by the depacker. ​ At least you have more restrictions
 +to care about. ​ With pucrunch you can do RLE and LZ without having too much
 +of these restrictions.
 +* Right, and because pucrunch is designed that way from the start, it can
 +* get better results with one-pass RLE and LZ than doing them separately.
 +* On the other hand it more or less requires that you _don'​t_ RLE-pack the
 +* file first..
 +This is true, we also found that out.  We did a part for our demo which had
 +some tables using only the low-nybble. ​ Also the bitmap had to be filled
 +with a specific pattern. ​ We did some small routines to shorten the part,
 +but as we tried pucrunch, this became obsolete. ​ From 59xxx bytes to 12xxx
 +or 50 blocks, with our own RLE and a different cruncher we got 60 blks!
 +Not bad at all ;)
 +Not to mention that you have the complete and commented source-code for the
 +decruncher, so that you can easily change it to your own needs. ​ And it's
 +not only very flexible, it is also very powerful. ​ In general pucrunch does
 +a better job than ByteBoiler+Sledgehammer.
 +In addition to that pucrunch is of course much faster than crunchers on my
 +C64, this has not only to do with my 486/66 and the use of an HDD.  See, I
 +use a cross-assembler-system,​ and with pucrunch I don't have to transfer
 +the assembled code to my 64, crunch it, and transfer it back to my pc.
 +Now, it's just a simple command-line and here we go...  And not only I can
 +do this, my friend who has an amiga uses pucrunch as well.  This is the
 +first time we use the same cruncher, since I used to take ByteBoiler, but
 +my friend didn't have a REU so he had to try another cruncher.
 +So, if I try to make a conclusion: ​ It's fast, powerful and extremly
 +flexible (thanks to the source-code).
 +Just for your info...
 +We won the demo-competition at the Interjam'​98 and everything that was
 +crunched ran through the hands of pucrunch... ​ Of course, you have been
 +mentioned in the credits. ​ If you want to take a look, search for
 +KNOOPS/​DREAMS,​ which should be on the ftp-servers in some time.
 +So, again, THANKS! ​ :)
 +       ​Ninja/​DREAMS
 +So, what can I possibly hope to add to that, right?:-)
 +If you have any comments, questions, article suggestions or just a general
 +hello brewing in your mind, send me mail or visit my homepage.
 +See you all again in the next issue!
 +Appendix: The Log Book
 +       Tried reverse LZ, i.e.  mirrored history buffer. ​ Gained some
 +       ​bytes,​ but its not really worth it, i.e.  the compress time
 +       ​increases hugely and the decompressor gets bigger.
 +       Tried to have a code to use the last LZ copy position (offset
 +       added to the lastly used LZ copy position). ​ On I gained
 +       57 bytes, but in fact the net gain was only 2 bytes
 +       ​(decompressor becomes ~25 bytes longer, and the lengthening of
 +       the long rle codes takes away the rest 30).
 +       ​Discovered that my representation of integers 1-63 is in fact
 +       an Elias Gamma Code.  Tried Fibonacci code instead, but it was
 +       much worse (~500 bytes on, ~300 bytes on
 +       ​without even counting the expansion of the decompression code.
 +       '​huffman'​ coded RLE byte -> ~70 bytes gain for ​ The RLE
 +       bytes used are ranked, and top 15 are put into a table, which
 +       is indexed by a Elias Gamma Code.  Other RLE bytes get a prefix
 +       "​1111"​.
 +       The number of escape bits used is again selectable. ​ Using only
 +       one escape bit for gains ~150 bytes. ​ If #-option is
 +       not selected, automatically selects the number of escape bits
 +       (is a bit slow).
 +       ​Changed some arrays to short. ​ 17 x inlen + 64kB memory used.
 +       ​opt-escape() only needs two 16-element arrays now and is
 +       ​slightly faster.
 +       Tried to use BASIC ROM as a codebook, but the results were not
 +       so good.  For mostly-graphics files there are no long matches ->
 +       no net gain, for mostly-code files the file itself gives a
 +       ​better codebook.. ​ Not to mention that using the BASIC ROM as a
 +       ​codebook is not 100% compatible.
 +       Tried maxlen 128, but it only gained 17 bytes on,​
 +       and lost ~15 byte on ​ This also increased the LZPOS
 +       ​maximum value from ~16k to ~32k, but it also had little effect.
 +       ​Changed to coding so that LZ77 has the priority. ​ 2-byte LZ
 +       ​matches are coded in a special way without big loss in
 +       ​efficiency,​ and codes also RLE/Escape.
 +       Tried histogram normalization on LZLEN, but it really did not
 +       gain much of anything, not even counting the mapping table from
 +       index to value that is needed.
 +       8..14 bit LZPOS base part.  Automatic selection. ​ Some more bytes
 +       are gained if the proper selection is done before the LZ/RLELEN
 +       ​optimization. ​ However, it can't really be done automatically
 +       ​before that, because it is a recursive process and the original
 +       ​LZ/​RLE lengths are lost in the first optimization..
 +       Found a way to speed up the almost pathological cases by using
 +       the RLE table to skip the matching beginnings.
 +       ​Switched to maximum length of 128 to get better results on the
 +       ​Calgary Corpus test suite.
 +       Made the maximum length adjustable. ​ -m5, -m6, and -m7 select
 +       64, 128 and 256 respectively. ​ The decompression code now allows
 +       ​escape bits from 0 to 8.
 +       ​Optimized the escape optimization routine. ​ It now takes almost
 +       no time at all.  It used a whole lot of time on large escape bit
 +       ​values before. ​ The speedup came from a couple of generic data
 +       ​structure optimizations and loop removals by informal
 +       ​deductions.
 +       ​Figured out another, better way to speed up the pathological
 +       ​cases. ​ Reduced the run time to a fraction of the original time.
 +       All 64k files are compressed under one minute on my 25 MHz
 +       ​68030. ​ pic from the Calgary Corpus Suite is now compressed in
 +       19 seconds instead of 7 minutes (200 MHz Pentium w/ FreeBSD).
 +       ​Compression of (one of my problem cases) was
 +       ​reduced from about 15 minutes to 47 seconds. ​ The compression of
 +       ​ has been reduced from 28 minutes (the first version) to
 +       24 seconds. ​ An excellent example of how the changes in the
 +       ​algorithm level gives the most impressive speedups.
 +       ​Changed the command line switches to use the standard approach.
 +       Now determines the number of bytes needed for temporary data
 +       ​expansion (i.e.  escaped bytes). ​ Warns if there is not enough
 +       ​memory to allow successful decompression on a C64.
 +       Also, now it's possible to decompress the files compressed with
 +       the program (must be the same version). ​ (-u)
 +       Only checks the lengths that are power of two's in
 +       ​OptimizeLength(),​ because it does not seem to be any (much)
 +       worse than checking every length. ​ (Smaller than found maximum
 +       ​lengths are checked because they may result in a shorter file.)
 +       This version (compiled with optimizations on) only spends 27
 +       ​seconds on
 +       ​Removed 4 bytes from the decrunch code (begins to be quite
 +       tight now unless some features are removed) and simultaneously
 +       ​removed a not-yet-occurred hidden bug.
 +       ​Checked the theoretical gain from using the lastly outputted
 +       byte (conditional probabilities) to set the probabilities for
 +       ​normal/​LZ77/​RLE selection. ​ The number of bits needed to code
 +       the selection is from 0.0 to 1.58, but even using arithmetic
 +       code to encode it, the original escape system is only 82 bits
 +       worse (,​ 7881/7963 bits total. ​ The former figure is
 +       ​calculated from the entropy, the latter includes
 +       ​LZ77/​RLE/​escape select bits and actual escapes.
 +       In LZ77 match we now check if a longer match (further away)
 +       ​really gains more bits.  Increase in match length can make the
 +       code 2 bits longer. ​ Increase in match offset can make the code
 +       even longer (2 bits for each magnitude). ​ Also, if LZPOS low
 +       part is longer than 8, the extra bits make the code longer if
 +       the length becomes longer than two.
 +       ​ivanova -5 bytes, sheridan -14, delenn -26, bs -29
 +       When generating the output rescans the LZ77 matches. ​ This is
 +       ​because the optimization can shorten the matches and a shorter
 +       match may be found much nearer than the original longer match.
 +       ​Because longer offsets usually use more bits than shorter ones,
 +       we get some bits off for each match of this kind.  Actually, the
 +       ​rescan should be done in OptimizeLength() to get the most out
 +       of it, but it is too much work right now (and would make the
 +       ​optimize even slower).
 +       4 bytes removed from the decrunch code.  I have to thank Tim
 +       ​Rogers ( for helping with 2 of them.
 +       ​Because SuperCPU doesn'​t work correctly with inc/dec $d030, I
 +       made the 2 MHz user-selectable and off by default. ​ (-f)
 +       Today I found out that most of my fast string matching
 +       ​algorithm matches the one developed by [Fenwick and Gutmann,
 +       ​1994]*. ​ It's quite frustrating to see that you are not a genius
 +       after all and someone else has had the same idea.  :-) However,
 +       using the RLE table to help still seems to be an original idea,
 +       which helps immensely on the worst cases. ​ I still haven'​t read
 +       their paper on this, so I'll just have to get it and see..
 +       * [Fenwick and Gutmann, 1994]. ​ P.M.  Fenwick and P.C.  Gutmann,
 +       "​Fast LZ77 String Matching",​ Dept of Computer Science, The
 +       ​University of Auckland, Tech Report 102, Sep 1994
 +       The new decompression code can decompress files from $258 to
 +       $ffff (or actually all the way upto $1002d :-).  The drawback
 +       ​is: ​ the decompression code became 17 bytes longer. ​ However, the
 +       old decompression code is used if the wrap option is not
 +       ​needed.
 +       The backSkip table can now be fixed size (64 kWord) instead of
 +       ​growing enormous for "​BIG"​ files. ​ Unfortunately,​ if the
 +       ​fixed-size table is used, the LZ77 rescan is impractical (well,
 +       just a little slow, as we would need to recreate the backSkip
 +       table again). ​ On the other hand the rescan did not gain so many
 +       bytes in the first place (percentage). ​ The define BACKSKIP-FULL
 +       ​enables the old behavior (default). ​ Note also, that for smaller
 +       files than 64kB (the primary target files) the default consumes
 +       less memory.
 +       The hash value compare that is used to discard impossible
 +       ​matches does not help much.  Although it halves the number of
 +       ​strings to consider (compared to a direct one-byte compare),
 +       ​speedwise the difference is negligible. ​ I suppose a mismatch is
 +       found very quickly when the strings are compared starting from
 +       the third charater (the two first characters are equal, because
 +       we have a full hash table). ​ According to one test file, on
 +       ​average 3.8 byte-compares are done for each potential match. ​ A
 +       ​define HASH-COMPARE enables (default) the hash version of the
 +       ​compare,​ in which case "​inlen"​ bytes more memory is used.
 +       After removing the hash compare my algorithm quite closely
 +       ​follows the [Fenwick and Gutmann, 1994] fast string matching
 +       ​algorithm (except the RLE trick). ​ (Although I *still* haven'​t
 +       read it.)
 +       14 x inlen + 256 kB of memory is used (with no HASH-COMPARE and
 +       ​without BACKSKIP-FULL).
 +       One byte removed from the decompression code (both versions).
 +       Only records longer matches if they compress better than
 +       ​shorter ones.  I.e.  a match of length N at offset L can be
 +       ​better than a match of length N+1 at 4*L.  The old comparison
 +       was "​better or equal" (">​="​). ​ The new comparison "​better"​ (">"​)
 +       gives better results on all Calgary Corpus files except "​geo",​
 +       which loses 101 bytes (0.14% of the compressed size).
 +       An extra check/​rescan for 2-byte matches in OptimizeLength()
 +       ​increased the compression ratio for "​geo"​ considerably,​ back to
 +       the original and better. ​ It seems to help for the other files
 +       ​also. ​ Unfortunately this only works with the full backskip
 +       table (BACKSKIP-FULL defined).
 +       ​Compression/​Decompression for VIC20 and C16/+4 incorporated
 +       into the same program.
 +       ​Removed two bytes from the decompression codes.
 +       There was a small bug in pucrunch which caused the location
 +       $2c30 to be decremented (dec $2c30 instead of bit $d030) when
 +       run without the -f option. ​ The source is fixed and executables
 +       are now updated.
 +   1. http://​​~albert/​
 +   2. http://​​~albert/​Dev/​
 +   2. http://​​~albert/​Dev/​pucrunch/​
 +.                                    C=H #17
 +VIC-20 Kernal ROM Disassembly Project
 +Richard Cini
 + In order to put this project into perspective,​ a little personal ​
 +history is needed. I received my first Commodore as a gift from my parents ​
 +back in 1982. I used Commodore PETs in the school'​s computer lab, and 
 +Radio Shack Model I's in the local R/S store. It was nice to have one 
 +of my own to hack on, though. ​ Back then, most of my work was with the 
 +built-in BASIC interpreter. My claim to fame (to my family, at least) was 
 +a BASIC/​machine language mailing list management program, and an allophone ​
 +speech synthesizer hardware-software hack. Both worked well, which surprised
 +my mother, who claims that nothing I ever built worked right. ​ As I grew-up, ​
 +my computer-of-choice changed, but I never lost my love for the VIC. It is 
 +small, easy to program, has a very capable processor (by 1980's standards) ​
 +and decent I/O capability. It's peripherals were varied, if not quirky ​
 +(take the 1515 printer, for example), but at least everything worked well 
 +together.  ​
 + Now, fast forward to 1994. Commodore International failed, ​
 +crippled by years of a weak product strategy, squandered opportunities ​
 +and the market'​s increased focus on mainstream PC-compatible or Macintosh ​
 +machines as productivity tools. After Commodore'​s failure, I decided that 
 +I wanted to try to purchase the Commodore 8-bit intellectual property, ​
 +including the rights to the Kernal and BASIC source code, primarily for 
 +preservation purposes. One of my hobbies is collecting and preserving ​
 +obsolete and unsupported computers, accessories,​ documentation,​ etc. 
 + Since Commodore'​s bankruptcy attorney would not return my calls 
 +(no surprise there), I embarked on decompiling the Kernal. This project, ​
 +although time consuming and rewarding from an informational perspective ​
 +was not truly trail-blazing. Many before me probably decompiled parts of 
 +the Kernal in order to gain some understanding as it related to another ​
 +project. However, in my research, I don't recall ever seeing complete ​
 +recompileable source code. Memory and ROM maps, yes; source code, no. 
 + Marko Makela manages a great Commodore web site that contains lots 
 +of useful information,​ including these memory and ROM maps. These provided ​
 +the starting point for my work. See http://​​misc/​cbm/​docs/​ for 
 +this and much more information. ​ What I would like to accomplish in a 
 +series of articles is to explain the process and to discuss specific ​
 +Kernal routines that may be of interest to C=Hacking readers. Also, where 
 +appropriate,​ I will make comparisons with the other Commodore machines ​
 +that were the contemporaries of the VIC, the C64 and the PET, specifically. ​
 + One might ask, "Why is this project different from all of the other 
 +resources already available?"​ In short, here's why:
 +1. The end result is a fully modifiable and compilable source file.
 +2. Only the VIC memory map and ROM location map are available ​
 + *because* to date everyone has focused on the C64 as it is the 
 + more functional (and hence, more popular) Commodore of the era. 
 + The same goes for the PET, too.
 +3. A far as I know, there was no "​Mapping the VIC" written, because ​
 + of #2, above.
 +4. Because of #3, the fact that I had some spare time, and that I 
 + love my VIC (although I don't use it as much these days), I felt 
 + that I had to do it.
 +Brief VIC-20 History
 +Although I'd like to provide a complete history lesson on the VIC-20, many 
 +others before me have done a better job. The March 1985 issue of the 
 +IEEE Spectrum has an article, as does Marko'​s web site. See 
 +http://​​misc/​cbm/​docs/​peddle.english.html for a great article on 
 +Chuck Peddle, the well-known creator of the 6502 microprocessor. Nonetheless,​
 +I will provide a Readers'​ Digest version of the history of the VIC.
 + In the late-70s, engineers in the Advanced Systems Design Group 
 +(ASDG) at Commodore created a multi-function video/sound interface chip, 
 +the 6560 (a.k.a., the VIC). Al Charpentier ran the LSI section of the ASDG,
 +and was the lead in designing the VIC-I, and later, the VIC-II. The ASDG was
 +the old MOS Semiconductor operation that Commodore bought in the mid-70s.
 +The 6560 supported complete composite color video, 3-voice plus white noise 
 +sound, a volume control, two analog-to-digital converters that supported ​
 +the use of a game paddle or joystick, and a light pen interface. ​
 +Feature-rich as it was, no manufacturer wanted to commit a product line to 
 +it. Since Commodore could not find anyone to buy the chip, they decided to 
 +build a computer that featured the chip. Hence, the VIC-20 was born. Al'​s ​
 +buddy, Bob Yannes, was a senior systems designer at Commodore who developed
 +the VIC-20 (and later, the C64) prototype. The VIC was Commodore'​s first 
 +color computer, and the first designed for home use.  ​
 +Relationships with other Commodore Products
 + The structure of the VIC-20 ROM parallels those of Commodore'​s ​
 +other machines of that era, the PET and the C64. All three machines share 
 +the concept of a standard, public, API accessed through a jump table 
 +located in the last page of the system ROM.
 + There'​s also another striking similarity, which comes somewhat as 
 +a surprise to this writer, but intuitively makes sense. The C64 and VIC 
 +Kernal ROMs are nearly identically laid-out and contain a lot of common ​
 +code. The C64 ROM of course contains certain modifications relating to 
 +its unique hardware and capabilities,​ but otherwise is the same. Since 
 +the machines are so similar in design (the same engineer designed them),
 +the similarity of the code isn't surprising. This recycling appears to 
 +have been a cost-effective way to develop a new computer in record time. 
 +The Process
 +----------- ​
 +I believe that the reverse engineering process for the purpose of creating ​
 +source code that produces binary-identical object code, is fairly standard: ​
 +know thy hardware, get the object code, turn object code into assembly code, 
 +give everything names, test compile, fix errors, re-compile and call it done.
 +However, I'm certain that C=Hacking readers have reverse engineering methods
 +which differ from mine. I'd be interested in hearing them, as I'm always ​
 +looking for a better way to do things. ​ I started with an image of the ROM 
 +from my VIC (although a ROM image from funet would work) and cranked it 
 +through a disassembler (I used SuperMon). Since I wasn't too up on 
 +transferring data from the VIC to the PC, I used a brute-force method: ​
 +scanning disassembler listings into TIFF files and running them through ​
 +an OCR program. This produced plain-text files for me to work with.  Then, 
 +I used various available information on the Web and in books (such as "​VIC ​
 +Revealed",​ "The Commodore Innerspace Anthology",​ "​Mapping the 64")
 +to break the code into subroutines. I inserted meaningful memory and 
 +program location labels, taking names from all of the above sources. ​
 +My assembler allowed me to create conventional data and code "​segments",​
 +so I took the time to create a "​real"​ data segment that mirrored the 
 +VIC memory map. These segments are not to be confused with the segments ​
 +supported under the various PeeCee memory models. I used TASM 3.1, a 
 +shareware 8-bit table-based assembler by Squak Valley Software of 
 +Issaquah, WA. TASM supports the 6502, 6800/05/11, 8048/​51/​85/​96,​ Z80 and 
 +TMS7000/​32010/​32025 processors.
 +Next, I read the code and added comments as I went along. I do this on a 
 +routine-by-routine basis, as time permits, but always beginning at the 
 +first instruction after POR. Finally, I did a recompilation and compared ​
 +the recompiled output with the ROM image to make sure that no errors were 
 +introduced in the source code creation process. I wasn't so lucky 
 +the first time around :-). I like to make gross checks first: ROM image 
 +size, location of well-known routines (such as the jump table at $FF85, ​
 +the POR vector at $FD22, and the individual jump table routines). ​
 +This helps to narrow down the location of any errors.
 +Once I was satisfied that the disassembly was right, I created ROM image to 
 +be burned into a test EPROM. The ROMs used in the VIC are 2364 
 +mask-programmed ROMs. 2364s are a 24-pin 8k x 8bit ROM device, and the 
 +closest commonly available EPROM is the 2764 EPROM, a 28-pin 8k x 8bit 
 +device. Since the 2364s have four fewer pins (reflecting its non-
 +programmability),​ an adapter board needs to be built. A piece of perf board
 +and two 28-pin wire-wrap DIP socket should do the trick.
 + If all checks-out, the source is then ready for modification.
 +Issues and Considerations
 + There is one important thing that I discovered while hacking the 
 +Kernal -- there is no wasted space in the Kernal ROM as it presently exists.
 +Actually, the Kernal occupies 1,279 bytes less than 8k, beginning at 
 +$E500 (and consequently,​ the BASIC ROM overhangs the $E000 boundary by 
 +1,279 bytes). The Kernal developers maximized the 8k space, so any Kernal ​
 +hack will have to use jumps to a patch area elsewhere in the processor ​
 +address space. For example, let's say that I wanted to add BASIC 4.0 disk 
 +commands to the VIC Kernal by hacking it (recognizing that I could have 
 +used the easier wedge method). I could place jumps in the Kernal ROM to 
 +locations within my own non-autostart ROM located at $A000. ​ The other 
 +important consideration is backward compatibility. Certain programs may rely 
 +on the specific location of code within the Kernal ROM. For example, a 
 +game ROM may make direct calls to the internal Plot routine, as opposed ​
 +to using the jump table at the end of the Kernal ROM (saving a few 
 +processor ticks in the process). Shifting code around would relocate that 
 +code, breaking that program.
 +General Hardware Information
 +The Microprocessor
 +Manufactured by MOS and second-sourced from Rockwell Semiconductors,​ the stock 
 +6502 is an 8-bit, 1MHz NMOS-process processor. It supports 56 instructions ​
 +in 13 addressing modes (although six are combinations of the seven basic 
 +modes), three processor registers (.A, .X, and .Y), stack pointer, and a 
 +condition code (flags) register. The 6502 supports both maskable and 
 +non-maskable interrupts with fixed vectors at the top of its 64k address ​
 +I/O Processors
 + VIC-20 I/O is managed by two memory-mapped I/O processors, the 
 +6522 versatile interface adapter (VIA). The I/O region occupies the 
 +4k-address space beginning at $9000. ​ The VIAs manage the keyboard, ​
 +joystick, light pen, cassette deck, the IEEE serial interface and the 
 +user port. Each 6522 has 16 bi-directional I/O lines, four handshaking ​
 +lines, two 8-bit shift registers and two clock generators (capable of 
 +generating free-running or triggered pulses). One of these clocks is 
 +responsible for RTC, RS-232, IEEE, and cassette tape timing. Of the eight 
 +handshaking lines, two are free for general use, while the other six are 
 +used for the IEEE serial port, the RESTORE key, and cassette control. ​
 +24-bits of the total 32-bits of I/O are used for keyboard scanning and 
 +joystick, light pen, and serial inputs. Truly available to the user is four 
 +handshaking lines and 8-bits of I/O.
 +Video Interface
 +The VIC-20 video interface is managed by the 6560/6561 VIC chip. The 
 +VIC screen is organized in 22 columns by 24 rows in text mode and 
 +176 by 192 pixels in "​graphics"​ mode. Graphics mode is synthesized by 
 +mapping the character generator ROM to RAM and modifying the character ​
 +glyphs. ​ The on-chip video sync generator is capable of generating video 
 +in NTSC (6560) or PAL (6561) formats in 16 colors. The VIC also contains ​
 +three programmable tone generators, a white noise source, volume control,
 +two A/D converters (for game paddle interfacing),​ a light pen input, screen
 +centering, and independent control over background, foreground, and border ​
 +colors. ​ The 6560/6561 performs its own DMA to separate 4-bit video RAM 
 +and 8-bit character generator ROM. The address buss is a private 2MHz buss,
 +which is not shared with the 1MHz microprocessor buss. Shared ROM access ​
 +is performed during the processor Phase 1 clock.
 +Memory Map
 + Space limitations prohibit listing the complete VIC-20 memory map, 
 +but an abridged version may be helpful:
 +0000-00FF Zero page: Kernal and BASIC system areas
 +0100-01FF ​   Page 1: tape error log area, processor stack
 +0200-02FF ​   Page 2: BASIC input buffer, file and device address tables,
 + keyboard and screen vars, RS232 vars
 +0300-03FF ​   Page 3: BASIC vectors, processor register storage, Kernal ​
 + vectors cassette buffer area
 +0400-0FFF Pages 4-15: 3k expansion area
 +1000-1DFF User Basic area (unexpanded VIC)
 +1E00-1FFF Screen memory (unexpanded VIC)
 +2000-3FFF 8K expansion RAM/ROM block 1
 +4000-5FFF 8K expansion RAM/ROM block 2
 +6000-7FFF 8K expansion RAM/ROM block 3
 +NOTE: When additional memory is added to block 1, 2 or 3,
 +the Kernal relocates the following things for BASIC:
 +1000-11FF Screen memory
 +1200-?​ User Basic area
 +9400-95FF Color RAM
 +8000-8FFF 4K Character generator ROM
 +8000-83FF Upper case and graphics
 +8400-87FF Reversed upper case and graphics
 +8800-8BFF Upper and lower case
 +8C00-8FFF Reversed upper and lower case
 +9000-93FF I/​O Block 0
 +9000-900F VIC chip registers
 +9110-911F 6522 VIA#1 registers
 +9120-912F 6522 VIA#2 registers
 +9400-95FF location of COLOR RAM with additional RAM at blk 1
 +9600-97FF Normal location of COLOR RAM
 +9800-9BFF I/​O Block 2
 +9C00-9FFF I/​O Block 3
 +A000-BFFF 8K block for expansion ROM (autostart ROM)
 +E000-FFFF 8K Kernal ROM
 +Kernal Functions
 +System Startup
 + Let's first take a look at how a VIC-20 boots. The process is 
 +substantially similar for the C64 and the PET (through the commonality ​
 +of the microprocessor upon which each machine is based), although the 
 +locations of various routines differs, as does the memory and I/O map.
 +When power is first applied to the microprocessor,​ the RESET pin is held 
 +low by a 555 timer for a period long enough for the power supply and clock 
 +generator to stabilize. The 6502 utilizes the last six bytes of the address ​
 +space to store three critical vectors: the NMI, RESET, and IRQ vectors, ​
 +respectively. On power-up, the PC (program counter) is initialized to the 
 +address stored at location $FFFC and execution begins at that location ​
 +(the first column in the source code represents the program line number):
 +6495   FFFA ;​=================================================
 +6496   FFFA ; - Power-on and hardware vectors
 +6497   FFFA ;
 +6498   FFFA A9 FE        .dw NMI ;​non-maskable interrupt
 +6499   FFFC 22 FD        .dw RESET ;POR
 +6500   FFFE 72 FF        .dw IRQ ;IRQ processor
 +Execution begins at $FD22, the POR (power-on reset) vector:
 +5971   ​FD22 ​ ;#################################################​
 +5972   ​FD22 ​ ; Power-on RESET entry
 +5973   ​FD22 ​ ;#################################################​
 +5974   ​FD22 ​            ​RESET
 +5975   FD22 A2 FF        LDX #$FF
 +5976   FD24 78          SEI ;kill interrupts
 +5977   FD25 9A          TXS ;set stack top
 +5978   FD26 D8          CLD
 +5979   FD27 20 3F FD    JSR SCNROM ;​check for autostart ROM
 +5980   FD2A D0 03        BNE SKIPA0 ;not there, skip ROM init
 +5981   ​FD2C ​            
 +5982   FD2C 6C 00 A0    JMP (A0BASE) ;​jump to ROM init if present
 +5983   ​FD2F ​            
 +5984   ​FD2F ​            ​SKIPA0
 +5985   FD2F 20 8D FD    JSR RAMTAS ;​test RAM
 +5986   FD32 20 52 FD    JSR IRESTR ;​init work memory
 +5987   FD35 20 F9 FD    JSR IOINIT ;​setup hardware
 +5988   FD38 20 18 E5    JSR CINT1 ;init video
 +5989   FD3B 58          CLI ;​re-enable interrupts
 +5990   FD3C 6C 00 C0    JMP (BENTER) ;​enter BASIC
 + The startup routines setup the processor stack, check for the 
 +existence of an autostart ROM. Autostart ROMs are located in the $A000 
 +block and have a five-byte signature (A0CBM) at offset $04. If the 
 +signature is found, the Kernal jumps to the A0ROM initialization routine ​
 +pointed to by offset $00 of the autostart ROM.
 + If no signature is found, the Kernal initialization continues by 
 +testing the RAM, initializing the system variables, system hardware, and 
 +the screen. Finally, the Kernal transfers control to the BASIC 
 +initialization entry point at $C000.
 +Routine SCNROM
 + The first routine called from the POR code is the SCNROM routine. ​
 +This routine looks for the special 5-byte signature that indicates the 
 +presence of an autostart ROM located in the $A segment.
 +5992   ​FD3F ​ ;​====================================================
 +5993   ​FD3F ​ ; SCNROM - Scan ROM areas for Autostart ROM signature
 +5994   ​FD3F ​ ;
 +5995   ​FD3F ​            ​SCNROM
 +5996   FD3F A2 05        LDX #$05 ;5 chars to compare
 +5997   ​FD41 ​            
 +5998   ​FD41 ​            ​SCNLOOP
 +5999   FD41 BD 4C FD    LDA SCANEX,​X ;​start at end of signature
 +6000   FD44 DD 03 A0    CMP A0BASE+3,​X ;​compare to ROM sig area
 +6001   FD47 D0 03        BNE SCANEX ;no match, exit loop
 +6002   ​FD49 ​            
 +6003   FD49 CA          DEX ;​match;​ check next char
 +6004   FD4A D0 F5        BNE SCNLOOP ;​loop
 +6005   ​FD4C ​            
 +6006   ​FD4C ​            ​SCANEX
 +6007   FD4C 60          RTS ;​return;​ Z=0 if no match
 +6008   ​FD4D ​            ;
 +6009   ​FD4D ​            ; ROMSIG - Autostart ROM signature
 +6010   ​FD4D ​            ;
 +6011   ​FD4D ​            ​ROMSIG
 +6012   FD4D 4130C3C2CD ​ .db "​A0",​ $C3, $C2, $CD  ;A0CBM
 +Routine RAMTAS
 + The RAMTAS routine is the second subroutine in the initialization ​
 +process. It clears the first three pages of RAM, then searches for expansion
 +memory. If any is found, the screen memory, color memory, and start of 
 +BASIC RAM pointers are adjusted to their documented alternates.
 +6058   ​FD8D ​ ;​===================================================
 +6059   ​FD8D ​ ; RAMTAS - Initialize system contents
 +6060   ​FD8D ​ ;
 +6061   ​FD8D ​            ​RAMTAS
 +6062   FD8D A9 00        LDA #$00 ;zero regs .A and .X
 +6063   FD8F AA          TAX
 +6064   ​FD90 ​            
 +6065   ​FD90 ​            ​RAMTSLP1 ;​clear system memory areas
 +6066   FD90 95 00        STA USRPOK,​X ;​zero page
 +6067   FD92 9D 00 02    STA BUF,​X ;​clear page 2
 +6068   FD95 9D 00 03    STA ERRVPT,​X ;​clear page 3
 +6069   FD98 E8          INX
 +6070   FD99 D0 F5        BNE RAMTSLP1 ;​loop till done
 +6071   ​FD9B ​            
 +6072   FD9B A2 3C        LDX #$3C ;setup cassette buffer
 +6073   FD9D A0 03        LDY #$03 ;area to $033c
 +6074   FD9F 86 B2        STX TAPE1
 +6075   FDA1 84 B3        STY TAPE1+1
 +6076   FDA3 85 C1        STA STAL ;clear I/O start address…
 +6077   FDA5 85 97        STA REGSAV ;​…register save
 +6078   FDA7 8D 81 02    STA OSSTAR ;​…and start of OS memory ptr
 +6079   FDAA A8          TAY ; .Y=0
 +6080   FDAB A9 04        LDA #$04 ;check RAM from $0400
 +6081   FDAD 85 C2        STA STAL+1 ;set I/O start to page 3
 +6082   ​FDAF ​            
 +6083   ​FDAF ​            ​RAMTASLP2
 +6084   FDAF E6 C1        INC STAL ;​increment LSB
 +6085   FDB1 D0 02        BNE RAMTAS1 ;​not done with page, cont.
 +6086   ​FDB3 ​            
 +6087   FDB3 E6 C2        INC STAL+1 ;​inc. to new page
 +6088   ​FDB5 ​            
 +6089   ​FDB5 ​            ​RAMTAS1
 +6090   FDB5 20 91 FE    JSR MEMTST ;​test RAM
 +6091   FDB8 A5 97        LDA REGSAV
 +6092   FDBA F0 22        BEQ RAMTAS3
 +6093   FDBC B0 F1        BCS RAMTASLP2 ;​next address
 +6094   ​FDBE ​            
 +6095   FDBE A4 C2        LDY STAL+1 ;​done testing,get RAM top MSB…
 +6096   FDC0 A6 C1        LDX STAL ;​…and LSB
 +6097   FDC2 C0 20        CPY #$20 ; top at $2000
 +6098   FDC4 90 25        BCC I6561LP ;​page below $2000, halt
 +6099   ​FDC6 ​            
 +6100   FDC6 C0 21        CPY #$21 ;RAM at $2000?
 +6101   FDC8 B0 08        BCS RAMTAS2 ;​yes,​ set params
 +6102   ​FDCA ​            
 +6103   FDCA A0 1E        LDY #$1E ;$1E00
 +6104   FDCC 8C 88 02    STY HIPAGE
 +6105   ​FDCF ​            
 +6106   ​FDCF ​            ​RAMTAS1A
 +6107   FDCF 4C 7B FE    JMP STOTOP ;CLC and set RAM top
 +6108   ​FDD2 ​            
 +6109   ​FDD2 ​            ​RAMTAS2
 +6110   FDD2 A9 12        LDA #$12 ;With exp. RAM, BASIC starts
 +6111   FDD4 8D 82 02    STA OSSTAR+1 ;​at $1200…
 +6112   FDD7 A9 10        LDA #​$10 ;​…and screen starts at $1000
 +6113   FDD9 8D 88 02    STA HIPAGE
 +6114   FDDC D0 F1        BNE RAMTAS1A ;​set top of RAM and exit
 +6115   ​FDDE ​            
 +6116   ​FDDE ​            ​RAMTAS3
 +6117   FDDE 90 CF        BCC RAMTASLP2 ;​loop to next address
 +6118   ​FDE0 ​            
 +6119   FDE0 A5 C2        LDA STAL+1 ;get MSB of I/O start
 +6120   FDE2 8D 82 02    STA OSSTAR+1 ;​save as start of OS
 +6121   FDE5 85 97        STA REGSAV ;​save copy
 +6122   FDE7 C9 11        CMP #$11 ;page $11
 +6123   ​FDE9 ​            
 +6124   ​FDE9 ​            ​RATS3
 +6125   FDE9 90 C4        BCC RAMTASLP2
 +6126   ​FDEB ​            
 +6127   ​FDEB ​            ​I6561LP
 +6128   FDEB 20 C3 E5    JSR V6561I-2 ;​$E5C3 init VIC regs
 +6129   FDEE 4C EB FD    JMP I6561LP
 +This routine actually tests the RAM, and is called during the memory ​
 +search loop at $FDB5. It uses a simple walking-bit pattern to test for 
 +memory defects:
 +6271   ​FE91 ​ ;​===================================================
 +6272   ​FE91 ​ ; MEMTST - Test memory
 +6273   ​FE91 ​ ; .Y is index in page
 +6274   ​FE91 ​            ​MEMTST
 +6275   FE91 B1 C1        LDA (STAL),​Y ;​get address
 +6276   FE93 AA          TAX ;save .A
 +6277   FE94 A9 55        LDA #​%01010101 ;​set pattern
 +6278   FE96 91 C1        STA (STAL),​Y ;​save it…
 +6279   FE98 D1 C1        CMP (STAL),​Y ;​…and compare
 +6280   FE9A D0 08        BNE MEMTS1 ;not equal
 +6281   ​FE9C ​            ​ ;​pattern compares OK
 +6282   FE9C 6A          ROR A ;​%10101010 invert pattern
 +6283   FE9D 91 C1        STA (STAL),​Y ;​save it…
 +6284   FE9F D1 C1        CMP (STAL),​Y ;​…and compare
 +6285   FEA1 D0 01        BNE MEMTS1 ;not equal
 +6286   FEA3 A9          .db $A9 ;LDA #$18 for OK, $55 or $AA
 +6287   ​FEA4 ​             ; for failed pattern
 +6288   ​FEA4 ​            ​MEMTS1
 +6289   FEA4 18          CLC ;CLC only on error
 +6290   FEA5 8A          TXA ;​restore previous .A
 +6291   FEA6 91 C1        STA (STAL), Y ;save it
 +6292   FEA8 60          RTS
 + The RAMTAS routine also calls STOTOP to save the top of RAM pointer:
 +6241   ​FE73 ​ ;​==================================================
 +6242   ​FE73 ​ ; IMEMTP - Set/read top of memory (internal)
 +6243   ​FE73 ​ ; On entry, SEC to read, .X/.Y is LSB/MSB
 +6244   ​FE73 ​ ;           CLC to set, .X/.Y is LSB/MSB
 +6245   ​FE73 ​ ;
 +6246   ​FE73 ​            ​IMEMTP
 +6247   FE73 90 06        BCC STOTOP ;set or read?
 +6248   FE75 AE 83 02    LDX OSTOP ;get top of memory
 +6249   FE78 AC 84 02    LDY OSTOP+1
 +6250   ​FE7B ​            
 +6251   ​FE7B ​            ​STOTOP
 +6252   FE7B 8E 83 02    STX OSTOP ;set top of memory
 +6253   FE7E 8C 84 02    STY OSTOP+1
 +6254   FE81 60          RTS
 +Routine IRESTR ​
 + This routine loads (or re-loads) the default Kernal vectors upon 
 +POR or Run-Stop/​Restore sequences. The default vectors that are loaded ​
 +include the links to IRQ, NMI, Open, Close, Channel In, Channel Out, 
 +Clear Channels, Character In, Character Out, Scan Stop Key, Get Keyboard ​
 +Character, Close All, Load and Save routines within the Kernal ROM.
 +6014   ​FD52 ​ ;​====================================================
 +6015   ​FD52 ​ ; IRESTR - Restore KERNAL hardware vectors (internal)
 +6016   ​FD52 ​ ; Called during POR and NMI sequences.
 +6017   ​FD52 ​ ;
 +6018   ​FD52 ​            ​IRESTR
 +6019   FD52 A2 EA        LDX #$EA ;​FIXUP2;#​$6D points to list of 
 +6020   FD54 A0 EA        LDY #$EA ;​FIXUP2+1;#​$FD $FD6D KERNAL vecs
 +6021   FD56 18          CLC
 +6022   ​FD57 ​            ;
 +6023   ​FD57 ​            ; IVECTR - Change vectors for user
 +6024   ​FD57 ​            ; On entry, SEC= read vector to .X/.Y LSB/MSB
 +6025   ​FD57 ​            ;​  ​         CLC= set vector from .X/.Y LSB/MSB
 +6026   ​FD57 ​            ;
 +6027   ​FD57 ​            ​IVECTR
 +6028   FD57 86 C3        STX MEMUSS ;​save vector to temp
 +6029   FD59 84 C4        STY MEMUSS+1
 +6030   FD5B A0 1F        LDY #$1F ;# of bytes to move
 +6031   ​FD5D  ​
 +6032   ​FD5D ​            ​VECLOOP
 +6033   FD5D B9 B6 02    LDA IRQVP,​Y ;​get old vector address
 +6034   FD60 B0 02        BCS VECSK ;​branch on CY=1/read
 +6035   ​FD62 ​            
 +6036   FD62 B1 C3        LDA (MEMUSS),​Y ;​get new vector address
 +6037   ​FD64 ​            
 +6038   ​FD64 ​            VECSK
 +6039   FD64 91 C3        STA (MEMUSS),​Y ;​save new address to temp
 +6040   FD66 99 B6 02    STA IRQVP,​Y ;​and to vector area
 +6041   FD69 88          DEY ;go to next one
 +6042   FD6A 10 F1        BPL VECLOOP ;​loop
 +6043   FD6C 60          RTS
 +6044   ​FD6D  ​
 +6045   ​FD6D ​            ;
 +6046   ​FD6D ​            ;​KERNAL Vectors
 +6047   ​FD6D ​            ;
 +6048   ​FD6D ​            ​KNRLSV
 +6053   FD73 0AF4
 +6054   FD7B F3F3
 +6055   FD7D 0EF27AF270F7 .dw ICHRIN, ICHROT, ISTOP, ​ IGETIN
 +6055   FD83 F5F1
 +6056   FD8B 85F6
 +Routine IOINIT
 + This routine initializes the VIAs. Lots of bit twiddling goes on 
 +here to set-up the various ports. This routine also starts the system ​
 +IRQ timer.
 +6138   ​FDF9 ​ ;​===================================================
 +6139   ​FDF9 ​ ; IOINIT - Initialize I/O registers
 +6140   ​FDF9 ​ ;
 +6141   ​FDF9 ​            ​IOINIT
 +6142   FDF9 A9 7F        LDA #​%01111111 ;​disable HW interrupts
 +6143   FDFB 8D 1E 91    STA D1IER ;​interrupt enable reg VIA1
 +6144   FDFE 8D 2E 91    STA D2IER ;​interrupt enable reg VIA2
 +6145   FE01 A9 40        LDA #​%01000000 ;​Sets tmr1/VIA2 to free-
 + ;​running;​ used for IRQ
 +6146   FE03 8D 2B 91    STA D2ACR ;VIA2 aux ctrl reg
 +6147   FE06 A9 40        LDA #​%01000000 ;​same for tmr1/VIA1. Used for
 + ; RS-232 timing
 +6148   FE08 8D 1B 91    STA D1ACR ;VIA1 aux ctrl reg
 +6149   FE0B A9 FE        LDA #​%11111110 ;​sets CA1/2, CB1/2 modes
 + ; CA2/CB2 manual H, CB1 pos
 + ; trig. CA1 negative trig.
 + ;​CA1=Restore
 + ;​CA2=cassette motor
 + ;​CB1=user port
 + ;​CB2=user port
 +6150   FE0D 8D 1C 91    STA D1PCR ;VIA1 periph ctrl reg
 +6151   FE10 A9 DE        LDA #​%11011110 ;​sets CA1/2, CB1/2 modes
 + ; CB2 manual L, CB1 pos
 + ; trig. CA1 negative trig.
 + ; CA2 manual H
 + ;​CA1=cassette read
 + ;​CA2=*SCLK
 + ;​CB1=*SRQIN
 +6152   FE12 8D 2C 91    STA D2PCR ;VIA2 periph ctrl reg
 +6153   FE15 A2 00        LDX #$00 ;DDR all bits IN
 +6154   FE17 8E 12 91    STX D1DDRB ;​VIA1/​B data dir reg
 +6155   FE1A A2 FF        LDX #​%11111111 ;​DDR all bits OUT
 +6156   FE1C 8E 22 91    STX D2DDRB ;​VIA2/​B data dir reg
 +6157   FE1F A2 00        LDX #$00 ;DDR all bits IN
 +6158   FE21 8E 23 91    STX D2DDRA ;​VIA2/​A data dir reg
 +6159   FE24 A2 80        LDX #​%10000000 ;​BIT7=OUT,​ BITS6-0 IN
 +6160   FE26 8E 13 91    STX D1DDRA ;​VIA1/​A data dir reg
 +6161   FE29 A2 00        LDX #$00
 +6162   FE2B 8E 1F 91    STX D1ORAH ;​VIA1 output reg A MSB
 +6163   FE2E 20 84 EF    JSR SCLK1 ;set IEEE clock line=1
 +6164   FE31 A9 82        LDA #​%10000010 ;​enable IER CA1/VIA1 RESTOR
 +6165   FE33 8D 1E 91    STA D1IER ;VIA1 IER BIT7=1
 +6166   FE36 20 8D EF    JSR SCLK0 ;set IEEE clock line=0
 +6167   ​FE39 ​            ;
 +6168   ​FE39 ​            ; ENABTM - Enable timers
 +6169   ​FE39 ​            ;
 +6170   ​FE39 ​            ​ENABTM
 +6171   FE39 A9 C0        LDA #​%11000000 ;​enable tmr1/VIA2 (IRQ)
 +6172   FE3B 8D 2E 91    STA D2IER ;VIA2 IER BIT7-6=1
 +6173   FE3E A9 89        LDA #​%10001001 ;​$89 IRQ tic divisor LSB
 +6174   FE40 8D 24 91    STA D2TM1L ;​VIA2 tmr1 LSB
 +6175   FE43 A9 42        LDA #​%01000010 ;​$42 IRQ tic divisor MSB
 +6176   FE45 8D 25 91    STA D2TM1L+1 ;​VIA2 tmr1 MSB
 +6177   FE48 60          RTS
 +Routine CINT1
 + This final routine initializes the character generator, sets the 
 +initial screen colors, clears the screen and "​homes"​ the cursor, and 
 +updates screen and cursor pointers.
 +1349   ​E518 ​ ;​==================================================
 +1350   ​E518 ​ ; CINT1 - Initialize I/O
 +1351   ​E518 ​ ;
 +1352   ​E518 ​  
 +1353   ​E518 ​            ;
 +1354   ​E518 ​            ;​Screen reset
 +1355   ​E518 ​            ;
 +1356   ​E518 ​            CINT1
 +1357   E518 20 BB E5    JSR IODEF1 ;set deflt I/O and init VIC
 +1358   E51B AD 88 02    LDA HIPAGE ;get screen memory page
 +1359   E51E 29 FD        AND #​%11111101 ;​$FD MS nibble is ChrROM and
 +1360   E520 0A          ASL A ;LS nibble is ChrRAM
 +1361   E521 0A          ASL A
 +1362   E522 09 80        ORA #​%10000000 ;​$80
 +1363   E524 8D 05 90    STA VRSTRT ;set chargen ROM to $8000
 +1364   E527 AD 88 02    LDA HIPAGE ;get screen mem page
 +1365   E52A 29 02        AND #​%00000010 ;​$02 check for screen RAM at 
 +1366   E52C F0 08        BEQ CINT1A ;​$E536 $1E page
 +1367   ​E52E ​            
 +1368   E52E A9 80        LDA #​%10000000 ;​$80 screen RAM is at $10 page
 +1369   E530 0D 02 90    ORA VRCOLS ;set Bit7
 +1370   E533 8D 02 90    STA VRCOLS
 +1371   ​E536 ​            
 +1372   ​E536 ​            ​CINT1A
 +1373   E536 A9 00        LDA #$00
 +1374   E538 8D 91 02    STA SHMODE ;​enable shift-C=
 +1375   E53B 85 CF        STA BLNON ;​start at no blink
 +1376   ​E53D ​            
 +1377   E53D A9 EA        LDA #$EA ;​FIXUP1+34;#​$DC
 +1378   E53F 8D 8F 02    STA FCEVAL
 +1379   E542 A9 EA        LDA #$EA ;​FIXUP1+35;#​$EB
 +1380   E544 8D 90 02    STA FCEVAL+1 ;​shift mode evaluation
 +1381   ​E547 ​            
 +1382   E547 A9 0A        LDA #$0A
 +1383   E549 8D 89 02    STA KBMAXL ;key buffer=16
 +1384   E54C 8D 8C 02    STA KRPTDL ;​repeat delay=16ms
 +1385   E54F A9 06        LDA #$06
 +1386   E551 8D 86 02    STA CLCODE ;​color=6(blue)
 +1387   E554 A9 04        LDA #$04
 +1388   E556 8D 8B 02    STA KRPTSP ;​repeat speed
 +1389   E559 A9 0C        LDA #$0C
 +1390   E55B 85 CD        STA BLNCT ;​blink timer=12ms
 +1391   E55D 85 CC        STA BLNSW ;set for solid cursor
 +1392   ​E55F ​            ;
 +1393   ​E55F ​            ; Clear screen
 +1394   ​E55F ​            ;
 +1395   ​E55F ​            ​CLRSCN
 +1396   E55F AD 88 02    LDA HIPAGE ;mem page for screen RAM
 +1397   E562 09 80        ORA #​%10000000 ;​$80
 +1398   E564 A8          TAY
 +1399   E565 A9 00        LDA #$00
 +1400   E567 AA          TAX
 +1401   ​E568 ​            
 +1402   ​E568 ​            ​CLRLP1
 +1403   E568 94 D9        STY SLLTBL,​X ;​address of screen line
 +1404   E56A 18          CLC
 +1405   E56B 69 16        ADC #$16 ;add 22
 +1406   E56D 90 01        BCC CLRSC1
 +1407   ​E56F ​            
 +1408   E56F C8          INY
 +1409   ​E570 ​            
 +1410   ​E570 ​            ​CLRSC1
 +1411   E570 E8          INX
 +1412   E571 E0 18        CPX #$18 ;all rows done?
 +1413   E573 D0 F3        BNE CLRLP1
 +1414   ​E575 ​            
 +1415   E575 A9 FF        LDA #$FF
 +1416   E577 95 D9        STA SLLTBL,X
 +1417   E579 A2 16        LDX #$16
 +1418   ​E57B ​            
 +1419   ​E57B ​            ​CLRLP2
 +1420   E57B 20 8D EA    JSR CLRLIN ;​clear line
 +1421   E57E CA          DEX
 +1422   E57F 10 FA        BPL CLRLP2
 +1423   ​E581 ​            ;
 +1424   ​E581 ​            ; "​Home"​ cursor
 +1425   ​E581 ​            ;
 +1426   ​E581 ​            ​HOME
 +1427   E581 A0 00        LDY #$00
 +1428   E583 84 D3        STY CSRIDX ;set column to 0
 +1429   E585 84 D6        STY CURROW ;and row to 0, too
 +1430   ​E587 ​            ;
 +1431   ​E587 ​            ; Set screen pointers
 +1432   ​E587 ​            ;
 +1433   ​E587 ​            ​SCNPTR
 +1434   E587 A6 D6        LDX CURROW
 +1435   E589 A5 D3        LDA CSRIDX
 +1436   ​E58B ​            
 +1437   ​E58B ​            ​SCNPLP
 +1438   E58B B4 D9        LDY SLLTBL,X
 +1439   E58D 30 08        BMI SCNPT1
 +1440   ​E58F ​            
 +1441   E58F 18          CLC
 +1442   E590 69 16        ADC #$16
 +1443   E592 85 D3        STA CSRIDX
 +1444   E594 CA          DEX
 +1445   E595 10 F4        BPL SCNPLP
 +1446   ​E597 ​            
 +1447   ​E597 ​            ​SCNPT1
 +1448   E597 B5 D9        LDA SLLTBL,X
 +1449   E599 29 03        AND #$03
 +1450   E59B 0D 88 02    ORA HIPAGE
 +1451   E59E 85 D2        STA LINPTR+1
 +1452   E5A0 BD FD ED    LDA LBSCAD,X
 +1453   E5A3 85 D1        STA LINPTR
 +1454   E5A5 A9 15        LDA #$15
 +1455   E5A7 E8          INX
 +1456   ​E5A8 ​            
 +1457   ​E5A8 ​            ​SCNLP1
 +1458   E5A8 B4 D9        LDY SLLTBL,X
 +1459   E5AA 30 06        BMI SCNEXIT
 +1460   ​E5AC ​            
 +1461   E5AC 18          CLC
 +1462   E5AD 69 16        ADC #$16
 +1463   E5AF E8          INX
 +1464   E5B0 10 F6        BPL SCNLP1
 +1465   ​E5B2 ​            
 +1466   ​E5B2 ​            ​SCNEXIT
 +1467   E5B2 85 D5        STA LINLEN
 +1468   E5B4 60          RTS ; return to init routine
 + CINT1 calls one routine, IODEF1, which resets the default input 
 +and output devices to the keyboard and screen, respectively,​ and then 
 +resets the VIC chip's registers to their default. The one interesting ​
 +thing to note is the PANIC entry at $E5B5. The Kernal does not call the 
 +PANIC entry! The Kernal bypasses the extra JSR by calling the IODEF1 code 
 +directly. Not that there is any magic in the PANIC entry; it only calls 
 +the IODEF1 code and homes the cursor. ​ I haven'​t yet examined the 
 +BASIC ROM, so it is possible that BASIC calls or vectors the PANIC entry.
 +1404   ​E5B5 ;​================================================
 +1405   ​E5B5 ;​ PANIC - Set I/O defaults (unused entry point??)
 +1406   ​E5B5 ;​
 +1407   ​E5B5 ​            ​PANIC
 +1408   E5B5 20 BB E5    JSR IODEF1 ;​reset devices and VIC regs
 +1409   E5B8 4C 81 E5    JMP HOME ;home cursor
 +1410   ​E5BB ​          ;
 +1411   ​E5BB ​          ; Real PANIC entry; reset default devices
 +1412   ​E5BB ​          ;
 +1413   ​E5BB ​            ​IODEF1
 +1414   E5BB A9 03        LDA #$03
 +1415   E5BD 85 9A        STA OUTDEV ;​reset output to screen
 +1416   E5BF A9 00        LDA #$00
 +1417   E5C1 85 99        STA INDEV ;​reset input to kbrd
 +1418   ​E5C3 ​            ;​
 +1419   ​E5C3 ​            ; Initialize 6561 VIC
 +1420   ​E5C3 ​            ;​
 +1421   E5C3 A2 10        LDX #$10 ;move 16 VIC registers
 +1422   ​E5C5 ​            
 +1423   ​E5C5 ​            
 +1424   ​E5C5 ​            ​V6561I
 +1425   E5C5 BD E3 ED    LDA VICSUP-1,​X ;​start at end of tbl
 +1426   E5C8 9D FF 8F    STA $8FFF,​X ;​start at end of regs
 +1427   E5CB CA          DEX ;​decrement index
 +1428   E5CC D0 F7        BNE V6561I ;do next register
 +1429   ​E5CE ​            
 +1430   E5CE 60          RTS
 +2789   ​EDE4 ​            ;
 +2790   ​EDE4 ​            ;VIC chip video control constants
 +2791   ​EDE4 ​            ;
 +2792   ​EDE4 ​            ​VICSUP
 +2793   ​EDE4 .db $05 ;bit7 interlace, 6-0 HCenter
 +.db $19 ;​VCenter
 +.db $16 ;​bit7=video address, 6-0 #rcols
 +.db $2E ;​bit6-1=#​rows,​ bit0=8x8 or 16x8 chars
 +.db $00 ;​current TV raster beam line
 +.db $C0 ;bit0-3 start of char memory
 + ;bit4-7 is rest of video address
 +;BITS 3,2,1,0 CM starting address
 +;             ​HEX ​  DEC
 +;0000   ​ROM ​ 8000  32768 *default
 +;0001        8400  33792
 +;0010        8800  34816
 +;0011        8C00  35840
 +;1000   ​RAM ​ 0000  0000
 +;1001        xxxx  }
 +;1010        xxxx  }unavail.
 +;1011        xxxx  }
 +;1100        1000  4096
 +;1101        1400  5120
 +;1110        1800  6144
 +;1111        1C00  7168
 +.db $00 ;Hpos of light pen
 +.db $00 ;Vpos of light pen
 +2794   ​EDEC .db $00 ;Digitized value of paddle X
 +.db $00 ;Digitized value of paddle Y
 +.db $00 ;Frequency for oscillator 1 (low)
 +.db $00 ;Frequency for oscillator 2 (medium)
 +.db $00 ;Frequency for oscillator 3 (high)
 +.db $00 ;Frequency of noise source
 +.db $00 ; bit0-3 sets volume of all sound
 +; bit4-7 are auxiliary color information
 +.db $1B ;Screen and border color register
 +; bits 4-7 select background color
 +; bits 0-2 select border color
 +; bit 3 selects inverted or normal mode
 + Next time, we'll examine more routines and answer any questions ​
 +that you may have.
 +.                                    C=H #17
 +NTSC-PAL fixing, part 1
 +Russel Reed <​>,​ Robin Harbron <​>,​ S. Judd
 + Just about everyone knows that there are differences between
 +NTSC and PAL machines. ​ Most people are also familiar with at least a 
 +few of the technical issues, such as extra raster lines, and know that
 +these differences can cause certain types of programs to fail, in particular
 +games and demos. ​ But how does one actually go about fixing one of
 +these programs? ​ It is that with which this series of articles is
 +  "​Fixing"​ has been a job which a number of 64 hackers have taken on, 
 +in order to enjoy programs written in another country. ​ Some of the major
 +game companies imported software from other countries and sometimes had
 +to reprogram the software so it would run correctly. ​ Fixing is not
 +always a simple task, but there are some techniques and strategies which
 +are useful in most cases. ​ There are only a few different classes of
 +problems overall.
 + We decided to approach this problem in a novel way: a pair of
 +yay-hoos (Robin and Steve) would place themselves under the tutelage
 +of an experienced fixer (Russ, a.k.a. Decomp/​Style),​ fix up a program,
 +and write up the results and experiences. ​ The first step was deciding
 +on which program (or programs) to fix.  It had to first of all be
 +fixable! ​ Since time is always in short supply it couldn'​t be a big, 
 +complicated project. ​ And finally, since we were just getting our fixing
 +feet wet, the actual fixing job needed to be fairly straightforward (so 
 +the big custom track-loading demo will have to wait for a future article).
 + But demos are the natural fixing candidate, and after viewing ​
 +several different ones, and examining the code to see what would need 
 +fixing, we finally decided on the demo "Slow Ideas",​ written by a couple
 +of crazy Finns way back in 1989.  It's a cool demo, was challenging and
 +fun to fix, and turned out to be pedagogically a good choice. ​ It's
 +two pages, and each page had a number of effects in need of fixing.
 +These pages, and the practical side of fixing, will be discussed in
 +some detail below, but first we need to review the differences between
 +PAL and NTSC, and discuss the implications of those differences,​ and
 +how they manifest themselves in programs.
 +NTSC/PAL Differences
 + There is really just one primary difference between NTSC and PAL 
 +machines: the graphics, which means VIC.  But since VIC generates the
 +machine clock cycles, that means the computers run at different speeds.
 +And since they run at different speeds, the CIA timers run at different
 +speeds, the SIDs run at different speeds, and the CPUs run at different
 +speeds. ​ So just having a slightly different graphics format affects
 +nearly every aspect of the machine'​s operation.
 +VIC and graphics
 + The PAL television standard is different from the NTSC standard.
 +The primary difference is actually the way color is encoded, but 
 +the main issue for the 64 is the frame rate, the number of raster ​
 +lines, and the number of cycles per raster line: 
 +VIC chip         Frame Rate  Raster lines  Cycles per line
 +6567R56A (NTSC): ​   60Hz          262          64
 +6567R8+ ​ (NTSC): ​   60Hz          263          65
 +6569     (PAL) :    50Hz          312          63
 +As the video raster beam sweeps across the television tube, VIC tells 
 +it what to display, one pixel at a time.  The CPU clock is exactly
 +1/8th of the "pixel clock",​ so that one CPU cycle corresponds to
 +eight pixels on the screen. ​ Thus, it is clear from the above table
 +that a PAL machine has 63*8 = 504 pixels per raster line.  320 of
 +those pixels comprise the visible display, while the other 184 make
 +up the left and right borders.
 + What is important here are the three numbers in the table, though.
 +First consider the number of cycles per line.  If a program is exactly
 +synchronized with the raster, then it can make precise changes to the 
 +screen merely by letting a certain number of cycles elapse. ​ A simple ​
 +example is making raster bars; a more involved example is opening the side 
 +borders, or generating an FLI display (which you can read about in previous
 +issues of C=Hacking). ​ Needless to say, programs which require exact cycle 
 +timings will fail when run on a machine with a different number of cycles ​
 +per line.  (Note that on older computers, like the old Atari 2600, the CPU
 +actually built the screen display, and so _all_ the screen code had to 
 +be exactly timed).
 + Next observe the different number of raster lines. ​ The visible ​
 +display begins on raster line 50, and there are 200 visible raster lines 
 +(320x200, you know). ​ That leaves 13 raster lines for the NTSC border, but 
 +62 raster lines for the PAL border. ​ This causes two problems. ​ Code which 
 +waits for a raster line greater than 263 will have problems on an NTSC 
 +machine. ​ A busy loop such as
 + LDA #$10 ;Wait for line 266
 + CMP $D012
 + BNE *-5
 + LDA $D011
 + BPL *-12
 +will never exit, while a raster IRQ will never occur. ​ The code must be 
 +adjusted to use a different raster line or another method of timing.  ​
 +Sprite graphics on lines greater than 263 will wrap around on an NTSC 
 +screen and in some cases the image will be displayed twice. ​ The sprites ​
 +must either be moved or their images truncated to correct this.
 + Moreover, the extra raster lines mean extra cycles per PAL frame.
 +Using the table above, an NTSC machine has 263*65 = 17095 cycles per frame, ​
 +and a PAL machine has 312*63 = 19656 cycles per frame. ​ In many demos and 
 +games those extra 2500 cycles are used to perform needed calculations and 
 +operations before the next frame begins, which leads to all kinds of
 +trouble on an NTSC machine. ​ Note that NTSC machines have more cycles ​
 +available in the _visible_ display, while PAL machines have many more
 +cycles available in the borders.
 + Finally, consider the frame rate.  On an NTSC machine, there are 
 +17095 cycles per frame, and 60 frames per second, giving 17095*60 = 1025700
 +cycles per second, or 1.02 MHz.  On a PAL machine, there are 19656*50 = 982800
 +cycles per second, or 0.98 MHz.  So although a PAL machine has more cycles
 +per frame, the CPU runs slightly slower than on an NTSC machine. ​ Thus a
 +game like Elite, which involves raw computation,​ runs a little faster
 +on an NTSC machine. ​ But for most games and demos it is the cycles per
 +frame which is important -- as long as all game calculations can get done
 +before the next frame, the game can run at the full frame rate.  Also
 +note that most tunes are synced to the screen, so when a PAL tune,
 +designed to play at 50 calls per second, is suddenly called 60 times
 +each second, it will play noticeably faster.
 + By the way, there actually aren't _exactly_ 50 or 60 frames per
 +second. ​ The frames per second is actually determined by the machine
 +clock rate, not the other way around! ​ The actual system clock rates are
 +14318181 / 14 = 1022727Hz for NTSC and 17734472 / 18 = 985248Hz for PAL.
 +Dividing by 17095 (PAL=19656) cycles per frame gives 59.826 frames/​second ​
 +NTSC and 50.124 fps PAL.
 + The important thing to remember here is that PAL machines run 
 +slightly slower than NTSC machines, but have many more cycles per video 
 + Just as a side node, the above calculation should indicate to you
 +that although the AC electricity lines are 60Hz in the US and 50Hz in Europe, ​
 +that has nothing to do with the 50/60Hz PAL/NTSC frame rates. ​ Not only can 
 +an NTSC monitor easily display a 50Hz signal (let alone 59.826Hz), but the 
 +actual power frequency fluctuates around that 50/60Hz anyways, so that 
 +the AC line frequency is only 50/60Hz on _average_ -- good enough to
 +run a clock, but not nearly precise enough to generate a video signal.
 + Also note that there are two different NTSC VIC chips in the table.
 +The 64 cycles/line VIC was present in the earliest 64s shipped. ​ This is 
 +actually a bug in the chip, though, and 65 cycles/line is the "​correct",​
 +not to mention most common, NTSC VIC chip, and the one which we will
 +refer to in this article. ​ Oh?  You don't believe me?  Well, from the 
 +March 1985 IEEE Spectrum article:
 + In addition to the difficuly with the ROM [sparklies],​ "I made
 + a logic error,"​ Charpentier recalled. ​ The error, which was 
 + corrected sometime after Charpentier left Commodore, caused ​
 + the early C-64s to generate the wrong number of clock cycles ​
 + on each horizontal video line.  "It was off by one," he said.
 + "​Instead of 65 cycles per line, I had 64."
 + As a result, the 180-degree phase shift between the black-and-white
 + and color information,​ which would have eliminated color transition
 + problems, didn't occur. ​ Depending on their color and the color of
 + the background, the edges of some objects on the screen would appear
 + slightly out of line...
 +There ya go!
 +Machine cycles
 + The tiny CPU speed difference has a number of important ramifications.
 +It means that the CIA timers on a PAL machine run slightly slower than on 
 +an NTSC machine, so timer values may have to be recalibrated;​ note that the
 +system CIA interrupt has a different setting for PAL and NTSC.  Moreover, ​
 +a disk drive runs at exactly 1MHz -- there are no "PAL disk drives"​ -- which 
 +means cycle-exact fastloaders will not synchronize correctly on 
 +different-speed machines. ​ And finally, it means that SID works differently.
 + Not only will the tempo of any interrupt-based tune (raster _or_ CIA) 
 +change, but the actual pitch will change as well.  SID generates its waveforms
 +by simply updating an internal counter every cycle, so an NTSC SID is 
 +essentially playing a digital sample at 1.02 MHz.  When that sample is 
 +played at 0.98 MHz, it's like slowing a record player down a little -- the 
 +pitch decreases. ​ To be specific, the _absolute_ pitch changes, but the 
 +_relative_ pitch between notes does not; the tune plays the same, but
 +the pitches are all a little over a quarter-step lower.  ​
 + Practically speaking, this is totally irrelevant to fixing.
 +Only the tune speed is of significant interest.
 +Fixing the Problems
 + The previous section described different classes of problems ​
 +which can occur from the different cycles per line, lines per frame, ​
 +cycles per frame, and cycles per second. ​ These problems include screen ​
 +syncs, bad interrupts and infinite busy-loops, too many cycles per frame,
 +mis-timed timers and fastloaders,​ and different music tempos.
 + Fixing tunes is easy.  For programs in which the interrupt frequency
 +is otherwise unimportant,​ the interrupt timer source can be adjusted to
 +the correct frequency. ​ In other cases, if the interrupt frequency
 +cannot be changed, the music speed may be adjusted by calling the play
 +routine twice on every sixth interrupt or not calling it on the sixth
 +interrupt. ​ If perfection is needed, the music data might be adjusted or
 +the music routine rewritten to work at the different frequency. ​ Much of
 +the time none of these are necessary, as the music sounds fine at the
 +different speed anyway.  ​
 + Next consider too many cycles per frame. ​ In order to fix this 
 +class of problems, we must improve the efficiency of the code.  There are
 +three cases here.  Sometimes busy waits are used to synchronize the code 
 +with a raster line on the screen. ​ This wastes cycles and imposes some 
 +restrictions which result in additional wasted cycles. ​ These busy waits 
 +may sometimes be replaced by raster interrupts which make better use of 
 +the available cycles.
 + At other times, programmers will set up all graphic updates so that
 +the code executes in the vertical borders. ​ NTSC machines have a big 
 +disadvantage here, as we've seen earlier. ​ Usually this code can be 
 +rearranged to take advantage of the available cycles during the screen ​
 +display. ​ This can be tricky, as you don't want to be updating the screen ​
 +at the same time it is being displayed, but by splitting updates up it can 
 +usually be pulled off.
 + Finally there are some cases in which neither of these techniques ​
 +is of use.  For a perfect fix, the code must be optimized, often by
 +sacrificing memory. ​ If it can't be optimized, then something has to go;
 +effects can be truncated or updates slowed down so that less is updated
 +each frame.
 + Next consider the different cycles per raster line.  In most 
 +cases, this difference is not significant. ​ The routines for which it is
 +significant are raster routines, where synchronization with the video
 +chip is established by using exactly the right number of processor
 +cycles. ​ FLI, VSP, and color bars are all affected by this.  Color bars
 +will have flicker and look crooked; VSP effects will often be shifted
 +the wrong amount; FLI routines usually either repeat the top row of
 +graphics all the way down the screen or else don't display at all. 
 +These problems can all be corrected by adding or subtracting the correct
 +number of cycles per raster line.  In most cases, this class of problem is 
 +actually the easiest to fix.
 + There are several approaches to putting the right number of cycles ​
 +in for the fix.  If the source is available, inserting a NOP instruction
 +may be all that is required for an NTSC fix.  Without the source, a
 +modified routine may be inserted into some empty memory and the original
 +routine bypassed. ​ Sometimes the code may be shuffled around enough to
 +insert a NOP with a machine language monitor. ​ You can sometimes change
 +the opcodes used to use a different number of cycles. ​ Consider the
 +delays that can be added with just two bytes:
 + CMP #$EA ;2 cycles
 + BIT $EA ;3 cycles
 + NOP NOP ;4 cycles
 + INC $EA ;5 cycles
 + CMP ($EA,X) ;6 cycles
 +This gives lots of room to work with, assuming the flags aren't important.
 +If .X or .Y is unused, then
 + STA $1234 
 +can be changed to 
 + STA $1234,X or STA $1234,​Y ​
 +to gain an extra cycle (if .Y=0). ​ If .Y is known to contain a fixed 
 +value like $FF, the target can be adjusted so that STA $1234 becomes ​
 +STA $1135,​Y. ​ Usually in an FLI routine, you'll see STA $D018, STA $D011,
 +and STA $D016 together. ​ Two of these can be replaced with indexed opcodes
 +to add the two extra cycles needed for an NTSC fix.
 + Also note that when sprites are active, a different amount
 +of cycles may get stolen depending on which instruction is executing
 +when the sprite data is read.  C=Hacking #3 discusses this in detail,
 +and explains how it may be used to synchronize code with the raster beam.
 +From a fixing standpoint, it means that you don't always want to add
 +two instruction cycles to convert a piece of PAL code to NTSC; sometimes,
 +as in the demo below, only _one_ cycle must be added to fix certain routines, ​
 +with the other cycle being eaten by VIC.
 + Sprites have a few differences between PAL and NTSC as well; like 
 +the other differences they are not evident in many programs. ​ The horizontal
 +sprite positions start at zero on the left side of the screen and
 +increase as you move to the right. ​ At some point past position 300, the
 +positions wrap back around to the left side of the screen. ​ For these
 +high horizontal positions, the sprites are 8 pixels farther right on a
 +PAL screen than they would be on an NTSC screen. ​ When these high
 +positions are used to create sprites that extend all the way to the left
 +edge of the visible screen, problems show up.  A PAL routine will have
 +an eight pixel wide gap in the sprites on an NTSC machine while an NTSC
 +routine will have eight pixels of overlap on a PAL machine. ​ Often these
 +positions are stored in a table or loaded into a register with immediate
 +addressing. ​ In this case it is trivial to adjust the value by eight in
 +the appropriate direction. ​ The vertical sprite registers present similar
 +behavior, which is seen even less often.
 + As was stated earlier, the 65xx processors in the 64 and 1541 run 
 +at close to the same frequency, but the ratio of the 64's processor ​
 +frequency to the 1541's processor frequency is not exactly 1.0, and the
 +ratio is different on PAL and NTSC systems. ​ This means that although the 
 +6510 in the 64 and the 6502 in the 1541 may be synchronized at the start of 
 +a section of code, after executing the same number of cycles they will no
 +longer be synchronized. ​ Cycles may be added or subtracted on either
 +processor to bring them back into sync, but this adjustment varies
 +depending on either a PAL or NTSC system. ​ This is most often seen in
 +fastloaders,​ where the code depends on the two processors being in sync
 +in order to transmit 8 bits one at a time or four pairs of two bits
 +without the need for handshaking. ​ As with raster routines, a few cycles
 +need to be added or subtracted for a fix.  Finding the right place to
 +insert or remove these cycles can be challenging. ​ Instead of trying to
 +remove cycles from the 64 routine, which may be impossible, you can instead
 +e.g. add cycles to the complementary drive routine.
 +  Finally, understanding what problems may arise and what should be 
 +done to correct them is only part of the skillset needed by a good fixer.  ​
 +In the best case, you'll be working with source code which you've written
 +and understand. ​ In other cases, someone else may have written the code,
 +in which case you must study and understand it before you can start
 +trying to fix any NTSC/PAL problems. ​ Often the source code isn't even
 +available, leaving you to work in a machine language monitor. ​ This
 +imposes some new restraints. ​ In a monitor, it can be difficult to shift 
 +blocks of code around. ​ Several techniques are available to work around ​
 +this.  Code can easily be left out by replacing it with NOP instructions ​
 +or inserting JMPs to branch around sections which aren't needed. ​ Extra 
 +code can be patched in with a JMP instruction to the new code and another ​
 +JMP instruction at its end to return to the old routines. ​ Where cycles ​
 +need to be added, opcodes may be changed; for example, a LDA $ABCD might 
 +be replaced with LDA $ABCD,Y to add an extra cycle. ​ Finally, you may
 +simply have to resort to a symbolic disassembly (i.e. disassemble into
 +source code). ​ This may become an absolute necessity in some cases.
 + Then there are the cases where the object code isn't even readily
 +accessible. ​ Some of the same skills used by crackers to remove copy
 +protection are valuable. ​ Most commonly, games, tools, and demos will be
 +compressed to make the files shorter, adding a single BASIC SYS command
 +line as well.  These routines rarely try to be deceptive and are fairly
 +short, so they can be undone, often just by modifying the existing
 +decompression code a bit.  It is common practice to have a sequence
 +cruncher on top of an RLE (or equal-byte) packer or linker. ​ Within the
 +"​scene",​ there will be intros to wade past, while commercial software
 +will often have disk or tape copy protection and purposefully obtuse code.
 + Just remember that fixing is often about creativity and diligence.
 +There is a common set of problems encountered and many creative solutions ​
 +to correct them.  With practice, it becomes easier, just like anything else.
 +You'll have an easier time of it if you pick your battles carefully.  ​
 +Remember that some problems just can't be fixed perfectly. ​ If you get 
 +stuck, try a different challenge and perhaps come back to the problem ​
 +later. ​ And most of all, remember to have some fun!
 + With all this in mind, let's have a look at that demo!
 +Slow Ideas, page 1
 + Page 1 is divided into basically three parts. ​ The upper part
 +of the screen contains a stretching sprite tech-tech extending into
 +the borders, overlayed on top of raster bars.  Then the rest of the 
 +screen features a large Pu-239 picture/​logo. ​ Finally, a bouncing
 +sprite scroll takes place in the lower portion of the screen, extending
 +into the lower (but not side) borders.
 + The screen is built using two interrupt routines, one located
 +at $1240 and the other located at $1280. ​ The $1240 routine is short
 +and occurs at the bottom of the screen (raster $F8 or so).  It removes
 +the upper/lower borders, performs some calculations,​ and calls the
 +music. ​ The $1280 routine basically controlls the screen, and occurs
 +at raster line $xx, at the top of the screen. ​ It generates the tech-tech ​
 +sideborder display, and performs most of the calculations. ​ Both routines
 +are vectored through $0314, not $FFFE.
 + Fortunately,​ there is a lot of distance between different routines, ​
 +that is, there is a lot of empty memory between routines -- probably it was 
 +coded in an ML monitor. ​ This means that adding patches, or shifting ​
 +routines around, is much easier. ​ What a forward-thinking guy that
 +Pasi was, to realize that it would need to be fixed by C=Hacking one
 + When run, the demo is a mess.  The most prominent defects are
 +that the music plays very slowly, the screen flashes, with the main
 +picture flickering between the top of the screen and the lower portion
 +of the screen, the side borders are not open in the tech-tech, and
 +sometimes the scrolling sprites are off the screen.
 +Too many cycles:
 + The half-speed music and flickering screen indicates that interrupts ​
 +are getting skipped, which points a finger straight at too many cycles -- if
 +the next interrupt is set *after* it is supposed to occur, then a whole
 +frame will of course pass by before it actually occurs.
 + The first question is, where are the cycles getting eaten? ​ That is,
 +of the two interrupt routines, which is using too many cycles. ​ The answer
 +is immediately obvious: any interrupts that take place totally on the screen ​
 +have *extra* cycles available -- 2 cycles per raster line.  The loss of cycles
 +comes in the lower border, which means the $1240 interrupt:
 + jsr tune
 + jsr blah1
 +  jsr blah2
 + LDA $D012
 + CMP #$1E
 + BCC *-7
 +First note that curious $D012 code.  It surely is meant to compare with
 +line $011E, not line $1E.  Line $011E doesn'​t ever occur on an NTSC machine,
 +though, so it actually waits until line $1E, well past the $1280 interrupt.
 +So the first order of business is to BIT that BCC out of existence.
 + Alas, this still does not fix up the flickering. ​ The next step
 +is to BIT ($2C) out the JSR tune call, to see if the problem really is
 +cycles. ​ And sure enough, ditching the tune gives a suddenly stable, or
 +at least mostly stable, screen. ​ BITting out the other two subroutines,​
 +but keeping the music, gives an unstable screen; the tune simply has to 
 +go somewhere else.
 + Finding extra cycles takes a bit of work; for now, it is enough to
 +$2C-BIT out the JSR TUNE and focus on the other problems.
 + Let's have a look at the tech-tech routine:
 + $1280 LDA #$96
 + STA $DD00
 + LDY #$FF
 + BIT $EA
 + NOP
 + NOP
 + LDX #$09
 + DEX
 + BNE *-3
 + $1290 LDX #$5F
 + $1292 LDA $1000,X
 + STA $D018
 + DEC $D016 ;Open border
 + STA $D021
 + INC $D016
 + LDA $1060,X
 + $12A4 STA $D017
 + STY $D017
 + LDA $1100,X
 + STA $D011
 + DEX
 + $12B1 LDA $1000,X
 + STA $D018
 + DEC $D016 ;Open border
 + STA $D021
 + INC $D016
 + LDA $1060,X
 + STA $D017
 + STY $D017
 + $12C9 BIT $EA
 + NOP
 + DEX
 + BPL $1292
 +The routine has three parts: the initial delay at $1280, and a two-part
 +loop, the first part at $1292 and the second at $12B1. ​ The difference
 +between the two is the LDA $1100,X STA $D011 at $12AA; without two
 +parts to the loop, the branch would take too many cycles. ​ Each part will
 +need fixing, since each part uses at least one raster.
 + Opening the side borders requires exact timing, yet the above
 +routine is entered through $0314, which will always have some cycle
 +variance. ​ The trick here is that all eight sprites are active; Pasi
 +himself wrote a nice article in C=Hacking #3 on using sprites to
 +synchronize the raster. ​ The basic idea is to get the CPU to wait on
 +a specific instruction;​ VIC then frees up the bus on a specific cycle,
 +and you know exactly where you are on the raster line.
 + First to fix is the initial line delay. ​ Since there are +2 extra
 +NTSC cycles per raster line, it seems reasonable to first try adding
 ++2 cycles to the delay:
 + 1287 BIT $EA to AND ($00),Y ;+2 cycles
 +      ​NOP NOP
 +      ​NOP NOP
 +Really, CMP ($EA),Y is in general a better choice since it doesn'​t
 +affect .A, but it doesn'​t matter here and we were young and naive,
 +besides. ​ If the above +2 cycles aren't enough, it will be clear
 +soon enough (but it turns out they are enough).
 + Next up is the borders. ​ Although the immediate impulse is
 +to add +2 cycles per raster line -- +2 cycles to both loop parts --
 +it's not obvious where the raster sync is taking place, and +2 cycles
 +might cause the sync to fail.  In fact, what's needed is just +1 cycle 
 +per part:
 + 12A4 STA $D017 to STA $CF18,Y ;+1 cycle
 + 12C9 BIT $EA to INC $00EA ;+1 cycle
 +      NOP
 +The $12A4 fix uses the fact that .Y=$FF. ​ A NOP NOP NOP would have done
 +the trick at $12C9 and be safer, but we used INC in our reckless youth.
 + And suddenly -- poof!  The borders open up and the screen stops
 +flickering. ​ Why should the screen flicker at all?  Remember that the
 +loop is split in two because of an STA $D011 instruction. ​ This STA
 +pushes badlines off, so that the timing stays precise; since badlines
 +never occur, the graphics data is never fetched. ​ It's only after the
 +rasterbars that VIC starts fetching graphics data; without the $D011
 +push, this data will appear on the first visible rasterline (hence the
 +earlier flicker with too many cycles). ​ When an imperfect (from incorrect ​
 +cycle timing) push takes place, the picture can get the jitters. ​ So
 +now we know why the demo behaved as it did, earlier.
 +Sprite scroll:
 + And yet... the screen still flickers, when sprites bounce down
 +too low.  Clearly the sprites are eating into the cycles needed by the
 +lower border routines. ​ The simplest way to fix this is of course to
 +change the y-coordinates of all the sprites. ​ One option is to re-do
 +all the coordinate tables. ​ But a much easier option exists: figure
 +out which code stores the sprite coordinates,​ and subtract a fixed
 +amount from each of the y-coordinates. ​ This routine just happens
 +to be located at $1590, and the simple insertion of code
 + 15AF STA $D001,Y to CLC
 +      ​... SBC #$0E
 + STA $D001,Y
 +fixes things up just dandy.
 +Still too many cycles:
 + We still have the problem of what to do about the tune.  Since
 +there aren't enough cycles in the $1240 interrupt, they need to be
 +found somewhere else, and the only somewhere else is the $1280 interrupt,
 +during the logo display. ​ The first thing to figure out is how many
 +lines are needed, and how many lines are free.  This is easy enough,
 +by simply moving the JSR TUNE to the end of the $1280 routine,
 +sandwiched between an INC $D020 and a DEC $D020. ​ The border will
 +then indicate the end of the $1280 interrupt as well as the size
 +of the tune.
 + The good news is that $1280 has a fair amount of extra
 +cycles available. ​ The bad news is that the music is fairly inefficient,​
 +and needs even more cycles. ​ But not many more -- just a good 8-12 raster
 +lines. ​ $1280 is pretty large, so maybe by rewriting some code enough
 +cycles can be gained to make it all work.  Note that if $1280 only
 +had a few raster lines to spare, this task would be much more difficult
 +(if not impossible).
 + Towards the end of the $1280 routine, there are a series of
 +subroutine calls. ​ One of them is a JSR $1E50. ​ This code has two loops, ​
 +one which copies values from a table at $1900 to a table at $1000, and 
 +one which ORAs a value into the $1000 table. ​ Instead of copying
 +and then ORAing, why not just combine the two loops?
 +1E50 LDY $1FC6
 +1E53 LDX #$5C
 +1E55 LDA $1900,Y
 +1E58 STA $1001,Y
 +1E5B DEY
 +1E5C DEX
 +1E5D BPL $1E55 LDY $1FC6
 +1E5F NOP
 +1E60 LDA $1FC1
 +1E63 STA $1E70
 +1E66 LDA $1FC2
 +1E69 STA $1E6F
 +1E6C LDX #$5F LDX #$5C
 +1E6E LDA $32B6
 +1E71 ORA $1000,​X ORA $1900,Y
 +1E74 STA $1000,​X STA $1001,X
 +1E77 INC $1E6F
 +1E7A BNE $1E8B
 +1E7C INC $1E70
 +1E7F LDA $1E70
 +1E82 CMP #$34
 +1E84 BNE $1E8B
 +1E86 LDA #$30
 +1E88 STA $1E70
 +1E8C BPL $1E6E DEX
 +1E8E RTS BPL $1E6E
 +On the right are the patches we added, along with replacing the JSR $1E50
 +with JSR $1E5D. ​ Instead of copying from $1900 to $1000 and then ORAing ​
 +into $1000, it simply ORAs to $1900 and stores it in $1000 ($1001 actually, ​
 +since that's where the first loop stored stuff). ​ Sharp-eyed readers may 
 +have noticed that the patch affects $1001-$105D,​ whereas the second loop 
 +affected $1000-$105F;​ doesn'​t the patch above lose some bytes? ​ Of course ​
 +it does, what's your point? :).  The ancient hacker technique applies
 +here: try it, and if it works, don't touch it and don't ask questions!
 +Better than a huge rewrite of self-modifying code.
 +Wrapping up:
 + With the above fix in place, and the tune moved to the $1280
 +interrupt, the demo finally seems to work great. ​ All that remains is
 +to save it and crunch. ​ Figuring out which areas of memory are used
 +is easy enough, by looking at the disassembler and the initialization
 +code (which unpacks the code further). ​ But after crunching -- uh oh,
 +lockup. ​ A program freeze shows that it is still running, but that the 
 +interrupts are not occuring. ​ A glance at the setup code shows that $D012 
 +is set, but $D011 is never set -- presumably the high bit is set, so the 
 +interrupt occurs on a nonexistant raster line.  Adding a simple
 + LDA #$1B
 + STA $D011
 +to the initialization routine at $1443 fixes that up just fine, and
 +the program decrunches successfully. ​ Woo-hoo! ​ One page down and
 +one to go.
 +Slow Ideas, page 2
 + Page 2 has essentially four visible parts: a tech-tech at the top of
 +the screen, followed by a swinging FALSTAFF sprite on top of rasters and
 +open borders, followed by the ubiquitous Pu-239 pic/logo, followed finally
 +by a sprite scroll on top of rasters with open side and bottom borders.
 + All this is done with two interrupts, one ($11FE) occuring at
 +raster line $31, the other ($1350) occuring at line $D1.  The $31 raster
 +performs the tech-tech, the FALSTAFF rasters, and also performs some
 +calculations for various sprite scroll effects. ​ The $D1 raster handles the 
 +lower rasters and scrolling sprites, and also plays the music, scrolls the 
 +sprites, and does the calculations for the FALSTAFF sprite.
 + When run, the screen is quite a jumble -- too many rasters -- and
 +needless to say, the screen effects need retiming. ​ The music really sounds ​
 +better at 50Hz, so it needs to be retimed as well.
 +Top to bottom:
 + Before fixing the timing problems, the extra cycles need to be
 +addressed. ​ In the $D1 interrupt, after the lower rasters are displayed
 +there are a series of subroutine calls at $13B3:
 + $13B3 JSR $211C ;Tune
 + JSR $0F80 ;​Scroll sprites
 + JSR $1300 ;​Clear $7Fxx tables (used by JSR $0E00
 + ;  in $31 interrupt)
 + JSR $1D00 ;​FALSTAFF sprite
 +Deducing the function of each routine is easy enough -- just BIT it out
 +and see what happens. ​ BITting out the first three subroutines frees up
 +enough cycles to make the screen stabilize. ​ Finding cycles is usually
 +more work than fixing up timing -- especially three subroutines worth! --
 +so it is enough to $2C-BIT out the three subroutine calls for now and fix 
 +up the timing first.
 + At the top of the screen is a normal tech-tech, controlled by
 +the routine at $11FE. ​ The code flows roughly as follows:
 + $11FE Set up $D020/​$D021,​ waste a few cycles to get timing right
 + LDX #$08
 + $1214 LDA $xxxx,Y
 + STA $D018
 + LDA $xxxx,Y
 + STA $D016
 + INY
 + DEX
 + Change VIC bank ($DD00)
 + CPX #$00
 + BEQ $123F
 + NOP 
 + NOP 
 + NOP
 + ...
 + CPY #$2F
 + BCC $1214
 + JMP $125E
 +The thing to recognize here is that there are two loops -- on every eighth ​
 +line the BEQ $123F branch is taken. ​ A simple cycle count shows that the 
 +first loop takes 63 cycles, and the BEQ branch adds an extra 20 cycles:
 +obviously, these are simply timed for the normal and badlines. ​ So all
 +that is needed here is to add 2 cycles to each loop: I changed the
 +CPX #$00 above to INX DEX for the first loop, and changed a NOP NOP
 +to a CMP ($EA,X) in the $123F branch. ​ With the raster timing correct,
 +the next step is the initial timing at $11FE. ​ By doing a little rearranging
 +of code two bytes can be freed up, giving 2-6 cycles to fiddle with:
 + $11FE LDY #$0B LDX #$0B
 + LDA #$01 LDA #$01
 + STA $D019 STA $D019
 + LDX #$02 LDY #$02
 + $1207 DEX DEY
 + BNE $1207 BNE $1207
 + STY $D020 STX $D020
 + STY $D021 STX $D021
 + LDY #$00 NOP NOP
 + LDX #$08 LDX #$08
 + $1214 tech-tech loop
 +As you can see, by using .Y in the delay loop instead of .X, the LDY #$00
 +instruction becomes redundant. ​ As it turns out, just +2 cycles are needed.
 +Finally, there is the matter of the last raster line.  When the JMP $125E
 +is taken, there is some delay before changing the border and background ​
 +registers, to get a nice solid line.  It turns out that four extra cycles
 +are needed here, and fortunately there just happens to be several padding
 +bytes before $125E. ​ By changing the JMP $125E to a JMP $125C, two
 +extra NOPs are easily inserted. ​ That takes care of the tech-tech.
 +Falstaff rasters:
 + Immediately after the tech-tech are the rasters and open borders
 +behind the FALSTAFF sprite. ​ JSR $0EA0 handles this part.  We already ​
 +fixed a sideborder routine in the first page, and this one is similar:
 + $0EA0 LDX #​$05 ;​Initial delay
 + DEX
 + BNE *-3
 + LDX #$15
 + CMP #$EA
 + $0EA9 BIT $EA ;​Change to NOP NOP for +1 cycle
 + LDA $xxxx,X
 + DEC $D016
 + STA $D021
 + INC $D016
 + STA $D020
 + LDA $xxxx,X
 + STA $D011
 + NOP
 + NOP
 + NOP
 + DEX
 + BNE $0EA9
 + LDX #​$02 ;​Change to LDX #$03 for last line
 + delay loop
 +As before, +1 cycle needs to be added to the border loop, which is
 +easy enough to do by changing the $0EA9 instruction. ​ The end delay
 +also needs a little change, to make the last line nice and solid
 +(there are no sprites active on the last lines). ​ Finally, the
 +initial delay needs to be changed, to line up the routine correctly.
 +I changed a CMP #$EA at $1277, just before the JSR $0EA0 call, to
 +NOP NOP.  As usual in this type of thing, it was easy to simply
 +experiment with the initial timing until the borders opened up.
 +Bottom rasters and sprite scroll:
 + Finally, the bottom rasters need fixing up.  Once again, the
 +loop looks familiar:
 +   $135D LDY #$FF
 + LDX #$55
 + CMP #$EA
 + NOP
 + NOP
 + NOP
 +   $1366 LDA $7F02,X
 + STA $CF18,​Y ;​Stretch if necessary
 + STY $D017
 + LDA $1504,X
 + STA $D011
 + LDA $1C32,X
 + DEC $D016 ;​Open border
 + STA $D021
 + INC $D016
 + DEX
 + BNE $1366
 +And once again, we need to add a cycle. ​ As with page 1, .Y=$FF means
 +that the STA $D011 can be changed to a $CF12,​Y. ​ Changing the CMP #$EA 
 +to EA EA adds the two cycles needed in the init to get everything aligned,
 +and poof -- instant open borders (as long as all the sprites are being
 +displayed correctly).
 + When the scroll is activated, something else becomes apparent:
 +as the sprites scroll into the left border, they "​pop"​ to the left by
 +several extra pixels, i.e. the screen coordinates change by too much.
 +To fix the sprite popping, all that was needed was to add 8 to $0F00
 +(the sprite coordinate).
 +Too many cycles:
 + Alas, the moment we've been dreading has arrived. ​ Where in the
 +world do we find enough cycles for three subroutines? ​ One option is
 +to get rid of some effects -- for example, the sprite scroller has
 +many time-consuming effects that can occur; by getting rid of those,
 +the tune can be placed in the $31 raster. ​ Working through the code
 +doesn'​t reveal subroutines that can easily be rewritten. ​ Still, it
 +sure would be nice to get all the effects going; but where to get the
 + A little thinking suggests something, though -- the PAL
 +routine can't have THAT many extra cycles, simply because it doesn'​t
 +get cycles until the raster routine is finished, at the bottom of
 +the screen... oh duh.  A glance at the raster routine shows that
 +it covers $55 rasters. ​ The interrupt takes place on line $D1,
 +which means that the rasterbars extend all the way down to line 
 +$0127 = 294 or so, which is over 30 rasters past the last NTSC line.
 +So suddenly there'​s a great big chunk of cycles, for free:
 + (Orig PAL) (Fixes)
 +   $135D LDY #$FF
 + LDX #$55 LDX #$35 ;Only $35 raster lines
 + CMP #$EA NOP NOP ;+2 cycles
 + NOP
 + NOP
 + NOP
 +   $1366 LDA $7F02,X LDA $7F22,​X ;​Add $20 to compensate for .X
 + STA $CF18,Y
 + STY $D017
 + LDA $1504,X LDA $1524,​X ;​Add $20 ...
 + STA $D011 STA $CF12,X ;+1 cycle to open borders
 + LDA $1C32,X LDA $1C52,​X ;​Fix probably not needed
 + DEC $D016 ;​Open border
 + STA $D021
 + INC $D016
 + DEX
 + BNE $1366
 +Note that the table offsets need to be increased; without that, the sprites
 +start stretching all over the place and everything goes weird. ​ With $35 
 +raster lines instead of $55, the rasters go all the way down to the last 
 +NTSC lines, and free up just enough cycles for two of the $13B3 subroutines ​
 +to be re-enabled:
 +   $13B3 JSR $211C ;Tune
 + JSR $0F80 ;​Scroll sprites
 + BIT $1300 ;​Clears $7Fxx tables (sprite scroll)
 + JSR $1D00 ;​FALSTAFF swing
 +That pesky $1300 subroutine just pushes it over the edge, though:
 + $1300 LDX #$2F
 + LDA #$00
 + STA $7F01,X
 + STA $7F2D,X
 + DEX
 + BNE *-9
 +It doesn'​t take *that* many cycles, so one option is to use even less 
 +than $35 rasters. ​ But this makes the bottom part look awfully small, and
 +obscures parts of the scroll.
 + If you think about it for a moment, there are some free cycles
 +still hanging around elsewhere in the code: in the tech-tech routine!
 +There are lots of NOPs inside of the tech-tech loops; enough to piggy-back
 +the $1300 routine into it without much effort:
 + $11FE LDX #$0B LDX #$0B
 + LDA #$01 LSR $D019 ;​Still 6 cycles, -2 bytes
 + STA $D019 LDY #$02
 + LDY #$02 DEY
 + $1207 DEY BNE $1205
 + BNE $1207 STA $D020
 + STX $D020 STX $D021
 + STX $D021 LDX #$08
 + $1210 NOP NOP
 + $1212 LDX #$08 LDA $1887,Y
 + LDA $1887,Y STA $D018
 + STA $D018 LDA $1B87,Y
 + LDA $1B87,Y STA $D016
 + STA $D016 INY
 + INY ROL ;​Replaces LSR LSR ...
 + NOP AND #$03
 + NOP STA $DD00
 + LSR LDA #$00
 + LSR STA $7F01,Y
 + LSR STA $7F2A,Y
 + LSR BEQ $123F ;​Every 8th raster
 + LSR
 + STA $DD00
 + INX
 + DEX
 + BEQ $123F
 +Using LSR $D019 instead of LDA #$01 STA $D019 saves 2-bytes (6502
 +instructions like LSR use a "​read-modify-write"​ cycle, and write #$FF
 +to the register while LSR is working, which is why LSR $D019 clears $D019!)
 +while still using 6 cycles. ​ Replacing the LSR LSR ... stuff with the
 +ROL code saves another byte.  Moving the DEX down saves two more bytes,
 +and getting rid of the NOPs saves three more, for a total of 8 bytes --
 +just enough for the LDA STA STA which replaces the JSR $1300 code.
 +The new code uses two fewer cycles, however -- but by rearranging the
 +initialization code we can just branch to a NOP at $1211 above.
 + Every eighth raster the branch is made to $123F. ​ This routine
 +performs another INY, which has the effect of skipping every eighth entry 
 +in the $7Fxx tables. ​ Thus extra STAs need to be added:
 +   $123F BIT $EA BIT $EA
 + CMP ($EA,X) STA $7F02,Y
 + NOP STA $7F2B,Y
 + NOP
 + ...
 + The two STA xxxx,Y each take 5 cycles, so 10 cycles exactly replaces ​
 +the 10 cycles used by CMP ($EA,X) NOP NOP.  But since the STAs take two more 
 +bytes, the rest of the routine needs to be moved forwards; luckily, there 
 +are empty padding bytes immediately following the routine, so it moves
 +forwards without a hitch...
 + Well, maybe with one hitch. ​ The whole tech-tech routine uses
 +self-modifying code all over the place. ​ The modifying instructions
 +immediately follow the routine, and so need to be adjusted to the
 +new locations.
 + Finally, for some strange reason, the above fixes also affect
 +the FALSTAFF rasters. ​ Adding an extra +2 cycles to the intial FALSTAFF
 +delay fixes that up, though.
 + So much for extra cycles!
 +The Tune:
 + Finally, the tune really does sound better at 50Hz, so a 5/6
 +delay needs to be added. ​ There is some free memory at $09xx which is
 + $09E0 DEC $09FF
 + BEQ *+3
 + JMP $211C ;Play tune
 + LDA #$05 ;Skip tune every 6th frame
 + STA $09FF
 + RTS
 +As the comments say, it simply skips every sixth frame of the tune, to
 +give it an effective play rate of 50Hz.  It works!
 + Unfortunately,​ there are still a few glitches in the fix.  The
 +bottom rasters have some flicker in the upper right (and occasionally upper-
 +left) corners. ​ Pretty minor stuff, but still annoying. ​ Fixing means
 +rewriting the raster loop.
 + Much worse, though, is that the screen will glitch badly when
 +certain sprite scroll effects are activated. ​ This glitch seems to involve
 +quite a lot of rasters, since even with the tune disabled glitches are
 +still evident. ​ One choice is to reduce the number of bottom rasters,
 +but that doesn'​t look very good.  Another is to eliminate effects, but
 +that's no good.  A third is to do a big code rewrite -- bleah. ​ But the
 +easiest is to just call it a 95% fix and say "good enough"​. ​ Good enough.
 + And there it is -- Our First Fix.  Hope u lik3 1t, d00dZ! ​ Next 
 +time, we'll have a look at some of the routines which weren'​t covered this 
 +time, such as fastload and FLI routines. ​ Until then, have fun fixin!
 +.                                    C=H #17
 +begin 644 slowideas-ntsc
 +The Herd Mentality as gleaned from the musings of Bil Herd
 +This is sort of a story that was really a bunch of posts I did on compuserve ​
 +back in 1993. The language is coarse, the English malformed and if you'​re ​
 +a certain ex-Commodore employee you might be offended by the story. ​ In 
 +that case, the story is probably about someone else and not you; my 
 +fallback position would be that the whole story is a fabrication and meant
 +for entertainment,​ not to mention that worrying about things like 5 year 
 +stories written by burnt-out wackos is bad for your cholesterol. ​ As it 
 +turns out, this is about a sixth of the sillyness that went on, this is 
 +more about some of the events that centered on the 8563 chip which I was 
 +being asked about at the time.  Oh yeah, the content is also dated with 
 +a slight slant towards the melodramatic. ​
 +12-Jan-93 ​ 19:28:05
 +Coming soon to a terminal near you... the gruesome story of the chip that
 +almost ruined CES (and the C128 along with it)
 +EXPERIENCE the shame and horror of being a Chip Designer at Commodore during
 +the Witch hunts!
 +SEE the expressions on Managers faces when they realize that their Bonuses are
 +at stake!
 +HEAR the woeful lamenting of the programmers as they are beaten for no
 +apparent reason!
 +SHARE the experience of being a Hardware Engineer... stalking the halls in
 +search of programmers to beat (for no apparent reason).
 +LEARN how to say "THIS CHIP ONE SICK PUP" in Japanese.
 +Find out just how badly busted up the 80 column chip was and how many DIRTY
 +fixes were needed to make that all crucial show in Vegas on January 6.
 +(Christmas, what Christmas). .
 +<Said rather coyly in an attempt to elicit any positive responses>​ Unless of
 +course no one is interested..... :) Bil
 +14-Jan-93 ​ 15:37:59
 +This is the first of many parts as this thing went round and round during our
 +mad dash to make the CES show. I don't even remember what year it was. The
 +8563 was a holdover from the Z8000 based C900 (the "​Z"​ machine as we called
 +it). The people who worked on it were called the "​Z"​ people, the place they
 +hung out was called the "​Z"​ lounge and well.... you get the idea.
 +The most interesting thing that came out of that group besides a disk
 +controller that prompted you for what sector and cylinder you'd like to write
 +to on every access, was one day they stole the furniture out of the lobby and
 +made their own lounge disguising it as a VAX repair depot. We were so
 +amused by this that we stopped teasing them for a week.  (But I become
 +Now the very very very early concept of the C128 was based on the D128, a
 +6509 based creature (boo... hiss). ​ The engineers on the project had tacked a
 +VIC chip onto the otherwise monochrome (6845 based) in an effort to add some
 +color to an otherwise drab machine. No one dreamed that C64 compatibility was
 +possible so no one thought along those lines. I was just coming off of
 +finishing the PLUS 4 (before they added that AWFUL built in software to it)
 +and even though I had done exactly what I was told to do I was not happy with
 +the end result and had decided to make the next machine compatible with
 +_something_ instead of yet another incompatible CBM machine. (I won't go into
 +the "yes Virginia there is Compatibility"​ memo that I wrote that had the
 +lawyers many years later still chuckling, suffice it to say I made some
 +fairly brash statements regarding my opinion of product strategy)
 +Consequently,​ I was allowed/​forced to put my money where my mouth was and I
 +took over the C128 project.
 +I looked at the existing schematics once and then started with a new design
 +based on C64ness. The manager of the chip group approached me and said they
 +had a color version of the 6845 if I was interested in using it it would
 +definitely be done in time having been worked on already for a year and a
 +half...... And so the story begins..... (to be continued)
 +16-Jan-93 ​ 19:06:28
 +Looking back I realize that the source of a lot of the problems with the 8563
 +is that it wasn't designed FOR the C128 and that the IC designers did not
 +take part in the application of their chip the way the other designers did.
 +The VIC and MMU designers took an active interest in how their chip was used
 +and how the system design worked in relation to their chip. I overlooked
 +ramifications of how the 8563 was spec'​ed to work that came back to haunt me
 +later. For example, it was explained to me how there was this block transfer
 +feature for transferring characters for things like scrolling. ​ Cool.... we
 +need that. Later it would turn out when this feature finally did work
 +correctly that it only was good for 256 characters at a time. 256 characters
 +at a time. 256 characters at a time?? I never stopped to think to ask if the
 +feature was semi-useless because it could only block move 3 and 1/3 lines at
 +a time. Did I mention the block move was only good for 256 characters? ​ Later
 +a bug in this feature would almost prove a show stopper with a serious
 +problem showing up in Vegas the night of setup before the CES show. But I get
 +ahead of myself. It was also my understanding that this part had the same
 +operating parameters as the 6845, a VERY common graphics adapter. ​ Not
 +scrutinizing the chip for timing differences the way I normally did any new
 +chip was another mistake I made.  The major timings indicated what speed
 +class it was in and I didn't check them all. I blame myself as this really is
 +the type of mistake an amateur makes. ​ I wonder if I was in a hurry that day.
 +16-Jan-93 ​ 19:06:39
 +It turns out that a major change had been made to the way the Read/Write line
 +was handled. When I asked about this, VERY late in the design cycle, like in
 +Production when this problem turned up, I was told "​remember,,​ this was
 +designed to work in the Z8000 machine."​ ???!!!! ????!!!! Shoulda seen the
 +look on my face! Even though the Z8000 machine was long dead and we had been
 +TRYING for 6 months to use this damn thing in the C128 I'm being told NOW
 +that you didn't design it to work the way we've been using it for 6 months?
 +Shoulda asked.... it was my fault, shoulda asked "is this meant to work"​.....
 +Don't get me wrong, the designer was VERY bright, he held patents for some of
 +the "​cells"​ in the Motorola 68000. It just that chip had to work in
 +conjunction with other chips and that's where some of the problems lay.  Our
 +story opens as Rev 0 of the chip.... (what'​s that..... doesn'​t work.... OK,)
 +Our story opens as Rev 1 of the chip makes its debut and ......(pardon me a
 +moment.....) Our story opens as Rev 2 of the chip makes it debut..... <to be
 +19-Jan-93 ​ 20:50:41
 +Forgive the sporadic nature of these additions. ​ Now where was I .... oh
 +yeah.... It was sometime in September when we got 8563 Silicon (or so memory
 +serves) good enough to stick in a system. I can't remember what all was wrong
 +with the Chip but one concern we had was it occasionally (no spell checker
 +tonight, bear with me) blew up.... big time.... turn over die and then smell
 +bad..... But then all of the C128 prototypes did that on a semi regular basis
 +as there wasn't really any custom silicon yet, just big circuit boards
 +plugged in where custom chips would later go... but you can't wait for a
 +system to be completed before starting software development. ​  I don't think
 +any of the Animals really gave it a thought until when the next rev of the
 +chip came out and now with less other problems the blowing up '​seemed'​ more
 +pronounced. ​ Also the prototypes got more solid _almost_ every day.  (I knew
 +to go check on the programmer'​s prototype whenever I heard the sound of cold
 +spray coming out of their office.... later it turned out they usually weren'​t
 +spraying the boards just using their "​Hardware Engineer"​ call.  Sometimes all
 +I had to do was touch the board in a mystical way and then back out slowly
 +sometimes accompanied by ritual like chanting and humming. ​ This became know
 +as the "​laying of hands"​. ​ This worked every time except one, and that time
 +it turned out I had stolen the power supply myself without telling them....
 +If anybody else got caught "​messing with my guys" they'd get duct taped to a
 +locker and then the box kicked out from under them leaving them stuck until
 +they could peel themselves down, but that's another story.) ​ ANYWAY, when this
 +problem still existed on Rev 4 (I think it was) we got concerned. ​ It was at
 +this time that the single most scariest statement came out of the IC Design
 +section in charge of the '​63. ​ This statement amounted to "​you'​ll always have
 +some chance statistically that any read or write cycle will fail due to
 +19-Jan-93 ​ 21:12:05
 +Synchronicity problems occur when two devices run off of two separate clocks,
 +the VIC chip hence the rest of the system, runs off of a 14.318Mhz crystal
 +and the 8563 runs off of a 16Mhz Oscillator. Now picture walking towards a
 +revolving door with your arms full of packages and not looking up before
 +launching yourself into the doorway. You may get through unscathed if your
 +timing was accidentally just right, or you may fumble through losing some
 +packages (synonymous to losing Data) in the process or if things REALLY foul
 +up some of the packages may make it through and you're left stranded on the
 +other side of the door (synonymous to a completely blown write cycle). What I
 +didn't realize that he meant was that since there'​s always a chance for a bad
 +cycle to slip through, he didn't take even the most rudimentary protection
 +against bad synchronizing. It's MY FAULT I didn't ask, "what do you mean
 +fully by that statement"​ because I'd of found out early that there was NO
 +protection. As it turns out the 8563 instead of failing every 3 years or so
 +(VERY livable by Commodore standards) it failed about 3 times a second. ​ In
 +other words if you tried to load the font all in one shot it would blow up
 +every time!  The IC designers refused to believe this up until mid December
 +(CES in 2-3 weeks!) because "their unit in the lab didn't do it." ​ Finally I
 +said "show us" and they led the whole rabble (pitch forks, torches, ugly
 +scene) down to the lab.  It turns out they weren'​t EVEN TESTING THE CURRENT
 +REV of the chip, (TWO revs old), they were testing it from Basic because it
 +"blew up" every time they ran it at system speeds (No %^$#%$# sherlock.
 +That's what we're trying to tell you) and even then it screwed up once and
 +the designer reached for the reset switch saying that something does
 +occasionally go wrong. ​ Being one of the Animals with my reflexes highly
 +tuned by Programmer Abusing I was able to snatch his arm in mid-air before he
 +got to the reset switch, with blatant evidence there on the test screen.
 +19-Jan-93 ​ 21:12:15
 +One of the rabble was their boss and (I have been speaking about two
 +designers interchangeably,​ but then they were interchangeable,​) the word
 +Finally came down "FIX IT"​. ​ Hollow Victory as there was only two weeks till
 +we packed for the show, and there were 4 or 5 other major problems (I'll say
 +more later) with the chip and NO time to do another pass. It was obvious that
 +if we were going to make CES something had to give. As Josey Wales said,
 +"Thats when ya gotta get Mean.... I mean downright plumb crazy Loco Mean".
 +And we knew we had to.  The programmer thrashing'​s hit a all time high shortly after.
 +22-Jan-93 ​ 14:17:32
 +Memory flash, I just remembered when we found out there was no interrupt
 +facility built in to the 8563.  I remember how patient the designer was when
 +he sat me down to explain to me that you don't need an interrupt from the
 +8563 indicating that an operation is complete because you can check the
 +status ANY TIME merely by stopping what you're doing (over and over) and
 +looking at the appropriate register, (even if this means banking in I/O) or
 +better yet sit in a loop watching the register that indicates when
 +an operation is done (what else could be going on in the system besides
 +talking to the 8563 ???)  Our running gag became not needing a ringer on the
 +phone because you can pick it up ANY TIME and check to see if someone'​s on
 +it, or better yet, sit at your desk all day picking the phone up.   Even in
 +the hottest discussions someone would suddenly stop, excuse himself, and pick
 +up the nearest phone just to see if there was someone on it.  This utterly
 +failed to get the point across but provided hours of amusement. ​ The owners
 +at the local bar wondered what fixation the guys from Commodore had with the
 +pay phone.
 +Any ways.... To back up to the other problems that plagued the 8563.  Going
 +into December a couple of things happened. ​ The design had been changed to
 +support a "​back-bias generator"​. ​ This thing is generally used to reduce
 +power consumption and speed the chip up.  Well, something was not quite right
 +somewhere in the design because the chip got worse. ​ The second thing that
 +happened was that both designers took vacation. ​ Nothing against that from my
 +point of view here 8-9 years in the future, but right then we couldn'​t
 +understand what these people were doing working on a critical project.
 +22-Jan-93 ​ 14:17:37
 +Or maybe I was just getting to used to eating Thanksgiving Dinner out of
 +aluminum foil off of a Lab Bench. ​ Christmas consisted of stopping at
 +someone'​s house who lived in the area for a couple of hours on the way home
 +from work.  Anyways, the chips could no longer display a solid screen. ​ The
 +first couple of characters on each line were either missing or tearing, until
 +the thing heated up, then they were just missing. ​ Also, the yield of chips
 +that even worked this good fell to where they only got 3 or 4 working chips
 +the last run.  A run is a Half-Lot at MOS and costs between $40,000 and
 +$120,000 to run.  Pretty expensive couple of chips.
 +The other problem takes a second to explain, but first a story..... ​ Back
 +when TED (the Plus four) had been mutilated decimated and defecated upon,
 +management decided to kick the body one last time.  "TED shall Talk" came the
 +decree and the best minds in the industry were sought... We actually did have
 +two of the most noted consumer speech people at the time, the guys who
 +designed the "TI Speak an Spell" worked out of the Commodore Dallas office.
 +They did a custom chip to interface a speech chip set to the processor.
 +Operating open loop, in other words without feedback from any of the system
 +design people (US) they defined the command registers. ​ There was a register
 +that you wrote to request a transfer. ​ To REALLY request the transfer you
 +wrote the same value a second time. We referred to this as the "do it, do it
 +now" register or the "come on pretty please"​ request, or my favorite, "those
 +#$%&@ Texans"​ register. ANYWAYS, the 8563 also had a problem where the 256
 +'​bite'​ transfer didn't always take place properly, leaving a character
 +behind. This ended up having the effect of characters scrolling upwards
 +22-Jan-93 ​ 14:17:45
 +So to recap, going into December we had a chip with .001% yield, the left
 +columns didn't work, anytime there was one pixel by itself you couldn'​t see
 +it, the semi useless block transfer didn't work right, the power supply had
 +to be adjusted for each chip, and it blew up before you loaded all of the
 +fonts unless you took 10 seconds to load the fonts in which case it blew up
 +only sometimes. Finger pointing was in High swing, (the systems guys should
 +have said they wanted WORKING silicon) with one department pitted against the
 +other, which was sad because the other hardworking chip designers had
 +performed small miracles in getting their stuff done on time. Managers
 +started getting that look rabbits get in the headlights of onrushing Mack
 +trucks, ​ some started drinking, some reading poetry aloud and the worst were
 +commonly seen doing both. Our favorite behavior was where they hid in their
 +offices. ​ It was rumored that the potted plant in the lobby was in line for
 +one of the key middle management positions. Programmer beatings had hit a new
 +high only to fall off to almost nothing overnight as even this no longer
 +quelled the growing tension. A sprinkler head busted and rained all over
 +computer equipment stored in the hallway. Engineering gathered as a whole and
 +watched on as a $100,000 worth of equipment became waterlogged,​ their
 +expressions much like the bystanders at a grisly accident who can't tear
 +their attention away from the ensuing carnage. ​ I can honestly say that it
 +didn't seriously occur to me that we wouldn'​t be ready for CES, for if it
 +had, I might have succumbed to the temptation to go hide in my office
 +(checking the telephone). ​ There were just too many problems to stop and
 +think what if.  Next time (hopefully) I'll try and bring all the problems and
 +answers together and explain why I stopped to tell that rather out of place
 +TED story.
 +30-Jan-93 ​ 19:27:11
 +No single custom chip was working completely as we went into December with
 +the possible exception of the 8510 CPU.  The MMU had a problem where data was
 +"​bleeding through"​ from the upper 64K bank into the lower. ​ This was in part
 +due to a mixup in the different revision of "​layers"​ that are used to make
 +chips. This chip essentially had one of the older layers magically appear
 +bring old problems with it.  Unfortunately,​ this older layer had been used to
 +fix newer problems so we didn't have a way to combine existing layers to fix
 +ALL problems. Dave D'Orio (start telling ya some of the names of a few of the
 +unsung types here) did a GREAT job of bringing most of the IC design efforts
 +together. ​ I was sitting with Dave in a bar, we were of course discussing
 +work, when he suddenly figured out what the problem was.  He had looked at
 +the bad MMU chip under a microscope that day.  Later that night, under the
 +influence of a few Michelobs, his brain "​developed"​ the picture his eyes had
 +taken earlier and he realized that an earlier layer had gotten into the
 +30-Jan-93 ​ 19:49:06
 +This would not be the first time a problem would be addressed at this
 +particular bar.  (The Courtyard.... If you ever saw the David Letterman where
 +the guy stops the fan with his tongue, he was a bartender there). The PLA had
 +a problem where my group had made a typo in specifying the hundred some terms
 +that comprised the different operating parameters. Well the designer in
 +charge of the PLA took this rev as an opportunity to sneak a change into the
 +chip without really going public with the fact he was making a change. ​ When
 +the change went through it caused one of the layers to shift towards one side
 +and effectively shorted the input pins together. Ya should'​ve seen the seen
 +where the designer'​s boss was loudly proclaiming that Hardware must of
 +screwed up because his engineer DIDN't make any changes (that would'​ve been
 +like admitting that something had been "​broken"​). You could tell by the way
 +the designer'​s face was slowly turning red that he hadn't yet found a way of
 +telling his boss that he had made a change. Talk about giving someone enough
 +rope to hang themselves, we just kept paying it out yard by yard.
 +30-Jan-93 ​ 19:53:45
 +Anyways back to the 8563.  The first problem was relatively easy to fix,
 +providing you didn't give a hang about your own self respect. ​ The 8563
 +designer mentioned that the block copy seemed to work better when you wrote
 +the same command twice in a row.  I made him explain this to me in public,
 +mostly due to the mean streak I was starting to develop when it came to this
 +particular subject. ​ He calmly explained that you merely wrote to this
 +register and then wrote to it again. ​ I asked "you mean do it and do it now?"
 +"​Exactly",​ the designer exclaimed figuring he was on the home stretch ​ to
 +understanding (Intel, at last his eyes unfurled), "kinda like a 'come on
 +pretty please register'​ I asked with my best innocent expression, "Well sort
 +of" he replied doubt creeping in to his voice, "you wouldn'​t be from Texas
 +would you", I asked my face the definition of sincerity, (said in the voice
 +of the wanna-be HBO director on the HBO made for TV commercial) "why yes....
 +yes I am" he replied. Mind you a crowd had formed by this time, that poor guy
 +never understood what was so funny about being from Texas or what a 'Damm
 +Texan' register was.
 +30-Jan-93 ​ 19:53:50
 +This '​fix'​ actually did work some what, the only problem was that no one told
 +the guy (Von Ertwine) who was developing CP/M at home (consultant). Von had
 +wisely chosen not to try to follow all of the current Revs of the 8563,
 +instead he latched onto a somewhat working Rev4 and kept if for software
 +development. ​ Later we would find out that Von, to make the 8563 work
 +properly, was taking the little metal cup that came with his hot air popcorn
 +popper (it was a buttercup to be exact) and would put an Ice cube in it and
 +set it on the 8563.  He got about 1/2 hour of operation per cube. On our side
 +there was talk of rigging cans of cold spray with foot switches for the CES
 +show, "​sparkle???​ I don't <​pissshhh>​ see any sparkle <​pissshhh>"​. Anyways,
 +no-one told Von.... but don't worry, he would find out the day before CES
 +during setup in '​Vegas.
 +    23-Oct-93 ​ 16:57:43
 +Sb: C128, The Final Chapter
 +Thought I'd finish what I'd started back in January of this year. I had been
 +talkin 'bout how busted up the 8563, now we get to the part about how it got
 +fixed... well fixed good enough... well patched good enough to give every
 +possible attempt at the appearance of maybe passably working...
 +One of the things that got worse instead of better was something called the
 +back bias generator. ​ Now as much as I admired the blind ambition (as opposed
 +to unmitigated gall... no one ever said it was unmitigated gall and I am not
 +saying that here and now) of slipping in a major change like that right
 +before a CES show, it became obvious that it needed fixed. ​ Now the back-bias
 +generator connects to the substrate of the chip and if you've ever seen the
 +ceramic versions of the 40 and 48 pin chips you would notice that the pin 1
 +indicator notch is gold colored. ​ That is actually a contact to the
 +substrate. ​ I have never heard of anyone ever soldering to the pin 1
 +indicator notch but I had little to lose.  At this point all I did have to
 +lose was a HUGE jar of bad 8563'​s. (One night a sign in my handwriting
 +"​appeared"​ on this jar asking "Guess how many working 8563's there are in the
 +jar and win a prize." ​ Of course if the number you guessed was a positive
 +real number you were wrong.) I soldered a wire between this tab and the
 +closet ground pin.  The left column reappeared though still a little broken
 +up!  The "​EADY"​ prompt now proudly stated that the machine was "​READY"​ and not
 +really proclaiming it's desire to be known as the shortened version of
 +Edward. ​ To  fix the remaining tearing we put 330 ohm pullups on the outputs
 +and adjusted the power supply to 5.3 volts. ​ This is the equivalent of
 +letting Tim-the-Tool-Man-Taylor soup up your blender with a chainsaw motor
 +but it worked. ​ The side effect was that it would limit the useful life of
 +the part to days instead of weeks as was the normal Commodore Quality
 +Standard. ​ I was afraid that this fix might be deemed worthy for production.
 +(said with the kind of sardonic cynical smile that makes parole officers
 +really hate their jobs)
 +Remember the synchronicity problem? ​ Remember the revolving door analogy? ​ We
 +built a tower for the VIC chip that had something called a Phase Lock Loop on
 +it which basically acted as a frequency doubler. ​ This took the 8.18 Mhz Dot
 +Clock (I think it was 8.18 Mhz.... been too long and too many other dot clock
 +frequencies since then) and doubled it.  We then ran a wire over to the 8563
 +and used this new frequency in place of its own 16 Mhz clock. ​ Now this is
 +equivalent to putting a revolving door at the other end of the room from the
 +first door and synchronizing them so that they turn at the same rate.  Now if
 +you get through the first door and walk at the right speed every time towards
 +the second door you will probably get through. ​ This tower working amounted
 +to a True Miracle and was accompanied by the sound of Hell Freezing over, the
 +Rabbit getting the Trix, and several instances of Cats and Dogs sleeping
 +together. This was the first time that making CES became a near possibility.
 +We laughed, we cried, we got drunk. ​ So much in hurry were we that the little
 +3" X 3" PCB was produced in 12 hours (a new record) and cost us about $1000
 +A new problem cropped up with sparkle in multi-colored character mode when
 +used for one of the C64 game modes. ​ Getting all too used to this type of
 +crises , I try a few things including adjusting the power supply to 4.75
 +volts. ​ Total time-to-fix,​ 2 minutes 18 seconds, course now the 80 column
 +display was tearing again. Machines are marked as to whether they can do 40
 +column mode, 80 column mode or both. We averaged 1-3 of these crises a day
 +the last two weeks before CES.  Several of us suffered withdrawal symptoms if
 +the pressure laxed for even a few minutes. ​ The contracted security guards
 +accidentally started locking the door to one of the development labs during
 +this time.  A hole accidentally appeared in the wall allowing you to reach
 +through and unlock it.  They continued to lock it anyways even though the
 +gaping hole stood silent witness to the ineffectiveness of trying to lock us
 +out of our own lab during a critical design phase. ​ We admired this
 +singleness of purpose and considered changing professions.
 +We finished getting ready for CES about 2:00 in the morning of the day we
 +were to leave at 6:00.  On the way to catch the couple of hours sleep I hear
 +the Live version of Solsbury Hill by Peter Gabriel, the theme song of the
 +C128 Animals and take this as a good omen. Several hapless Programmers are
 +spared the ritual sacrifice this night... little do they know they owe their
 +lives to some unknown disc jockey.
 +Advertisements in the Las Vegas airport and again on a billboard enroute from
 +the airport inform us that the C128 has craftily been designed to be
 +expandable to 512K.  Now it had been designed to be expandable originally and
 +had been respecified by management so as to not be expandable in case next
 +year's computer needed the expendability as the "​New"​ reason to buy a
 +Commodore computer. That's like not putting brakes on this years model of car
 +so that next year you can tote the New model as reducing those annoying
 +head-on crashes.
 +Upon arriving at the hotel we find that out hotel reservations have been
 +canceled by someone who fits the description of an Atari employee. ​ Three
 +things occur in rapid succession. ​ First I find the nearest person owning a
 +credit card and briskly escort her to the desk were I rented a room for all
 +available days, second, a phone call is placed to another nearby hotel
 +canceling the room reservations for Jack Trameil and company, third, several
 +of those C64's with built in monitors (C64DX'​s???​ man it's been too long) are
 +brought out and left laying around the hotel shift supervisors path
 +accompanied by statements such as "My my, who left this nifty computer laying
 +here... I'd bet they wouldn'​t miss it too much".
 +The next day we meet up with the guy who developed CPM (Von) for the C128.
 +As I mentioned earlier, someone forgot to tell him about the silly little
 +ramifications of an 8563 bug.  His 'puter didn't do it as he had stopped
 +upgrading 8563s on his development machine somewhere around Rev 4 and the
 +problem appeared somewhere around Rev 6.  As Von didn't carry all the
 +machinery to do a CPM rebuild to fix the bug in software, it looked like CPM
 +might not be showable. ​ One third of the booth'​s design and advertising was
 +based on showing CPM.  In TRUE Animal fashion Von sat down with a disk editor
 +and found every occurrence of bad writes to the 8563 and hand patched them.
 +Bear in mind that CPM is stored with the bytes backwards in sectors that are
 +stored themselves in reverse order. Also bear in mind that he could neither
 +increase or decrease the number of instructions,​ he could only exchange them
 +for different ones.  Did I mention hand calculating the new checksums for the
 +sectors? ​ All this with a Disk Editor. I was impressed.
 +Everything else went pretty smooth, every supply was adjusted at the last
 +moment for best performance for that particular demo.  One application has
 +reverse green (black on green) and the 330 ohm pullups won't allow the
 +monitor to turn off fast enough for the black characters. ​ I had had
 +alternate pullup packs made up back in West Chester and put them in to
 +service. ​ On the average,2 almost working 8563's would appear each day, hand
 +carried by people coming to Vegas. Another crisis, no problem, this was
 +getting too easy.  If a machine started to sparkle during the demo, I would
 +pull out my ever present tweak tool and give a little demonstration as to the
 +adjustability of the New Commodore power supplies. ​ People were amazed by
 +Commodore supplies that worked, much less had a voltage adjustment and an
 +externally accessible fuse.  I explained (and meant it) that real bad power
 +supplies with inaccessible fuses were a thing of Commodore'​s past and that
 +the New design philosophy meant increased quality and common sense.
 +I'm told they removed the fuse access from production units the month after I
 +left Commodore.
 +The names of the people who worked on the PCB layout can be found on
 +the bottom of the PCB.
 +The syntax refers to an inside joke where we supposedly gave our lives in an
 +effort to get the FCC production board done in time, after being informed
 +just the week before by a middle manager that all the work on the C128 must
 +stop as this project has gone on far too long.  After the head of Engineering
 +got back from his business trip and inquired as to why the C128 had been put
 +on hold, the middle manger nimbly spoke expounding the virtues of getting
 +right on the job immediately and someone else, _his_ boss perhaps, had made
 +such an ill suited decision. ​ The bottom line was we lived in the PCB layout
 +area for the next several days.  I slept there on an airmatress or was
 +otherwise available 24 hours a day to answer any layout questions. ​ The
 +computer room was so cold that the Egg Mcmuffins we bought the first day were
 +still good 3 days later.
 +About the Z80:
 +What court ordered Commodore to install the Z80?
 +It wasn't mandated by court order, it was mandated by a 23 year old
 +engineer that realized that marketing had gone and said that we were 100%
 +compatible. ​ This turned out to be a hard nut to crack as no-one knew
 +what C64 compatibility meant. ​ Companies who designed cartridges for the
 +C64 used glitches to clock their circuitry not realizing that the glitches
 +were not to be depended on, etc.
 +The Z80/CPM cartridge didn't work on all C64's, and no-one had really taken
 +the time to figure out why.  Someone noticed that a certain brand of the
 +address buffer used in the CPM cart worked better than others so someone
 +concluded that it must be the timing parameters that made a difference.
 +This wasn't true, it was a very subtle problem that dealt with the way the
 +6502, the Z80 and the DRAM had been interlaced together. ​ So here we had
 +a CPM cart that didn't work with all C64's and it worked even
 +less reliably with the C128 even though the timing parameters in the C128
 +were far better. ​ In my opinion you couldn'​t call the C128 compatible
 +with the CPM cart as it only ran 20% of the time when tested overnight.
 +ALSO, I worked hard to make sure the C128 had a reliable power supply. ​ I
 +was told "no fuse'​..... oops one got in there by accident... in fact it
 +was easily accessible... darn it anyway. ​ However, with the wide
 +variations in minimum and maximum power supply requirements we couldn'​t
 +handle the CPM cart, it needed an additional .5 amp because of some
 +wasteful power techniques that were used in it.  I couldn'​t foot the
 +bill for an additional .5 amp that might only occasionally be used.
 +SO, with that said, I accidentally designed the Z80 into the next rev of
 +the board. ​ We designed the C128 in 6 months from start to finish
 +INCLUDING custom silicon, these were records back then, the Z80 was added
 +around the second month.
 +The Z80 normally calls for a DRAM cycle whenever it needs one... it might
 +go 3 clocks and then 4 and then 6 and then 5 and then 7  between dram
 +accesses. Since the processor shares the bus with the VIC chip there are
 +only certain time when the bus is available for a DRAM cycles. ​ Since the
 +shortest cycle for a DRAM access for the Z80 is 3 clock cycles, you are
 +sure to catch a DRAM access if you do 2 cycles (wait for vic) then 2
 +cycles (wait for vic).  Whether you catch the Z80 between clock 1 and 2 or
 +between 2 and 3 doesn'​t matter due to the special circuitry in the design.
 +Otherwise if you just let the Z80 rip it crashes when it tries to grab
 +DRAM while there is a video(vic) cycle going on.  And that's why it runs
 +at a clock-stretched 2MHz.  The REAL bitch was the Ready circuitry when 
 +flipping between DMA/​6502/​Z80.
 +The C128 design team: SYS32800,​123,​45,​6
 +Bil Herd       ​Original design and Hardware team leader.
 +Dave Haynie ​   Integration,​ timing analysis, and all those dirty
 +              jobs involving computer analysis which was something
 +              totally new for CBM.
 +Frank Palaia ​  One of three people in the world who honestly knows
 +              how to make a Z80 and a 6502 live peacefully with
 +              each other in a synchronous,​ dual video controller,
 +              time sliced, DRAM based system.
 +Fred Bowen     ​Kernal and all system like things. ​ Dangerous when
 +              cornered. ​ Has been known to  brandish common sense
 +              when trapped.
 +Terry Ryan     ​Brought structure to Basic and got in trouble for
 +              it. Threatened with the loss of his job if he ever
 +              did anything that made as much sense again. ​ Has
 +              been know to use cynicism in ways that violate most
 +              Nuclear Ban Treaties.
 +Von Ertwine ​   CPM.  Sacrificed his family'​s popcorn maker in the
 +              search of a better machine.
 +Dave DiOrio ​   VIC chip mods and IC team leader. ​ Ruined the theory
 +              that most chip designers were from Pluto.
 +Victor ​        MMU integration. Caused much dissention by being one
 +              of the nicest guys you'd ever meet.
 +Greg Berlin ​   1571 Disk Drive design. ​ Originator of Berlin-Speak.
 +              I think of Greg every night. ​ He separated my
 +              shoulder in a friendly brawl in a bar parking lot
 +              and I still cant sleep on that side.
 +Dave Siracusa ​ 1571 Software. ​ Aka "The Butcher"​
 +Not to mention the 8563 designers who made this story possible.
 +.                                    C=H #17
 +       -fin-
magazines/chacking17.txt · Last modified: 2015-04-17 04:34 (external edit)