+<code>
+                   ########
+             ##################
+         ######            ######
+      #####
+    #####  ####  ####      ##      #####   ####  ####  ####  ####  ####   #####
+  #####    ##    ##      ####    ##   ##   ##  ###     ##    ####  ##   ##   ##
+ #####    ########     ##  ##   ##        #####       ##    ## ## ##   ##
+#####    ##    ##    ########  ##   ##   ##  ###     ##    ##  ####   ##   ##
+#####  ####  ####  ####  ####  #####   ####  ####  ####  ####  ####   ######
+#####                                                                    ##
+ ######            ######           Issue #19
+   ##################             May 29, 2000
+       ########			 (Memorial Day)
+...............................................................................
+			    Seek, and ye shall find.
+		           Ask, and it shall be given.
+...............................................................................
+BSOUT
+	C'mon, it's only, what, 9 months late?
+	Many of you, I am sure, have been wondering, "Is C=Hacking still
+alive?  Has he lost interest?"  The respective answers are yes, and no.
+				- BUT -
+	Although I have not lost interest in the 64, I have lost a lot of
+free time I once had, and I am now able to pursue a lot of other interests!
+So the total time allocated to the 64, and hence to C=Hacking, has
+decreased considerably.  Work on this issue actually began last summer,
+around August or September.  But work on jpx began about the same time,
+followed by work on Sirius, and I devoted my C64 time to them instead of
+C=Hacking.  Then work intensified at work, and work began on a garage, and
+a plane, and... well, you get the idea.  Poor issue #19 just got worked on
+in little dribbles every few weeks.
+	The main reason I share this sad tale is that, the way I see it,
+C=Hacking could use a little help, if it is to come out more frequently.
+If nobody volunteers it will still come out, but in exactly the way it
+does right now -- a little less frequently than it ought to.  Some of the
+more time-consuming tasks are: finding articles, reviewing (actually
+refereeing) articles, and collecting the latest news and tips.  Finding
+articles means finding people who are doing some nifty Commodore project,
+or talking someone into doing some nifty Commodore project.  Refereeing an
+article means reading the article carefully, making sure everything is
+technically correct, making suggestions for improvement, and so on.  And
+collecting news means being plugged into the system.
+	I have a few people I rely on for some of these things, but I
+could use more, and if you'd like to help out (especially finding new
+articles, or keeping up to date on the latest C64 news) please drop me
+an email.
+	With that out of the way, brother Judd would like to preach on
+a malaise that afflicts the C64 world and which has been getting worse:
+Not Finishing The Job.  I just think about all the promising projects
+I've heard about over the last few years -- off the top of my head I remember
+a SCPU game, a SCPU monitor, several demos, multiple utilities, a VDC code
+library, several OSes... -- which were Almost Done.  And where are they
+now?  Presumably, still Almost Done.  So if you have a project which is
+Almost Done, but has been sitting around for the last few months/years...
+please, please finish up that last 10% and release it.
+	We, the technical community, are a community.  We draw strength
+from each other, we get ideas and motivation from each other, and we
+push each other to do great things.  It's a big feedback loop, where
+activity stimulates more activity, and decreased activity begets yet less
+activity.  I suppose C=Hacking serves as a prime example of this.
+	I'm not saying we're on the verge of a big programming renaissance,
+but I am concerned that we are drying up.  Maybe if people finish up those
+programs lying around it will reverse the trend.  (I mean, hey, doesn't
+this finally finished-up issue make you want to go out and do cool stuff?)
+	In other news, The Wave seems to be testing out wonderfully and
+is totally cool.  In case you've been under a rock these past few months
+The Wave is an integrated TCP/IP suite for Wheels -- telnet, graphical
+web browser, PPP, the works.  Lots of people have been beta-testing it
+for several months now and it is solid.  Outstanding.
+	I was asked lo these many months ago to put in a plug for
+	http://www.6502.org
+which is run by Mike Naberezny (mnaberez@nyx.net).  He is looking for
+comments, suggestions, and maybe even contributions, so drop him a line
+and tell him what you think.
+	The ever-resourceful Pasi Ojala has several new thingies on his
+web site.  This is probably ancient history by now but it's in my "latest
+news" file, sooo...
+1) a voice-only copy of the Amiga Expo 1988 presentation by R.J.Mical
+   about the early years of Amiga is available in four parts as .mp3
+   from http://www.cs.tut.fi/~albert/Dev/
+        (24kbit/s, 16kHz, mono, ~20MB total, over 100 minutes)
+   Includes facts and fiction and funny stories about the making of
+   the Amiga. The files may change location in the future but you
+   will find links to them from my page. Enjoy!
+2) Some VIC20 graphics are also available at
+        http://www.cs.tut.fi/~albert/Dev/VicPic/
+   There is one picture which can be viewed with unexpanded VIC20
+   (with 154x/7x or 1581 drive) and others for 8k-expanded
+   machine. Both PAL and NTSC versions are available.
+   There are also gif versions of the pictures on the page.
+	Myke Carter (mykec@delphi.com) has developed a filter program
+that allows C=Hacking to be converted to geoWrite format.  Thus, if
+you'd like a geoWrite version of C=Hacking, send him some email!
+Finally, this is Memorial Day here in the States, and I'd just like to
+suggest folks take a little time to think about the purpose of this holiday
+and why we have it.
+Okay then, enough with the jabber, and on to hacking excellence.
+.......
+....
+..
+.                                    C=H 19
+::::::::::::::::::::::::::::::::::: Contents ::::::::::::::::::::::::::::::::::
+BSOUT
+	o Voluminous ruminations from your unfettered editor.
+Jiffies
+	o Things.  And stuff.
+Side Hacking
+	o "Burst Fastloader for the C64", by Pasi Ojala <albert@cs.tut.fi>.
+	  The 128 can burst-load from devices such as the 1571 and 1581.
+	  With a small hardware modification, the C64 can too -- as it was
+	  originally designed for.  This article discusses the modification
+	  along with example burstload code.
+	o "8000's User Port & Centronics Printers", by Ken Ross
+	  <petlibrary@bigfoot.com>.  This article describes the user port
+	  on the PET 8000, including a demonstration BASIC program for
+	  sending data to e.g. a Centronics printer via the user port.
+Main Articles
+	o "Sex, lies, and microkernal-based 65816 native OSes, part 1",
+	  by Jolse Maginnis <jmaginni@postoffice.utas.edu.au>.  It's time
+	  to learn about OS design and design philosophy.  This article
+	  starts with OS basics and ends with JOS innards.  (JOS, in case
+	  you've been under a rock the past few months, is a rather cool
+	  multitasking 65816 OS which can do some rather cool things).
+	o "VIC-20 Kernel ROM Disassembly Project", by Richard Cini
+	  <rcini@email.msn.com>
+	  And on we go to article three in the series.  This article continues
+	  the investigation of the IRQ and NMI routines -- specifically,
+	  the routines called by those routines (UDTIM, SCNKEY, etc.).
+	o "JPEG: Decoding and Rendering on a C64", by S. Judd <sjudd@ffd2.com>
+	  and Adrian Gonzalez <adrianglz@globalpc.net>.  Actually it's
+	  two articles:
+	  "Decoding JPEGs".  This article covers the basics and details of
+	  JPEG encoding and decoding, with special attention to the IDCT,
+	  and some related C64 issues.
+	  "Bringing 'true color' images to the 64".  This article discusses
+	  Floyd-Steinberg dithering, and how the IFLI graphics in jpz are
+	  rendered.
+.................................. Credits ...................................
+Editor, The Big Kahuna, The Car'a'carn..... Stephen L. Judd
+C=Hacking logo by.......................... Mark Lawrence
+Special thanks to the folks who have helped out with reviewing and such,
+and to the article authors for being patient!
+Legal disclaimer:
+1) If you screw it up it's your own fault!
+2) If you use someone's stuff without permission you're a dork!
+About the authors:
+Jolse Maginnis is a 20 year old programmer and web page designer,
+currently taking a break from CS studies.  He first came into contact
+with the C64 at just five or six years of age, when his parents brought
+home their "work" computer.  He started out playing games, then moved on
+to BASIC, and then on to ML.  He always wanted to be a demo coder, and in
+met up with a coder at a user's group meeting, and has since worked
+on a variety of projects from NTSC fixing to writing demo pages and intros
+and even a music collection.  JOS is taking up all his C64 time and he
+is otherwise playing/watching sports, out with his girlfriend, or at a
+movie or concert somewhere.  He'd just like to say that "everyone MUST
+buy a SuperCPU, it's the way of the future" and that if he can afford
+one, anyone can!
+Richard Cini is a 31 year old vice president of Congress Financial
+Corporation, and first became involved with Commodore 8-bits in 1981, when
+his parents bought him a VIC-20 as a birthday present.  Mostly he used it
+for general BASIC programming, with some ML later on, for projects such as
+controlling the lawn sprinkler system, and for a text-to-speech synthesizer.
+All his CBM stuff is packed up right now, along with his other "classic"
+computers, including a PDP11/34 and a KIM-1.  In addition to collecting
+old computers Richard enjoys gardening, golf, and recently has gotten
+interested in robotics.  As to the C= community, he feels that it
+is unique in being fiercely loyal without being evangelical, unlike
+some other communities, while being extremely creative in making the
+best use out of the 64.
+Adrian Gonzalez is a 26 year old system/network administrator for an ISP
+serving Laredo, TX and Nuevo Laredo, Mexico.  He and his brother convinced
+their parents to buy them a C64 in 1984, and whereas his brother moved on
+to PCs he stuck with the 64 and later bought an Amiga.  He learned BASIC
+programming in sixth grade and wrote a few BASIC programs for the family
+business; since then Adrian has put several demos and utilities under his
+belt.  In addition to fancy graphics and music, Adrian has an interest
+in copy protection schemes (and playing the occasional game, of course).
+When he's not coding, he's either playing basketball, playing piano,
+editing videos, or going out to movies/parties.  You can visit his web
+page at http://starbase.globalpc.net/c64/main.html for more info.
+For information on the mailing list, ftp and web sites, send some email
+to chacking-info@jbrain.com.
+While http://www.ffd2.com/fridge/chacking is the main C=Hacking homepage,
+C=Hacking is available in many other places, including
+	http://www.funet.fi/pub/cbm/magazines/c=hacking/
+	http://metalab.unc.edu/pub/micro/commodore/magazines/c=hacking/
+................................... Jiffies ..................................
+$FFC6
+I actually have a little Jiffy that I 'discovered' recently.  It's one of
+those things that is so obvious and simple that it took me several tries
+before I stumbled onto it.  It also highlights a rather powerful feature
+of the lowly C64 kernal.
+Not long ago, I was asked to write a slideshow program for jpz.  Ideally,
+a slideshow program should be a "plug-in" for the regular viewer, which can
+load pictures from some list in a file.  But I didn't see a decent way to do
+this, especially for jpz which has maybe 200 bytes free total.  Then the
+thunderclap finally occurred.
+Everyone has used CMD4 to redirect a file to the printer.  But just as the
+kernal can redirect _output_ to different devices, it can redirect the
+_input_ to be from different devices, using CHKIN.  So all the slideshow
+program has to do is open a list of filenames, redirect input to that file,
+and execute the normal jpz.  jpz just uses JSR CHRIN to get data -- normally
+that data comes from the keyboard, but with CHKIN it comes from the file
+instead, akin to "a.out < input" in unix.  Since jpz doesn't close the file,
+calling jpz repeatedly will keep reading from the input file.
+	The result is a simple and effective slideshow program, and a trick
+which ought to be useful in other situations.  Here is the entire slideshow
+code, located at $02ae to be autobooting.  The main loop is seven lines long:
+*
+* Simple slideshow -- slj 4/2000
+*
+         org $02ae
+name     txt 'ssw.files'
+start
+         lda #start-name	;filename length
+         ldx #<name
+         ldy #>name
+         jsr $ffbd		;SETNAM
+         lda #3		;logical file 3, secondary address 3
+         tay
+         ldx $ba		;current device number
+         jsr $ffba		;SETLFS
+         jsr $ffc0		;OPEN
+         ldx #<main		;Modify JPZ to jump to main instead
+         ldy #>main		;of exiting
+         lda $10fb		;Check if jpy or jpz is in memory
+         cmp #$4c
+         bne :jpy
+         stx $10fc
+         sty $10fd
+         beq main
+:jpy     stx $10ed
+         sty $10ee
+main
+         ldx #3
+         jsr $ffc6		;CHKIN: take input from file 3
+         jsr $ffe4		;GETIN
+         lda $90		;loop until EOF reached
+         and #$40
+         bne :done
+         jmp $1000		;call jpz
+:done
+         lda #3
+         jsr $ffc3		;CLOSE
+         jsr $ffcc		;CLRCHN: restore default input/output
+         jmp $a474		;back to BASIC
+         da start
+         da start
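For readers who don't speak kernal, the same trick can be sketched in
Python.  This is only an analogy, not the jpz code: the "viewer" below is
a hypothetical stand-in that simply asks its input channel for a name, and
the slideshow wrapper redirects that channel to a list of names, much as
OPEN plus CHKIN do above.

```python
import io
import sys

def viewer():
    """Stand-in for the unmodified viewer: read one name from the input channel."""
    return sys.stdin.readline()

def slideshow(list_text):
    """Redirect input to a list of names and call the viewer until EOF."""
    shown = []
    saved = sys.stdin
    sys.stdin = io.StringIO(list_text)   # plays the role of OPEN + CHKIN
    try:
        while True:
            line = viewer()
            if line == "":               # empty read means end of the list
                break
            shown.append(line.strip())
    finally:
        sys.stdin = saved                # plays the role of CLRCHN
    return shown

print(slideshow("pic1.jpg\npic2.jpg\n"))
```

The viewer never learns that its input was redirected, which is the whole
point of the trick.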
+................................ Side Hacking ................................
+Burst Fastloader for C64 by Pasi Ojala, albert@cs.tut.fi
+------------------------
+   Commodore disk drives 1570/71 and 1581 implemented a new fast serial
+   protocol to be used with the C128 computer. This synchronous serial
+   protocol speeds up data transfer between the computer and the drive
+   ten-fold. The amazing thing is that this kind of serial protocol was
+   supposed to be used in the VIC-20 and the 1540 drive until it was
+   discovered that a hardware bug in the 6522 VIA (versatile interface
+   adapter) chip prevented the use of the chip's synchronous serial
+   interface.
+   The synchronous serial port would've allowed whole bytes to be sent in
+   both directions without processor intervention with the maximum speed
+   of one bit per two clock cycles. Without a bug-free synchronous serial
+   port the transfer had to be slowed down considerably so that the
+   receiver has a chance to detect all changes in the serial bus lines.
+   This became the dead slow software-driven Commodore serial protocol.
+  Synchronous Serial
+   The complex interface adapter (6526 CIA) chips used in the Commodore 64
+   and later in the Commodore 128 have bug-free synchronous serial
+   interfaces: serial data and serial clock inputs/outputs. In input
+   mode, each time a rising edge is detected in the serial clock pin
+   (CNT), the state of the serial data (SP) is shifted into a register.
+   When 8 bits are received the accumulated bits are moved into the
+   serial data register and a bit is set in the interrupt status register
+   to reflect this. If the corresponding interrupt is enabled, an
+   interrupt is generated.
+   In output mode the serial clock line is controlled by Timer A. The
+   serial clock is derived from the timer underflow pulses. When a byte
+   is written to the serial data register, the value is clocked out
+   through the serial data pin (SP) and the corresponding clock signal
+   appears on the serial clock pin (CNT). After all 8 bits are sent, the
+   serial interrupt bit is set in the interrupt status register.
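   The input-mode behaviour described above can be modelled in a few lines
   of Python. This is a toy sketch, not a cycle-accurate CIA: the class and
   function names are made up for illustration, the flag value 8 is taken
   from the interrupt mask used in the loader source further on, and the
   most-significant-bit-first order is an assumption of the sketch.

```python
class SerialPortModel:
    """Toy model of the CIA serial port in input mode."""
    SP_FLAG = 0x08          # serial port bit in the interrupt status register

    def __init__(self):
        self.shift = 0      # bits collected so far
        self.count = 0      # how many bits have been collected
        self.sdr = 0        # serial data register
        self.icr = 0        # interrupt status register
        self._cnt = 0       # last level seen on the CNT pin

    def clock(self, cnt, sp):
        """Present new CNT and SP levels; SP is sampled on a rising CNT edge."""
        rising = cnt == 1 and self._cnt == 0
        self._cnt = cnt
        if not rising:
            return
        self.shift = ((self.shift << 1) | (sp & 1)) & 0xFF
        self.count += 1
        if self.count == 8:             # a whole byte has arrived
            self.sdr = self.shift
            self.icr |= self.SP_FLAG
            self.shift = 0
            self.count = 0

    def read_icr(self):
        """Reading the status register returns the flags and clears them."""
        value, self.icr = self.icr, 0
        return value

def send_byte(port, value):
    """Clock one byte into the port, most significant bit first (assumed)."""
    for bit in range(7, -1, -1):
        sp = (value >> bit) & 1
        port.clock(0, sp)               # CNT low
        port.clock(1, sp)               # rising edge: the bit is sampled

port = SerialPortModel()
send_byte(port, 0xA5)
print(hex(port.sdr), port.read_icr())
```

   A program polling the port does what the last line does: it watches the
   status register for the serial bit and then reads the data register.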
+   The synchronous serial bus is used in the C128/157x/1581 fast serial
+   protocol. An obsolete signal in the peripheral serial bus (SRQ) was
+   taken into service as the new fast (synchronous) serial clock line. The
+   old serial data line doubles as the slow and fast serial data line, and
+   the old serial clock line doubles as the slow serial clock line and the
+   fast serial (byte) acknowledge line.
+   The fast serial protocol is basically very simple. The side sending
+   data configures its synchronous serial port into output mode, the
+   other side uses input mode. The old peripheral serial bus clock line
+   is controlled by the receiving side and is used as an acknowledge:
+   when the receiver is ready for data, it toggles the state of the clock
+   line. The actual data is transferred using the synchronous serial
+   ports. The sender writes the data to be sent into the serial data
+   register and waits for the transfer to complete. The receiver waits
+   for a byte to arrive into its serial data register. The actual
+   transfer is automatically handled by the hardware.
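   The acknowledge-then-transfer rhythm can be sketched as a small Python
   model. It is deliberately simplified and all names are hypothetical: the
   "hardware" transfer is a single assignment, and the sender reacts
   instantly to every toggle of the acknowledge line.

```python
class FastSerialLink:
    """Toy model: one acknowledge line plus a byte-wide data register."""

    def __init__(self, payload):
        self.pending = list(payload)    # bytes the sender still has to send
        self.ack = 0                    # state of the old serial clock line
        self.sdr = None                 # receiver's serial data register

    def toggle_ack(self):
        """Receiver: flip the clock line to signal 'ready for a byte'."""
        self.ack ^= 1
        self._sender_step()

    def _sender_step(self):
        """Sender: write the next byte to its data register; in this model
        the shift across the wire completes immediately."""
        self.sdr = self.pending.pop(0) if self.pending else None

    def receive(self):
        """Receiver: pick up the byte that arrived."""
        byte, self.sdr = self.sdr, None
        return byte

def transfer(payload):
    """Move every byte of payload across the link, one toggle per byte."""
    link = FastSerialLink(payload)
    received = []
    for _ in payload:
        link.toggle_ack()
        received.append(link.receive())
    return received, link.ack

print(transfer(bytes([1, 2, 3])))
```

   Note that the acknowledge line carries no level meaning here, only
   edges: every toggle, in either direction, asks for one more byte.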
+   Both the drive and the computer must detect whether the other side can
+   handle fast serial transfers. This is accomplished by sending a byte
+   using the synchronous serial port while doing handshaking. The drive
+   sends a fast serial byte when the computer sends a secondary address
+   (SECOND, which is called by e.g. CHKOUT). The computer can in practice
+   send its fast serial byte anytime after the drive is reset and before
+   the drive would send fast serial bytes.
+  Modification to c64
+   To use the burst fastloader with the C64 we need to connect the CIA
+   synchronous serial port to the synchronous serial lines of the
+   Commodore peripheral serial bus. Two wires are needed: one to connect
+   the serial bus data line to the synchronous serial port data line and
+   one to connect the serial bus SRQ (the obsolete line for service
+   request, now fast serial clock) to the synchronous serial port clock
+   line. Select the right connections depending on whether you want to
+   use CIA1 or CIA2.
+1570/1,1581                         C64
+Pin1    SRQ     Fast serial bus clk             CNT1/2  User port 4/6
+Pin5    DATA    Data - slow&fast bus            SP1/2   User port 5/7
+Top view - old c64, CIA1
+User port       Cass port       Serial connector
+||||||||||||    ||||||           HHHHH          behind:
+||||||||||||    ||||||         .-1 3 5-.
+       ||______________________|  2 4  |          / \
+       |        CNT1               6   |         // \\
+       |_______________________________|         |||||
+                SP1                             1 264 5
+Top view - old c64, CIA2
+User port       Cass port       Serial connector
+||||||||||||    ||||||           HHHHH          behind:
+||||||||||||    ||||||         .-1 3 5-.
+     ||________________________|  2 4  |          / \
+     |  CNT2                       6   |         // \\
+     |_________________________________|         |||||
+                SP2                             1 264 5
+   Solder the wires either to the resistor pack or directly to the user
+   port connector, but remember to leave the outer half of the connector
+   free so that you can still plug in your user port devices.
+   Then solder the other ends to the serial connector. Those left- and
+   rightmost pins are 1 and 5, respectively, so it is fairly easy to do
+   the soldering. You can also build a cable which connects those lines
+   externally.
+  Software for C64
+   Of course the C64 only uses the standard slow serial routines and we
+   need a separate fastloader routine to take advantage of the fast
+   serial connection we just soldered into our machine. The following
+   load routine is located in the unused area $2a7-$2ff and in the
+   cassette buffer $334-$3ff. Just load and run the "burster" program. It
+   installs the loader and replaces the default load routine by our
+   routine. The old load routine is used if
+     * a verify operation is requested
+     * a directory load operation is requested (filename starts with '$')
+     * the filename starts with a colon (':')
+   So, it is possible to use the old load routine by prepending a colon
+   (':') to the filename. This is needed if you need to use both fast and
+   slow serial devices at the same time. Unfortunately detecting
+   fast-serial-capable devices is not feasible, because a lot of ROM code
+   would have to be duplicated and then the loader would become too
+   large. Because of this it becomes the responsibility of the user to
+   prepend the colon (':') if a slow serial device is accessed.
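   The selection rule above fits in a few lines. The following Python
   sketch only restates the three conditions; the function name and
   signature are invented for illustration.

```python
def uses_old_loader(filename, verify=False):
    """Return True when the stock load routine should handle the request."""
    if verify:
        return True                     # verify is never done in burst mode
    return filename[:1] in ("$", ":")   # directory, or forced slow load

for name in ("game", "$", ":game"):
    print(name, uses_old_loader(name))
```

   So loading ":game" from a slow device goes through the old routine,
   while a plain "game" is sent to the burst loader.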
+   The fastloader is available in both CIA1 and CIA2 versions; uuencoded
+   executables of both are attached to this article. Only the CIA1
+   version is discussed here.
+; DASM V2.12.04 source
+;
+; Burst loader routine, minimal version to allow loading of programs up to 63k
+; in length ($400-$ffff). Directory is loaded with the normal load routine.
+;
+; (c)1987-98 Pasi Ojala, Use where you want, but please give me some credit
+;
+; This program needs SRQ to be connected to CNT1 and DATA to SP1 (CIA1).
+; Cassette drive won't work with those wires connected if the disk drive
+; is turned on. (SRQ is connected to cassette read line.)
+;
+; SRQ = Bidirectional fast clock line for fast serial bus
+; DATA= Slow/Fast serial data (software clocked in slow mode)
+;
+; In C128D (64-mode) you should use CIA2, because it has special hardware
+; which inhibits the use of CIA1 (or so I'm told).
+;
+; A short description of the burst protocol and commands can be found
+; in the "1581 Disk Drive User's Guide".
+        processor 6502
+        ORG $0801
+        DC.B $b,8,$ef,0 ; '239 SYS2061'
+        DC.B $9e,$32,$30,$36,$31
+        DC.B 0,0,0
+install:
+        ; copy first block to $2a7..$2ff
+        ldx #block1_end-block1-1        ; Max $58
+0$      lda block1,x
+        sta _block1,x
+        dex
+        bpl 0$
+        ; copy second block to $334..$3ff
+        ldx #block2_end-block2          ; Max $cc
+1$      lda block2-1,x
+        sta _block2-1,x
+        dex
+        bne 1$
+        lda $0330       ; load vector
+        ldx $0331
+        cmp #<MyLoad    ; already installed?
+        bne 2$
+        cpx #>MyLoad
+        beq 3$
+2$      sta OldVrfy+1   ; chain the old load vector
+        stx OldVrfy+2
+        lda #<MyLoad
+        sta $0330
+        lda #>MyLoad
+        sta $0331
+3$      rts
+block1
+#rorg $02a7
+_block1
+OldLoad lda #0
+OldVrfy jmp $f4a5       ; The 'normal' load.
+MyLoad: ;sta $93
+        cmp #0          ; Is it a prg-load-operation ?
+        bne OldVrfy     ; If not, use the normal routine
+        stx $ae         ; Store the load address
+        sty $af
+        tay             ; ldy #0
+        lda ($bb),y     ; Get the first char from filename
+        ldy $af
+        cmp #$24        ; Do we want a directory ($) ?
+        beq OldLoad     ; Use the old routine if directory
+        cmp #58         ; ':'
+        beq OldLoad
+        ; Activate Burst, the drive then knows we can handle it
+        sei             ; We are polling the serial reg. intr. bit
+        ldy #1          ; Set the clock rate to the fastest possible
+        sty $dc04
+        dey             ; = ldy #0
+        sty $dc05
+        lda #$c1
+        sta $dc0e       ; Start TimerA, Serial Out, TOD 50Hz
+        bit $dc0d       ; Clear interrupt register
+        lda #8          ; Data to be sent, and interrupt mask
+        sta $dc0c       ; (actually we just wake up the other end,
+0$      bit $dc0d       ;  so that it believes that we can do
+                        ;  burst transfers, data can be anything)
+        beq 0$          ; Then we poll the serial (data sent)
+        ; Clears the interrupt status
+        ; This program assumes you don't try to use it on a 1541
+        ; If you try anyway, your machine will probably lock up..
+        lda #$25        ; Set the normal (PAL) frequency to TimerA
+        sta $dc04       ; Change if you want to preserve NTSC-rate
+        lda #$40
+        sta $dc05
+        lda #$81
+        jmp LoadFile
+GetByte lda #8          ; Interrupt mask for Serial Port
+0$      bit $dc0d       ; Wait for a byte
+        beq 0$          ;  (Serial port int. bit changes, hopefully)
+        ;ldy $dc0c      ; Get the byte from Serial Port Register
+ToggleClk:
+        lda $dd00       ; Toggle the old serial clock (=send Ack)
+        eor #$10        ;  so that the disk  drive will start
+        sta $dd00       ;  sending the next byte immediately
+        ;tya            ; return the value in Accumulator, update flags
+        lda $dc0c       ; Get the byte from Serial Port Register
+        rts
+#rend
+block1_end
+block2
+#rorg $0334
+_block2
+LoadFile:
+        sta $dc0e       ; Start TimerA, Serial IN, TOD 50Hz (PAL)
+        ;cli
+        jsr $f5af       ; searching for ..
+        lda $b7         ; Preserve the filename length
+        pha
+        lda $b9         ; Do the same with secondary address
+        sta $a5         ; We store it to cassette sync countdown..
+                        ;  No cassette routines are used anyway, as
+        lda #0          ;  this prg is in cassette buffer..
+        sta $b7         ; No filename for command channel
+        lda #15
+        sta $b9         ; Secondary address 15 == command channel
+        lda #239
+        sta $b8         ; Logical file number (15 might be in use?)
+        jsr $ffc0       ; OPEN
+        sta ErrNo+1
+        pla
+        sta $b7         ; Restore filename length
+        bcs ErrNo       ; "device not present",
+                        ; "too many open files" or "file already open"
+        ; Send Burst command for Fastload
+        ldx #239
+        jsr $ffc9       ; CHKOUT Set command channel as output
+        sta ErrNo+1
+        bcs NoDev       ; "device not present" or other errors
+        ; Bummer, the interrupt status register bit indicating fast serial
+        ; will be cleared when we get here..
+        ldy #3
+3$      lda BCMD-1,y    ; Burst Fastload command
+        jsr $ffd2
+        dey
+        bne 3$
+        ; ldy #0
+1$      lda ($bb),y
+        jsr $ffd2       ; Send the filename byte by byte
+        iny
+        cpy $b7         ; Length of filename
+        bne 1$
+        jsr $ffcc       ; Clear channels
+        sei
+        jsr $ee85       ; Set serial clock on == clk line low
+        bit $dc0d       ; Clear intr. register
+        jsr ToggleClk   ; Toggle clk
+        jsr HandleStat  ; Get Initial status
+        pha             ; Store the Status
+        ;jsr $f5d2      ; loading/verifying
+        ; (uses CHROUT, which does CLI, so we can't use it)
+; We could add a check here..
+; if we don't have at least two bytes, we cannot read load address..
+; It seems that for files shorter than 252 bytes the 1581 does not count
+; the loading address into the block size.
+        jsr GetByte     ; Get the load address (low) - We assume
+                        ; that every file is at least 2 bytes long
+        tax
+        jsr GetByte     ; Get the load address (high)
+        tay             ; already in Y
+        lda $a5         ; The secondary address - do we use load
+                        ;  address in the file or the one given to
+        bne Our         ;  us by the caller ?
+        stx $ae         ; We use file's load addr. -> store it.
+        sty $af
+Our     ldx #252        ; We have 252 bytes left in this block
+        pla             ; Restore the Status
+        bne Last        ; If not OK, it has to be bytes left
+Loop    jsr GetAndStore ; Get X bytes and save them
+        jsr HandleStat  ; Handle status byte
+        beq Loop        ; If all was OK, loop..
+Last    tax             ; Otherwise it is bytes left. Do the last..
+        jsr GetAndStore ; Get X number of bytes and save them
+        jsr $ee85       ; Serial clock on (the normal value)
+        lda #239
+        jsr $ffc3       ; Close the command channel
+        clc             ; carry clear -> no error indicator
+        bcc End
+FileNotFound:
+        pla             ; Pop the return address
+        pla
+        jsr $ee85       ; Serial clock on (the normal value)
+        lda #4          ; File not found
+        sta ErrNo+1
+NoDev   lda #239
+        jsr $ffc3       ; Close the command channel
+ErrNo   lda #5          ; Device not present
+        sec             ; carry set -> error indicator
+End     ldx $ae         ; Loader returns the end address,
+        ldy $af         ;  so get it into regs..
+        cli
+        rts             ; Return from the loader
+HandleStat:
+        jsr GetByte     ; Get a byte (and toggle clk to start the
+                        ;  transfer for next byte)
+        cmp #$1f        ; EOI ?
+        bne 0$
+        jmp GetByte     ; Get the number of bytes to follow and RTS
+0$      cmp #2          ; File Not Found ?
+        bcs FileNotFound        ; file not found or read error
+        ; code 0 or 1 -> OK
+        ldx #254        ; So, the whole block is coming
+        lda #0          ; No error -> Z set
+        rts
+GetAndStore:
+        jsr GetByte     ; Get a byte & toggle clk
+        ;sta $d020
+        ldy #$34
+        sty 1           ; ROMs/IO off (hopefully no NMI:s occur..)
+        ldy #0
+        sta ($ae),y     ; Store the byte
+        ldy #$37
+        sty 1           ; Restore ROMs/IO (Should preserve the
+                        ;  state, but here it doesn't..)
+        inc $ae         ; Increase the address
+        bne 0$
+        inc $af
+0$      dex             ; X= number of bytes to receive
+        bne GetAndStore
+        rts
+BCMD:   dc.b $1f, $30, $55      ; 'U0',$1F == Burst Fastload command
+                                ; If $9F, Doesn't have to be a prg-file
+#rend
+block2_end
+   Now that was it. Now I just hold back and wait until someone
+   implements this for VIC-20's buggy 6522 chips so that I don't have
+   to.. :-)
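   For those who would rather read the block handling than trace it
   through the assembly, here is a simplified Python model of what
   HandleStat and the main loop do with the incoming byte stream. It is a
   sketch based only on the listing above (status 0 or 1 means a full
   block, $1f means a byte count follows, anything else is an error); the
   function names are invented, the file's own load address is always
   used, and no timing or handshaking is modelled.

```python
EOI = 0x1F      # status value the loader treats as "last block"

def read_status(stream):
    """Return (bytes_to_read, is_last_block) for the next block."""
    status = next(stream)
    if status == EOI:
        return next(stream), True       # next byte is the number of bytes left
    if status >= 2:
        raise IOError("drive reported status %d" % status)
    return 254, False                   # status 0 or 1: a whole block follows

def burst_load(raw):
    """Decode a burst fastload byte stream into (load_address, data)."""
    stream = iter(raw)
    count, last = read_status(stream)
    address = next(stream) | (next(stream) << 8)   # low byte, then high byte
    if not last:
        count = 252                     # two of the 254 were the load address
    data = bytearray()
    while True:
        for _ in range(count):
            data.append(next(stream))
        if last:
            return address, bytes(data)
        count, last = read_status(stream)

short_file = bytes([EOI, 2, 0x00, 0xC0, 9, 8])
address, data = burst_load(short_file)
print(hex(address), len(data))
```

   As in the assembly, a short file's byte count is taken as not including
   the two load address bytes.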
+begin 644 burster-cia1
+M`0@+".\`GC(P-C$```"B5[U"")VG`LH0]Z+'O9D(G3,#RM#WK3`#KC$#R:S0[
+M!.`"\!"-J@*.JP*IK(TP`ZD"C3$#8*D`3*7TR0#0^8:NA*^HL;NDK\DD\.K)Y
+M.O#F>*`!C`3QNR#2_\C$M]#V(,S_R
+M>""%[BP-W"#S`B#,`T@@[`*J(.P"J*6ET`2&KH2OHOQHT`@@WP,@S`/P^*H@@
+MWP,@A>ZI[R##_QB0$FAH((7NJ02-Q`.I[R##_ZD%.*:NI*]88"#L`LD?T`-,A
+G[`+)`K#:HOZI`&`@[`*@-(0!H`"1KJ`WA`'FKM`"YJ_*T.A@'S!5/
+``
+end
+size 354
+begin 644 burster-cia2
+M`0@+".\`GC(P-C$```"B2[U"")VG`LH0]Z+)O8T(G3,#RM#WK3`#KC$#R:S0E
+M!.`"\!"-J@*.JP*IK(TP`ZD"C3$#8*D`3*7TR0#0^8:NA*^HL;NDK\DD\.K)Y
+M.O#F>*`!C`3=B(P%W:G!C0[=+`W=J0B-#-TL#=WP^TPT`ZD(+`W=\/NM`-U)T
+M$(T`W:T,W6"I@(T.W2"O]:6W2*6YA:6I`(6WJ0^%N:GOA;@@P/^-Q@-HA;>PZ
+M:Z+O(,G_CZI!(W&`ZGO(,/_J04XIJZDKUA@(.`"R1_0`TS@`LD"L-JB_JD`H
+=8"#@`J`TA`&@`)&NH#>$`>:NT`+FK\K0Z&`?,%4"Y
+``
+end
+size 344
+::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
+8000's USER PORT & CENTRONICS PRINTERS
+by Ken Ross
+petlibrary@bigfoot.com
+http://members.tripod.com/~petlibrary
+A recent query had me digging out an old item dealing with the user port on
+the CBM/PETs.  The main use I've put it to in the past has been to drive a
+parallel printer with just the addition of a home brew cable (a Panasonic
+Daisy Wheel printer salvaged before bin men got it!).  The user port is
+the edge connection between the IEEE edge and cassette #1.  The top side
+is mostly diagnostic, the underside is the easy to use area.  It's an I/O
+(Input/Output) system that you can control with a few PEEKs and POKEs.
+Reading from left to right (as you look at the back of the beastie):
+A _ ground
+B _ input to 6522 VIA, CA1
+C D E F G H J K L _ are I/O lines (8 of them), PA0-7 [data lines]
+M _ CB2 line from VIA can be I/O
+N _ ground
+A text file to be printed out can be read a character at a time with
+MID$(etc) for this PRG to deal with, and quite high speeds can be reached
+even without having to compile it.
+(This is actually a section of listing just printed out from my 8096,
+hence the untidy numbers.)
+POKE 59459, 255:REM make PA0-7 into outputs
+POKE 59467,PEEK(59467) AND 227 :REM disable shift register
+RETURN :REM finished with this sub
+     [this enables the user port for this purpose]
+REM this sub puts the data into output
+if DATA <32 then goto 3080 :REM line does biz for LF & CR
+if DATA =>65 and DATA<= 90 then DATA=DATA +32 : goto 3029
+     [petscii lower case is chr$(65-90) but ascii uses 97-122]
+if DATA =>193 and DATA<= 218 then DATA=DATA -128 :goto 3029
+     [petscii upper case is chr$(193-218) which has to be shifted to
+      ascii 65-90]
+     [ascii uses up to 127 but petscii uses up to 255 for chars]
+REM line below sets strobe low to inform printer new data character on
+way
+POKE 59468, PEEK(59468) AND 31 OR 192
+REM below sets strobe high as data arrives
+POKE 59468,PEEK(59468) AND 31 OR 224
+POKE 59471, DATA:REM at last data is POKE'd !!!
+     [the data numbers from above]
+POKE 59468,PEEK(59468) AND 31 OR 224 :REM strobe high still
+REM handshake sub
+  POKE 59467, PEEK(59467) OR 1
+WAIT 59469,2
+K=PEEK(59457)
+REM end of handshake sub
+     [well it works for me!!]
+RETURN :REM back to main area for next data
+REM bit for LF & CR sub & return
+     [this depends on the printer and the same procedure for paper eject
+      if needed]
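+For readers more at home in C, the two range checks in the listing boil down
+to the small sketch below.  The function name is made up for illustration, and
+passing every other code through unchanged is an assumption (the listing sends
+codes below 32 to its own LF/CR routine instead).

```c
/* Map one PETSCII code to ASCII using the two ranges from the listing.
   Codes outside both ranges are returned unchanged (an assumption). */
unsigned char petscii_to_ascii(unsigned char c)
{
    if (c >= 65 && c <= 90)        /* PETSCII lower case -> ASCII 97-122 */
        return (unsigned char)(c + 32);
    if (c >= 193 && c <= 218)      /* PETSCII upper case -> ASCII 65-90 */
        return (unsigned char)(c - 128);
    return c;
}
```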
+The cable connections are
+CBM	- CENTRONICS
+CB2     - DATA STROBE   #1
+PA0~7   - DATA1-8       #2-9
+CA1     - ACKNOWLEDGE   #10 ( or BUSY #11 depending on printer ! )
+GND	- grounds #14, 16, 24, 33, chassis gnd 17
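+CB2 is the data strobe, and the listing drives it by rewriting only the top
+three bits of location 59468 (the 6522's peripheral control register).  The
+same masking, written out in C as a rough sketch; the function names are
+invented here and are not part of any library:

```c
/* Bits 7-5 of the 6522 PCR select the CB2 mode: 110 holds CB2 low and
   111 holds it high.  The low five bits are left exactly as they were. */
unsigned char cb2_low(unsigned char pcr)
{
    return (unsigned char)((pcr & 31) | 192);
}

unsigned char cb2_high(unsigned char pcr)
{
    return (unsigned char)((pcr & 31) | 224);
}

/* Clearing bits 4-2 of the ACR (mask 227) switches the shift register off. */
unsigned char shift_register_off(unsigned char acr)
{
    return (unsigned char)(acr & 227);
}
```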
+More modern printers will also need additional commands to enable things.
+The commands needed for Epson printers ( with the exception list of
+Epsons that don't use them !) are on my website at :
+	http://members.tripod.com/~petlibrary/printesc.htm
+If any more info turns up it'll be there in time.
+.......
+....
+..
+.                                    C=H 19
+::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
+				 Main Articles
+::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
+	  ------------------------------------------------------------
+         | Sex, lies and microkernel based 65816 native OSes. - Part 1|
+	  ------------------------------------------------------------
+			      By Jolse Maginnis
+Some readers may have read my article in GO64 issue 8/1999, which was a bit of
+an introduction to JOS and some Operating System concepts, but it wasn't very
+technical, and didn't really get into the nitty gritty. Getting down and dirty
+with the bits and bytes is what C=Hacking is all about, so that's what this
+series of articles will try to do wherever possible.
+I'll try to go into detail about modern OS designs, paying particular attention
+to what is relevant to the C64/SuperCPU and what we can do without. I'll also
+try to make comparisons to the kind of coding most of us are used to, e.g. just
+using the kernel to access hardware, or just skipping the kernel altogether.
+Most of the article will be in reference to the SuperCPU, specifically its
+CPU, and the OS I'm making for it, called JOS. If you haven't got a
+SuperCPU yet, hopefully you'll want one by the end! (Remember it won't stop you
+running stock programs!)
+	      -------------------------------------------------
+	     | OK, So what do you plan to do.. And why bother? |
+	      -------------------------------------------------
+When I first heard about the SuperCPU, I got pretty excited. "20MHz! That's 20
+times faster! 16MB! That's 256 times more RAM! I can only imagine what it's
+capable of!" Well, I didn't actually say those things, but I at least thought
+them! I had already started making an OS for the C64, and at the time I didn't
+know much at all about making an OS; all I knew about was multitasking, and
+how to do it on the C64. After that day, I decided I'd wait until I managed to
+get myself a SuperCPU and make an OS on that, and to my surprise there didn't
+seem to be anyone else developing an OS for the SuperCPU at that time.
+Only when the SCPU arrived and I had started coding for it, did I realise how
+powerful it was. Yeah it's 20 times faster in clock speed, but it's also a 16
+bit processor, which might not seem like a great step up, but once you start
+coding in 16 bits, it's hard to see how you did without it!
+The 65816 has some great advantages over the 6502:
+Its stack is not limited to 256 bytes.
+The Zero Page isn't stuck in the zero page! (It's now called the Direct Page).
+There are a few more ways to put values on the stack.
+Long addressing allows up to 16MB of directly accessible memory.
+Plenty more..
+The top three things in particular, together with the 16 bit wide registers,
+mean it's very suited to programming in a high level language like C,
+particularly when compared to code that has to be produced for 6502. Higher
+level languages can actually use the real CPU stack rather than having to
+simulate it, as with 6502. Also by moving the Direct Page register, local
+variables can be accessed like zero page variables, so performance isn't hurt
+too much.
+All this would be good even at a lower speed like 1 or 2MHz, but it's at 20!
+The SuperCPU adds some real power to your old C64, but it's all hidden away
+because we're running a ~20 year old "OS". It's just crying out for a new one!
+The C64 has many limitations, most of which are imposed by the kernel and the
+CBM serial bus. Here's a list of the main limits:
+Single Tasking - Running two separate programs at the same time is impossible.
+Some devices aren't catered for - Some devices don't have a chance at running
+with old programs that were designed before their time.
+Old sequential filesystem - It's not designed for random access files; although
+random access is possible, it's just slower. All C64 programs have to be
+written so that files are read from the beginning to the end, which is a little
+bit limiting. Also it's the drives that dictate the filesystem, so we aren't
+just stuck with the kernel's limits, we're stuck with the drives' as well.
+Having several files open on many drives, while reading and writing to all of
+them, just isn't a possibility. Why would you want to do that? If you were
+multitasking several programs, that just might be what happens!
+It became pretty clear that the C64's kernel was of no use to JOS, since it had
+too many limitations. So everything had to be re-written from scratch, with the
+limits removed.
+Along with re-doing the filesystem and adding multitasking, I had some other
+plans for JOS:
+Networking - Everything is internet, internet, internet these days, and why not,
+the internet is great! So TCP/IP and SLIP/PPP were high on the list of TODO's.
+GUI - The SuperCPU is ideal for a nice, flexible, easy to program GUI.
+Console - I wanted the console to be as close as possible to one of the standard
+terminals (vt100,ansi etc..) thus making it easy to get by without needing a
+terminal emulation program.
+Shared libraries & shared code, relocatable binary format - Sharing as much code
+as possible really saves memory and loading time. The binary format means that
+you don't have to worry about where in memory your program will be.
+Modular and scalable - It's nice to be able to choose exactly what your OS
+needs, rather than getting lumped with it all. E.g. Do you really need TCP/IP
+loaded if you're not going to use the internet? If I'm running a webserver, do I
+really need the console driver loaded?
+Device independence - Applications should not have to worry at all about what
+devices they are using, which means that they'll be compatible with any device,
+including new ones. This is particularly useful when it comes to disk drives and
+filesystems.
+Porting and writing C programs - Wouldn't it be great if our C64's could take
+advantage of the Open Source movement that's sweeping the world, and compile
+some of these open source programs?
+OK, so why am I bothering? At first I just wanted to see what I could do with
+it, but now that it's come so far, it's not only of interest to me, as it's
+become a very powerful OS.
+	      -------------------------------------------------
+	     |             Bloat: My layers theory             |
+	      -------------------------------------------------
+Unless you've been living on a remote desert island for the last 5 years, you'll
+know about the terrible trend in personal computing these days; buy a new PC now
+and in 6 months or less it's outdated. As CBM users, we successfully avoid all
+this. Sure, CMD have tonnes of upgrades available, but they're all "once in a
+lifetime" upgrades, I'm pretty sure I won't be upgrading my SuperCPU!
+Have you ever thought about why PC's become outdated so quickly? It's very
+popular to blame Microsoft (and I will!), since they are the main proponent of
+bloat with their ever expanding OSes and applications, but it's just generally
+accepted now that it's ok to leave things unoptimized, and just add more and
+more "layers". I run Linux on my 486 PC, with 10mb of RAM, and it's unbelievable
+how much time is spent "chunking" or "thrashing", due to programs and their
+components taking up so much RAM. For me, it's all about layers. It's what
+separates C64's from the bloated world of the PC. Here's my comparisons...
+CPU Type
+--------
+PC           - 32 bit processors
+C64/SuperCPU - 8/16 bit Processor
+This is quite arguable, but when most of your code doesn't deal with numbers
+over 32768, 32 bits can be a bit wasteful, but of course if you need to do 32
+bit arithmetic on an 8 or 16 bit processor, that too is wasteful. For me a 16
+bit processor is the ideal size, particularly after doing lots of 8 bit coding.
+Language used
+-------------
+PC           - Mainly C, C++
+C64/SuperCPU - Just about everything in Assembler
+C can be a thin layer or a thick layer, depending on the processor. On 6502 it's
+quite a thick layer, which is why most things for C64 were written in ASM. On
+65816, that layer isn't so thick, so it's a much more viable alternative.
+Although, when you write in a higher level language, you tend to forget about
+the actual code it produces, and don't bother optimizing it. C++ adds another
+layer onto C, not only because of the code it produces, but the style of
+program. Good object oriented programming practice adds extra bloat, because
+there is more emphasis on doing function calls, to do things that ordinarily are
+done by directly accessing the data. The real bloat of Object Orientation isn't
+actually the code that you write yourself, you can still write optimized code in
+an OO language, but the bloat is in the libraries of objects that you use when
+writing your application, take a look at Java's huge object libraries for
+example.
+OS type
+-------
+PC           - Multitasking OS
+C64/SuperCPU - Kernel, or no OS at all.
+A multitasking OS adds some layers by default, since it has to switch between
+processes. The OS isn't just the task switcher however, it's everything that's
+needed to run applications, such as device drivers and shared libraries. In my
+opinion, absolutely none or as little as possible of the OS should be written in
+a high level language, since it's going to be used by every application, and you
+want frequently used things to be as optimized as possible. By far the
+most useful task an OS can provide is doing all the Disk I/O. Unfortunately for
+us, the C64's kernel and CBM's serial bus are nowhere near fast enough, so
+coders made their own DOS routines.
+User Interface
+--------------
+PC          - Windows, X Windows
+C64         - BASIC, GEOS
+Windows and X are the most popular GUI's going around. X doesn't impose any
+standards on applications, they are free to use whatever widget toolkits they
+want, and usually do! When you have a few different applications running, each
+with its own GUI toolkit, you soon run out of memory, particularly if they're
+big bloated C++ toolkits. Windows isn't quite the same, you at least have a
+consistent look and feel, which also adds up to less memory wastage because most
+apps use the same code. GEOS is nice looking but isn't very flexible at all, but
+this does mean that it's a very thin layer. My hope is to achieve a balance
+between the two.
+So why'd I bother with all that? Well I just want to highlight that JOS will be
+taking all those things into account, and I want to minimize the amount and size
+of layers being added to our beloved C64's.
+	      -------------------------------------------------
+	     | Monolithic or Micro? How do we want our kernel? |
+	      -------------------------------------------------
+There are two main styles of OSes doing the rounds at the moment, both with
+their own good and bad points.
+Monolithic kernels
+------------------
+These, as the name suggests, are one large monolith of code, which usually
+contains driver code for all devices. You would definitely consider the C64's
+kernel a monolithic kernel. Multitasking kernels sometimes allow
+modularization, which is basically very similar to what a microkernel does, by
+allowing parts of the OS to be dynamically loaded. Linux is a very popular
+example of this. It's a monolithic kernel which allows kernel modules to be
+loaded dynamically. Last time I checked Lunix Next Generation worked along these
+lines.
+Good    - Generally a little faster than Microkernels, particularly if the time
+	  taken to switch processes is slow.
+Bad     - Not as scalable as a Microkernel. You get everything in a big chunk,
+	  whether you need it or not.
+How     - Generally applications need to make calls to a jump table, which
+	  usually will point to routines for Opening, Closing, Reading and
+	  Writing devices.
+          e.g.
+                lda #'a'
+	  	jsr $ffd2
+	  Prints 'a' character to the current file/device.
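+	  The same jump table idea can be sketched in C with an array of
+	  function pointers.  Everything named here (the entries, the two
+	  routines, kernel()) is hypothetical and only stands in for the way a
+	  fixed table hides the real driver routines from the caller:

```c
/* Hypothetical driver routines; the first one just records its argument. */
int last_byte = -1;

int screen_write(int ch) { last_byte = ch; return ch; }
int null_write(int ch)   { (void)ch; return -1; }

/* The "kernel" publishes one fixed table of entry points. */
enum { K_CHROUT, K_NULL, K_COUNT };
typedef int (*kernel_call)(int);
const kernel_call jump_table[K_COUNT] = { screen_write, null_write };

/* Applications only ever name an entry number, never a routine address. */
int kernel(int entry, int arg)
{
    if (entry < 0 || entry >= K_COUNT)
        return -1;
    return jump_table[entry](arg);
}
```

+	  The driver behind an entry can change without the caller knowing, but
+	  every routine in the table is still linked in whether it is used or not.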
+Microkernel
+-----------
+Microkernels truly are micro in size, if they're done correctly. Rather than
+lump all the device driver and API code in together, Microkernels only provide
+very simple services for setting up processes and allowing them to communicate
+with each other. All the device drivers and file-systems are then supplied by
+optional programs that are loaded dynamically at run time. This allows maximum
+scalability, as you simply don't have to load parts of the OS that you don't
+need. The best example other than JOS would be QNX (http://www.qnx.com), a UNIX
+based Microkernel OS, which is extremely scalable and very small in code size.
+On 6502/C64, OS/A65 is another Microkernel OS.
+Microkernel OSes rely heavily on fast Inter Process Communication (IPC). Luckily
+this is quite easy to achieve on 65816, and is basically a matter of passing
+pointers between processes.
+Good    - Extremely scalable. Nicely split up into easily manageable parts.
+          Easier to debug. I chose a Microkernel in JOS for these reasons.
+Bad     - Can be slower if too much time is spent switching between processes.
+How     - A jump table is still used, but to actually do any I/O you need to
+          communicate with the server process via IPC.
+	  To do this in JOS it involves setting up a message somewhere in memory
+	  and then calling the S_send system call, to send to the server
+	  process. Usually the message will be put on the stack and then popped
+	  off when returned, much like a C function call.
+	  e.g. to open the file "hello.txt" for reading
+		pea O_READ         ; flags
+	  	pea ^hellostr      ; high byte
+		pea !hellostr      ; low word
+                pea IO_OPEN        ; Message code
+		tsc
+		inc
+		tax                ; Low word of Message = Stack+1
+		ldy #0		   ; Stack is in Bank 0
+		lda #Channel	   ; Channel where "hello.txt" is.
+		jsr @S_send
+		tsc
+		clc
+		adc #8
+		tcs
+	 hellostr .asc "hello.txt",0
+         note: These are 65816 instructions, so if you don't know what they do
+	 you better look them up! The '@' symbol is used to force long
+	 addressing, '^' is used for the high 8 bits of a 24bit address, and '!'
+	 is used as the bottom 16 bits.  Note that pea is a 16-bit instruction,
+	 so pea ^hellostr will add an extra 00 byte.
+	 The first 4 pea's prepare an 8 byte filesystem message, containing:
+	 Message code for an Open:	IO_OPEN
+	 24 bit Pointer to Filename:	hellostr
+	 Open flags for reading:	O_READ
+	 This message is passed to the filesystem using one of JOS's Inter
+	 Process Communication (IPC) system calls, S_send. This call takes the
+	 24 bit address of the message in X/Y, and the IPC channel to send the
+	 message to in the A register. Every system call in JOS
+	 assumes 16 bit A/X/Y registers, as there really isn't anything to be
+	 gained by switching to 8 bits for things that only need 8 bits. Adding
+	 8 to the stack pointer at the end "pops" the message back off the
+	 stack.
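+	 To make the layout concrete, here is a rough C sketch of the bytes the
+	 four pea's leave in memory, lowest address first.  The numeric values
+	 of IO_OPEN and O_READ are placeholders (the article doesn't give
+	 them), and build_open_msg is a name made up for this example:

```c
#include <stdint.h>
#include <stddef.h>

/* Placeholder values; the real JOS constants are not given in the text. */
enum { IO_OPEN = 1, O_READ = 1 };

/* Fill buf with the 8-byte open message in memory order: message code,
   low word of the pointer, bank byte plus a zero pad byte, then flags.
   Each 16-bit field is little-endian, which is how pea leaves it. */
void build_open_msg(uint8_t buf[8], uint32_t name_addr, uint16_t flags)
{
    uint16_t words[4];
    words[0] = IO_OPEN;
    words[1] = (uint16_t)(name_addr & 0xFFFF);        /* !hellostr */
    words[2] = (uint16_t)((name_addr >> 16) & 0xFF);  /* ^hellostr + pad */
    words[3] = flags;
    for (size_t i = 0; i < 4; i++) {
        buf[2 * i]     = (uint8_t)(words[i] & 0xFF);
        buf[2 * i + 1] = (uint8_t)(words[i] >> 8);
    }
}
```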
+	 This all looks a bit complicated doesn't it? Which is where shared
+	 libraries help out. The standard C library for JOS allows you to do I/O
+	 and such without actually worrying about the system calls. Yes it is a
+	 "layer", but it's a very thin one, since the library is written in ASM.
+	       pea O_READ	; same as the c code: open("hello.txt",O_READ);
+	       pea ^hellostr
+	       pea !hellostr
+	       jsr @_open
+	       pla
+	       pla
+	       pla
+	 Much simpler right?
+	 Compare that with the C64 kernel equivalent of:
+	       lda #namelen
+	       ldx #<hellostr
+	       ldy #>hellostr
+	       jsr $ffbd	; SETNAM
+	       lda #1
+	       ldx #8
+	       ldy #1
+	       jsr $ffba	; SETLFS
+	       jsr $ffc0	; OPEN
+	 Notice that the JOS version doesn't worry about device numbers or
+	 anything.. I'll get to that later...
+	         ---------------------------------------------
+	        |       C isn't just the letter after B       |
+	         ---------------------------------------------
+Before I get into juicy OS details, I should explain about C and the standard C
+library, as I'll be mentioning it quite a bit.
+C is a very powerful language that was created by the same people who created
+UNIX, so the two really go hand in hand. The majority of applications written
+for UNIX type OSes are written in C; in fact, rather than give you executable
+files, they are normally distributed as C source code, that you have to compile
+yourself. Why is it used so much? Well if the only high level language you've
+seen is BASIC, then you'd wonder how any high level language could be used for
+good quality programs. C is different because it's just about as close as
+you can get to programming in assembly without actually doing it, particularly
+on newer processors. It isn't quite so pretty on 6502, but it's quite good on
+the 65816.
+In BASIC you're used to having "built in" commands that will print to the
+screen, and commands for opening files and reading input, and any other I/O
+you can think of. But C on the other hand, has nothing "built in", it doesn't
+even have much of a notion of strings! Strings are just pointers to null
+terminated arrays of characters in C. So how do you actually get C to do
+anything useful? i.e. do some I/O?
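+To see how little a C string really is, here is a minimal sketch of a length
+routine that does nothing but walk the array until it meets the terminating
+zero.  The name my_strlen is made up so it doesn't clash with the library's
+own strlen:

```c
#include <stddef.h>

/* Count the characters before the terminating zero byte. */
size_t my_strlen(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}
```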
+This is where the C standard library comes in. This library contains functions
+that deal with the underlying OS, and in particular opening/closing &
+reading/writing files. It also has code for dealing with strings, allocating
+memory, reading directories and various other useful functions. The standard
+library also contains more UNIX orientated functions, for dealing with OS
+features such as IPC and process control (more on processes later).
+JOS implements a large section of the standard C library, in particular the
+section that most command line applications will use. It does implement some of
+the UNIX specific functions, but not in a compatible way, and programs that use
+these functions are likely to be system applications that aren't useful for any
+other system anyway.
+Although it's called the standard 'C' library, that doesn't mean it can't be
+used in assembly language, in fact it's quite a bit easier to call the C
+functions than to deal directly with the OS, and there is no speed penalty in
+using the C library because it's been hand coded in assembly language anyway.
+Would you like to see what it's like to code using the standard C library? I've
+been talking about functions, and if you're familiar with C64 BASIC's functions,
+it's quite similar to that, except that you can pass more than one value to the
+function. It's basically the same as writing subroutines in assembly, where we
+usually pass values using the A,X & Y registers or a ZP value etc.. The only
+difference is that ALL values are passed using the CPU stack, which is easily
+accessible with the 65816. Ok let's take a look at the previous open file
+example:
+C code:        file = open("hello.txt",O_READ);
+assembly (16 bit regs):
+	       pea O_READ
+	       pea ^hellostr
+	       pea !hellostr
+	       jsr @_open	; C functions get "_" prepended to their names
+	       pla		; so you don't get them mixed with assembly ones
+	       pla
+	       pla
+	       stx file		; store the result in file
+	       sty file+2
+Notice that the values are placed onto the stack in reverse order, so they come
+out in the correct order when the function accesses them. The calls are also
+long jsr's because the functions aren't likely to be in the same bank as the
+calling program.
+You might think that having to pop the values back off the stack is cumbersome,
+and you're right. Why can't _open pop them off? Well it could, it'd need to do
+some messing around with the stack at the end but it'd make things look nicer.
+The reason it can't is because C functions don't always know how much data will
+be on the stack, so they might pop the wrong amount off. It may look ugly, but
+you get used to it.
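+A small C sketch shows why only the caller knows how much to pop.  The function
+below (sum_ints is a made-up name) has to trust its first argument to learn how
+many values follow; if the caller pushes more than that, the function never
+finds out:

```c
#include <stdarg.h>

/* Add up `count` ints.  The function cannot see how many arguments were
   really passed, so it could not safely clean up the stack itself. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}
```

+Calling sum_ints(2, 10, 20, 30) passes three values after the count but only
+two are read, so only the caller can remove all of them again afterwards.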
+Now I'll give you a bigger example of what C code looks like after it's been
+compiled, to show that a compiler can produce half decent code for the 65816.
+This will probably only make sense if you've done C programming before, so if
+you're not interested in this kind of thing skip this section..
+Here's a minimal version of the standard unix util 'cat', which concatenates
+files together and sends them to the screen or whatever the stdout file is, as
+it can be redirected in UNIX.
+#include <stdio.h>
+#include <stdlib.h>
+int main(int argc, char *argv[]) {
+	FILE *fp;
+	int ch=0;
+   	int upto=1;
+	if (argc<2) {
+		fprintf(stderr,"Usage: cat FILE ...\n");
+		exit(1);
+	}
+	argc--;
+   	while(argc--) {
+		fp = fopen(argv[upto++],"r");
+		if (!fp) {
+			perror("cat");
+			exit(1);
+		}
+		while((ch = fgetc(fp)) != EOF)
+			if (putchar(ch) == EOF) {
+				perror("cat");
+				exit(1);
+			}
+	   	fclose(fp);
+	}
+	return 0;
+}
+and here's the (unoptimized) compiled version:
+#define _AS sep #$20:.as
+#define _AL rep #$20:.al
+#define _XS sep #$10:.xs
+#define _XL rep #$10:.xl
+#define _AXL rep #$30:.al:.xl
+#define _AXS sep #$30:.as:.xs
+	.xl		; make sure it's 16 bit code
+	.al
+	.(
+mreg 	= 1
+mreg2 	= 5
+	.text
++_main
+-_main:
+	.(
+RZ 	= 8		; RZ = register size: Two pseudo 32 bit registers
+LZ 	= 26		; LZ = Local size: size of the local variables for this
+			; function
+	phd
+	tsc		/* make space for local variables */
+	sec
+	sbc #LZ
+	tcs
+	tcd		/* set up the DP register as the frame pointer */
+	stz RZ+1	/* ch = 0; */
+	lda #1		/* upto = 1; */
+	sta RZ+7
+	lda LZ+6	/* if (argc < 2)  NOTE: could be just      */
+	.(		/* cmp #2 : bpl L2                         */
+	cmp #2		/* but the compiler doesn't know how far   */
+	bmi skip	/* away L2 is.				   */
+	brl L2
+skip 	.)
+	pea ^L4		/* fprintf(stderr,"Usage: cat FILE ...\n"); */
+	pea !L4
+	pea ^___stderr
+	pea !___stderr
+	jsr @_fprintf
+	tsc
+	clc
+	adc #8
+	tcs
+	pea 1		/* exit(1) */
+	jsr @_exit
+	pla
+L2:
+	lda LZ+6	/* argc-- NOTE: dec LZ+6 would be better! */
+	dec
+	sta LZ+6
+	brl L6
+L5:
+	pea ^L8		/* This rather large bit of code is all for */
+	pea !L8		/* fopen(argv[upto++],"r");		    */
+	lda RZ+7	/* arrays don't translate so well! */
+	sta RZ+9
+	lda RZ+9
+	inc
+	sta RZ+7
+	ldx RZ+9
+	lda #0
+	.(
+	stx mreg2
+	ldy #2
+	beq skip
+blah 	asl mreg2
+	rol
+	dey
+	bne blah
+skip 	ldx mreg2
+	.)
+	clc
+	tay
+	txa
+	adc LZ+8
+	tax
+	tya
+	adc LZ+8+2
+	sta mreg2+2
+	stx mreg2
+	lda [mreg2]
+	tax
+	ldy #2
+	lda [mreg2],y
+	pha
+	phx
+	jsr @_fopen
+	tsc
+	clc
+	adc #8
+	tcs
+	stx RZ+11
+	sty RZ+11+2
+	ldx RZ+11	/* assign it to fp */
+	lda RZ+11+2
+	sta RZ+3+2
+	stx RZ+3
+	.(		/* if (!fp) */
+	lda RZ+3
+	cmp #!0
+	bne made
+	lda RZ+3+2
+	cmp #^0
+	beq skip
+made 	brl L13
+skip 	.)
+	pea ^L11	/* perror("cat"); */
+	pea !L11
+	jsr @_perror
+	pla
+	pla
+	pea 1		/* exit(1) */
+	jsr @_exit
+	pla
+	brl L13
+L12:
+	pei (RZ+1)	/* putchar(ch); */
+	jsr @_putchar
+	pla
+	stx RZ+15
+	lda RZ+15	/* if (putchar(ch) == EOF) */
+	.(
+	cmp #-1
+	beq skip
+	brl L15
+skip 	.)
+	pea ^L11	/* perror("cat"); */
+	pea !L11
+	jsr @_perror
+	pla
+	pla
+	pea 1		/* exit(1) */
+	jsr @_exit
+	pla
+L15:
+L13:
+	pei (RZ+3+2)	/* fgetc(fp); */
+	pei (RZ+3)
+	jsr @_fgetc
+	pla
+	pla
+	stx RZ+17	/* ch = fgetc(fp); */
+	lda RZ+17
+	sta RZ+1
+	lda RZ+17	/* while ((ch = fgetc(fp)) != EOF) */
+	.(
+	cmp #-1
+	beq skip
+	brl L12
+skip 	.)
+	pei (RZ+3+2)	/* fclose(fp); */
+	pei (RZ+3)
+	jsr @_fclose
+	pla
+	pla
+L6:
+	lda LZ+6	/* while(argc--) */
+	sta RZ+9
+	lda RZ+9
+	dec
+	sta LZ+6
+	lda RZ+9
+	.(
+	cmp #0
+	beq skip
+	brl L5
+skip 	.)
+	ldx #0		/* return from main() */
+L1:
+	tsc
+	clc
+	adc #LZ
+	tcs
+	pld
+	rtl
+	.)
+	.text
+-L11 	.asc "cat",0
+-L8 	.asc "r",0
+-L4 	.asc "Usage: cat FILE ...",10,0
+	.)
+As you can see, there's still quite a bit to be optimized as far as the compiler
+is concerned, but the code is still quite good.
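+Most of the bulk in the fopen(argv[upto++],"r") section is one address
+calculation: the index is shifted left twice (times 4, since each pointer
+takes 4 bytes) and added to the 32 bit base held in LZ+8.  In C the same
+arithmetic looks roughly like this; element_address is a name invented for
+the sketch:

```c
#include <stdint.h>

/* Address of element `index` in an array of 4-byte pointers starting at
   `base`.  Two left shifts stand in for the multiply by 4, and the add
   carries from the low word into the bank byte just as the adc pair does. */
uint32_t element_address(uint32_t base, uint16_t index)
{
    uint32_t offset = (uint32_t)index << 2;
    return base + offset;
}
```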
+Having a C compiler and a standard C library that contains the most used
+standard functions, is going a long way towards being able to port UNIX's and
+other similar environments' applications. So what I've done is create a 65816
+backend for a free ANSI C compiler called LCC.
+I'm no longer talking theory here either, since a little while ago I decided to
+give my standard C library and the compiler a test on portability, with some
+great results. I've managed to do extremely simple porting jobs on: Pasi's C
+versions of his gunzip and puzip, Andre Fachat's XA 6502/65816 cross assembler,
+and Marco Baye's ACME cross assembler. All of these, besides ACME, so far seem
+to be working exactly how they should. There'd be thousands of open source
+programs that could easily be ported to JOS, many of which wouldn't be of much
+use to anyone, but still!
+	         ---------------------------------------------
+	        | Multitasking - Seeming to do it all at once.|
+	         ---------------------------------------------
+We've all had experience with multitasking so I won't bore you too much.
+For our purposes, it means being able to do several things at once.
+But what actually is a "thing"? They're usually called "processes" or "tasks". I
+usually call them processes, so that's what I'll refer to them as.
+There are two main types of multitasking, pre-emptive and co-operative. The
+latter is as you would expect, processes need to co-operate together in order to
+work, processes can't "do their own thing". Pre-emptive multitasking is the more
+flexible approach, because processes don't need to explicitly hand over the
+processor to another process, they just have it taken away from them if they use
+it for too long. So it was a pretty easy choice for which kind of multitasking
+JOS would have, pre-emptive of course!
+You might think that the C64 already does multitasking because programs normally
+set up interrupt routines to go off during the processing of the program, so it
+can do more than one thing, but that's a very special case of what I'm
+referring to here. I'm referring to the ability to run separate unrelated
+programs at the same time, like reading your email, and typing in a text
+editor. We'd all like to be able to do that wouldn't we? Particularly if we've
+got the processing power and RAM to do it, and the SuperCPU certainly does.
+Each process "owns" resources. The resources I'm talking about are simply parts
+of the computer and OS like RAM, interrupts, kernel IPC objects, and some other
+things.
+Along with the resources it owns, each process has a number of attributes. First
+of all it needs a unique identifier, so anything that wants to talk to it knows
+how to address it. In Unix-like systems, this is called a Process IDentification
+(PID). In JOS a PID is just a positive integer, simple.
+Along with other processes being able to address it, the PID is used so that the
+OS can keep track of which resources the process actually owns, and when it
+exits (or is explicitly killed) the OS can free up those things and let other
+processes use them.
+Processes can start other processes, so everything except the first process
+keeps track of who its parent was in its Parent PID (PPID). You may wonder
+what use it is to keep track of the parent? It's always been used in UNIX to set
+up IPC, but it really isn't needed in JOS, apart from cosmetic purposes, since
+JOS has better IPC mechanisms. That's the first example of "Just because it's in
+UNIX doesn't mean it's needed", and there are plenty of others.
+In JOS, a process can own multiple "threads" of execution. Threads are most
+people's idea of what a process is: some code running.
+Consider starting a C64 game, which has several different interrupt routines
+running concurrently. We certainly wouldn't consider each interrupt routine to
+be a separate program, and that's generally the idea behind threads, except
+threads are at the mercy of the pre-emptive scheduler. Almost the same result
+can be achieved by creating multiple processes, but why go to the hassle of
+loading and executing two tightly related processes with 1 thread each, when
+you can do the same thing with 1 process that has 2 threads? A good example
+of this is JOS's very own web server, which creates new threads whenever a
+new connection has been established by a client.
+Some new technologies are particularly keen on the use of threads, namely Java
+and the BeOS. A good example of using multiple threads is given by BeOS, which
+starts a separate thread for every window displayed on the screen, so it can
+update its on-screen appearance and remain responsive to the user, while also
+doing other processing.
+Unix programs have generally just started other processes if they wanted to do
+two of their own things at once. Threads are much cleaner and nicer. Threads
+themselves have their own attributes, such as priority (the higher the priority
+the more processor time it's likely to get), state (whether they are running or
+waiting for something), stack and zero page space, and some other things.
+I know I've mentioned that JOS uses pre-emptive multitasking, but that doesn't
+mean that doing:
+		jmp *
+is a good idea! Programs should still try and co-operate.
+A typical menu program on C64 using the kernel has a structure something like
+this:
+1. Setup variables and interrupts
+2. Set up menu
+3. Check for input
+4. If no input go back to 3
+5. Process input
+If you were to run this program on a multitasking system, it would chew up a lot
+of processing time and slow everything else down. Polling for input on a
+multitasking system is generally a bad thing, but blocking and waiting for input
+is a good thing. So instead it would be best to do:
+1. Setup variables and interrupts
+2. Set up menu
+3. Wait for input
+4. Process input
+Now this is the correct way to do it, as it only uses up cpu time when it's
+actually received some input. But what happens if every process is waiting? What
+runs then? Well there is a special process that runs when no other processes
+are. It's called the Idle process, and does what its name suggests: it just sits
+there and idles. Here is the thread code that runs in my idle process:
+nully		jmp nully
+For some reason I started calling it the Null process, and it's called that all
+throughout JOS...
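+As a rough illustration of the idea (not JOS's actual scheduler, which also
+has priorities to think about), here is a C sketch of a round-robin pick that
+falls back to the idle thread only when everything else is blocked.  The
+names and the choice of slot 0 for the idle thread are assumptions made for
+the example:

```c
#include <stddef.h>

enum thread_state { READY, BLOCKED };

/* Return the index of the next READY thread after `current`, searching
   round-robin.  Slot 0 is reserved for the idle (Null) thread, which is
   only chosen when every other thread is BLOCKED.  count must be >= 1. */
size_t pick_next(const enum thread_state *threads, size_t count,
                 size_t current)
{
    for (size_t step = 1; step <= count; step++) {
        size_t i = (current + step) % count;
        if (i != 0 && threads[i] == READY)
            return i;
    }
    return 0;
}
```

+A program that polls never leaves the READY state, so it is picked every time
+round; one that waits for input is BLOCKED and costs nothing until it wakes.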
+I have introduced you to a couple of the main ideas behind multitasking, but
+wouldn't you like to know how it's done? Well, here's how JOS does it.
+For starters, since it's pre-emptive multitasking, JOS needs some way of
+interrupting the currently running process after it has consumed its allotted
+time. The C64 has 4 CIA timers capable of producing IRQ's and NMI's, and in
+JOS's case I've decided to let it use CIA 1 Timer A, which produces an IRQ. This
+of course means that a process could stop itself from being interrupted by doing
+an SEI, but if processes behave well that won't happen!
+Rather than set this timer to the amount of time before a process should be
+pre-empted (called a "timeslice"), I double up the use of Timer A as the system
+counter, which is used for timing another kind of process resource: timers.
+Timers can either count upwards, or count downwards and give off an alarm. They
+really need a higher precision than a timeslice, so the system timer is set to
+20 milliseconds (about 1 PAL screen). The timeslice is then calculated as 3
+counts of this timer, i.e. 60 milliseconds. Why don't I use Timer B for the
+system timer? Well, because I want to leave as many resources as possible open
+for application and device driver processes.
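+As a rough worked example of the numbers above (a Python sketch, not JOS code):
+if one assumes the CIA timer counts cycles of the stock PAL system clock, about
+985248 Hz, the 20 millisecond tick and the 60 millisecond timeslice work out as
+below. The clock figure and all names here are assumptions for illustration.

```python
# Sketch only: assumes the CIA counts stock PAL C64 system clock cycles.
PAL_CLOCK_HZ = 985248        # assumed clock rate, not taken from the article
TICK_MS = 20                 # system timer period described above
TICKS_PER_TIMESLICE = 3      # a timeslice is 3 counts of the system timer

# Number of clock cycles in one 20 ms system tick.
cycles_per_tick = round(PAL_CLOCK_HZ * TICK_MS / 1000)

# Length of one timeslice in milliseconds.
timeslice_ms = TICK_MS * TICKS_PER_TIMESLICE

print(cycles_per_tick, timeslice_ms)
```

+The cycle count fits comfortably in a 16-bit timer, which is presumably why a
+single timer can serve both purposes.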
+I mentioned that processes and threads each have their own attributes. These
+attributes are stored in Process Control Blocks (PCB's) and Thread Control
+Blocks (TCB's).
+Every process has a PCB, and every process has at least one thread, which has
+its own TCB. There is one process which is always loaded, and that's the Null
+process. Each process's PCB and TCB's are contained in everyone's favourite data
+structure, the circular (or doubly) linked list. The Null PCB is always at the
+head of the PCB list, and PCB's will only ever be on this one list, since they
+are either alive (in the list), or dead (no PCB exists!).
+Threads on the other hand can be in various states, but in particular they can
+be ready for the CPU, or waiting for something (blocked). When a thread is
+ready, it's just waiting for its turn at the CPU, and it goes on the Ready
+list, which is a queue. The Null thread is ALWAYS at the back of this queue, so
+it only gets to run if nothing else can. The ordering of this queue is up to a
+part of the kernel called the Scheduler.
+	    Front				 	Back
+	 ------------	       ------------	    --------------
+    --	|  Thread A  |  ----- |  Thread B  | ----- |  Null Thread |  --
+   |	 ------------	       ------------	    --------------     |
+    -------------------------------------------------------------------
+Some OSes have complex schedulers which take in many parameters, like priority
+and various CPU time measurements. On multi-user OSes like UNIX, this is
+important because the OS wants to be "fair" to all processes. But for our
+purposes, and in many other OSes, it's a whole lot simpler than that: whichever
+process/thread has the highest priority gets to run. If two threads have the
+same priority, it normally comes down to "round robin" scheduling, where they
+just take it in turns. JOS doesn't even implement priorities properly yet,
+because they don't actually make much difference to normal processing. At the
+moment it's just a simple round robin scheduler that doesn't care about
+priorities.
+What if a thread is blocked? It'll go onto a wait queue, and will return to the
+ready queue only when it's ready to run. At this stage of JOS, the only thing a
+thread will need to block for is IPC.
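+The Ready list, the wait queue and the 3-tick timeslice can be modelled in a
+few lines of Python. This is only a sketch of the idea, not how JOS implements
+it: the class, the method names and the use of strings as threads are all made
+up for illustration.

```python
from collections import deque

NULL = "null"            # stands in for the Null thread
TICKS_PER_SLICE = 3      # 3 x 20 ms ticks = one 60 ms timeslice

class Scheduler:
    def __init__(self):
        self.ready = deque([NULL])   # the Null thread is always at the back
        self.waiting = set()         # blocked threads
        self.current = None
        self.ticks = 0

    def make_ready(self, thread):
        """Queue a thread just ahead of the Null thread."""
        self.waiting.discard(thread)
        self.ready.insert(len(self.ready) - 1, thread)

    def dispatch(self):
        """Give the CPU to the thread at the front of the Ready list."""
        if self.current not in (None, NULL):
            self.make_ready(self.current)      # pre-empted: back of the line
        if self.ready[0] == NULL:
            self.current = NULL                # nothing else can run
        else:
            self.current = self.ready.popleft()
        self.ticks = 0

    def block(self):
        """The running thread waits for something, e.g. an IPC reply."""
        self.waiting.add(self.current)
        self.current = None
        self.dispatch()

    def tick(self):
        """Called once per system timer interrupt."""
        self.ticks += 1
        if self.ticks >= TICKS_PER_SLICE:
            self.dispatch()

sched = Scheduler()
sched.make_ready("A")
sched.make_ready("B")
sched.dispatch()
order = [sched.current]
for _ in range(6):                   # six ticks = two full timeslices
    sched.tick()
    order.append(sched.current)
print(order)
```

+With two ready threads the CPU alternates between them every three ticks, and
+the Null thread only runs once both have blocked.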
+You may be wondering about the issue of relocatable code, as we all know neither
+the 6502 nor the 65816 is designed for running relocatable code. Sure, branches are
+relative to the PC, but nothing else is. So everything needs to be physically
+relocated before executing, and to do this properly without needing to code in a
+specific way, a relocatable binary format is needed. Fortunately for me, Andre
+Fachat had already designed such a format for OS/A65, and it fits JOS nicely
+because it includes 65816 extensions. Of course you need a special assembler to
+output this file format, which is where XA comes in. XA now even compiles for
+JOS, so self hosted development is now possible. The binary format will be
+talked about in greater detail in a future article.
+Well, it's all very fine having a bunch of processes running, but that's no
+operating system.. Who's looking after the devices? Who's managing the memory?
+And how do we ask the drivers to do something for us? It's all IPC...
+	    --------------------------------------------------
+	   | Inter Process Communication - Let's get talking! |
+	    --------------------------------------------------
+Before I get into the specifics of IPC, I should give an idea of what typically
+happens when JOS boots. Because JOS has a very scalable microkernel design, it
+can load as many different device drivers and applications at boot time as it
+wants, and in fact they can be loaded and removed at any time. So there is no one
+bootup procedure in JOS. There are certain things that happen every time,
+however.
+For starters, JOS has 2 system processes, which are always started at bootup.
+They aren't actually loaded off disk because they are part of the microkernel
+code. One is the memory manager and the other is the process manager.
+The memory manager, as you would expect, manages all the memory, but it doesn't
+manage the Process space memory (Bank 0); that's the job of the Microkernel.
+Process space memory (or kernel memory) is where all the PCB's, TCB's, Stack
+space and Direct Page space are located. The memory manager manages all the
+other RAM, e.g. RAM in Bank 1 and above, although, if there is no SuperRAM, it
+allocates 00e000-010000 as system RAM instead of using it as kernel space RAM,
+since it's more likely that you will run out of system RAM than kernel
+RAM.
+I won't go into the specifics of the Memory Manager just yet; I'll just tell you
+that it handles the following requests:
+Allocate any size block of RAM.
+Free RAM.
+Allocate any size block of Bank Aligned RAM. (Needed in some cases).
+Reallocate RAM.
+See how much RAM is left.
+See what the largest block is.
+All these things are requested via IPC, but there are Shared library routines
+(such as malloc, free, realloc etc) for preparing the right IPC messages to
+send.
+The process manager's main functions are loading new processes + shared
+libraries, and looking up device drivers & file-systems. Whenever you open a
+file, you first must send a message to the process manager asking it where to
+send the open message.
+The very first process to start however is called the "init" process (it's
+actually built into the microkernel, "init" isn't a filename), which starts the
+system processes, then it starts a simple Ramdisk process and loads another
+process from the ramdisk called "initp".
+The "initp" process should then load a proper filesystem and disk device driver,
+also from the ramdisk, "mount" this filesystem, and execute another file,
+this time called "init".
+Note that "mounting" is preparing a filesystem for use, and all filesystems
+should actually be "unmounted" before switching off, because all changes may not
+be actually written to disk yet, even though the applications think they are.
+I'm guessing this is why Macintoshes refuse to let you take a disk out without
+the OS's permission!
+The "init" file will usually be a shell script, and is responsible for starting
+up most of the drivers. A shell script, if you've never heard of it, is a file
+that has lists of commands to be run by the system, or more specifically the
+shell program. If you've ever seen MS-DOS .bat files, you'll know what I mean.
+A typical init script has to load a user interface, unless of course you're
+using your machine as some kind of server, in which case you wouldn't need
+one and could save yourself a bit of memory!
+The text based interface would require the console driver (con.drv), and the
+shell (sh). The console driver is capable of 4 virtual consoles, which you can
+switch between by pressing CBM and 1-4. This lets you exploit multitasking, as
+you could be running 4 different text apps on each of the screens. The shell is
+a pretty basic shell at the moment (like DOS's command.com), but it's enough to
+let you load and run any program. It also has support for pipes, but now I'm off
+topic..
+The init script could instead load the GUI, which I'm sure most people would
+prefer to a text based interface!
+The script also should load other drivers like: tcp/ip, ppp, digi sound driver,
+other filesystems, modem drivers etc... Everything is of course optional, which
+is where Microkernels really excel over their monolithic counterparts.
+Well that's what happens at boot time, but how do the drivers and the
+applications communicate? I've been mentioning "messages", and that's all that
+JOS's IPC is: message passing. Message passing is a fast and effective way to do
+IPC, and for a microkernel this is essential. I chose message passing because
+it's the most flexible method, and you can actually implement other types of IPC
+by using message passing.
+You can think of message passing as an extended subroutine call, but rather than
+being a call to a subroutine, it's a call to another process. A process, or in
+particular a thread, can "send" a message to another thread, the other thread
+"receives" it, and then after it has processed it, "replies" to it.
+You can't just send a message and expect it to be received straight away: the
+receiver has to be ready to receive it, which may not be straight away. If the
+receiver isn't ready, the thread that sent the message will block and wait until
+it's ready. Once the receiver has received it, it processes the message, and
+will issue a reply, which then unblocks the sender, which can then continue
+processing. This type of message passing is called "synchronous" message
+passing, as it requires synchronization between the two threads. It may help to
+think of "sending" as doing a JSR, "receiving" as the Program Counter being
+transferred to the routine, and "replying" as executing an RTS. It's a little
+more complicated than that, but essentially that's what it's like.
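+The send/receive/reply cycle can be imitated with ordinary Python threads. This
+is a toy model under stated assumptions, not the JOS system calls: the Channel
+class and its methods are hypothetical, and a queue stands in for the list of
+senders blocked on the channel.

```python
import queue
import threading

class Channel:
    """Toy channel: a sender stays blocked until the receiver replies."""

    def __init__(self):
        self._inbox = queue.Queue()

    def send(self, message):
        # Like a JSR: hand over the message, then block until the reply arrives.
        reply_box = queue.Queue(maxsize=1)
        self._inbox.put((message, reply_box))
        return reply_box.get()

    def recv(self):
        # Blocks until some sender has a message waiting.
        return self._inbox.get()

    @staticmethod
    def reply(reply_box, value):
        # Like an RTS: unblocks the sender and hands it a result.
        reply_box.put(value)

def server(chan):
    """Message loop: receive, process, reply, until told to quit."""
    while True:
        message, rcv_id = chan.recv()
        if message == "quit":
            Channel.reply(rcv_id, "bye")
            return
        Channel.reply(rcv_id, message.upper())

chan = Channel()
worker = threading.Thread(target=server, args=(chan,))
worker.start()
answer = chan.send("hello")      # blocks until the server replies
farewell = chan.send("quit")
worker.join()
print(answer, farewell)
```

+Each call to send() returns only after the server has replied, which is the
+synchronization the text describes.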
+There is a great description of this kind of IPC at http://www.qnx.com/ in their
+technical section, with diagrams and all -- highly recommended!
+Normally, OSes have to copy messages between processes, because each process
+gets its own address space, and can't view the memory of other processes, but
+as we know, the 65816 doesn't have an MMU so all memory is shared, which means
+that messages don't need to be copied, which gives it a significant speed
+increase over message passing in OSes with MMU's. Of course it does mean that
+processes can accidentally screw up another process's memory, but who cares! :)
+All messages in JOS are directed at Channels. Channels are a resource that allow
+threads to receive messages from other threads. Generally device drivers register
+a channel and use it to receive requests from applications. Channels are
+referred to by number; the only channels that have fixed numbers are the memory
+manager (0) and the process manager (1). All other channels are looked up by
+sending a message to the process manager's channel, i.e. Channel 1.
+What exactly is a message? All the JOS system calls for IPC just deal with 24
+bit pointers to messages, and the actual message data itself can be anything!
+However the first byte of the message should be the message code, and always is
+in JOS system messages. You could of course make your own protocol for your own
+IPC, but it's probably not a good idea.
+Each different kind of driver has its own set of message codes..
+#define PROCMSG	$80
+#define MEMMSG	$40
+#define MMSG_Alloc	0+MEMMSG
+#define MMSG_AllocBA	1+MEMMSG
+#define MMSG_Free	2+MEMMSG
+#define MMSG_Left	3+MEMMSG
+#define MMSG_Large	4+MEMMSG
+#define MMSG_LeftK	5+MEMMSG
+#define MMSG_LargeK	6+MEMMSG
+#define MMSG_KillMem	7+MEMMSG
+#define MMSG_Realloc	8+MEMMSG
+#define PMSG_Spawn	PROCMSG+0
+#define PMSG_AddName	PROCMSG+1
+#define PMSG_ParseFind	PROCMSG+2
+#define PMSG_FindName	PROCMSG+3
+#define PMSG_QueryName	PROCMSG+4
+#define PMSG_Alarm	PROCMSG+5
+#define PMSG_KillChan	PROCMSG+6
+#define PMSG_WaitPID	PROCMSG+7
+Those are the messages defined for the Process manager and Memory manager. Each
+message code defines its own structure, for example the MMSG_Alloc message has
+the structure:
+	.word MMSG_Alloc
+	.word !Size
+	.byte ^Size,0
+The message codes $e0-$ff are left for processes that want their threads to
+communicate with each other.
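+To make the MMSG_Alloc layout above concrete, here is a Python sketch that
+builds the same six bytes. It assumes that "!Size" means the low 16 bits of the
+size and "^Size" its bank byte (bits 16-23), which is how the X and Y loads in
+the send example later in this article appear to use them; the function name is
+made up for illustration.

```python
import struct

MEMMSG = 0x40
MMSG_ALLOC = MEMMSG + 0      # same value as the MMSG_Alloc define above

def build_alloc_message(size):
    """Pack an allocation request: 16-bit code, 24-bit size, one pad byte."""
    if not 0 <= size < 1 << 24:
        raise ValueError("size must fit in 24 bits")
    return struct.pack(
        "<HHBB",             # little-endian, as on the 65816
        MMSG_ALLOC,          # .word MMSG_Alloc
        size & 0xFFFF,       # .word !Size  (assumed: low 16 bits)
        size >> 16,          # .byte ^Size  (assumed: bank byte)
        0,                   # trailing ,0 pad byte
    )

msg = build_alloc_message(0x012345)
print(msg.hex())
```

+Because the code is stored little-endian, the first byte of the message is the
+message code, as the text requires.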
+Anything that wants to receive messages needs to have some code like this:
+		jsr @S_makeChan		; make a channel System call
+		sta Chan		; save it
+loop		lda Chan		;
+		jsr @S_recv		; receive a message from channel
+		stx MsgP		; Save X/Y in MsgP
+		sty MsgP+2		; MsgP is a zero page variable
+		sta RcvID		; Save RcvID - for replying
+		lda [MsgP]
+		and #$ff		; 8 bit message code
+		cmp #MSGCODE		; check which type
+		beq processMes		; and process it
+		cmp #MSGCODE2
+		beq processMes2
+		...
+		ldx #-1			; replying with $ffff in X and Y
+		txy			; means "message not understood"
+		lda RcvID
+		jsr @S_reply		; reply and loop back for more messages
+		bra loop
+All device drivers have a message loop like that, which forces them to be
+modular, and thus easier to code.
+Ok now let's see what sending a message would look like:
+		lda #PROC_CHAN
+		ldx #!Message
+		ldy #^Message
+		jsr @S_sendChan		; Send the message
+		...
+Message		.word PMSG_WaitPID,2	; Wait for PID 2 to finish.
+*note: it's generally a good idea to put messages on the stack, rather than use
+global variables, since using the stack is thread safe. No other thread will
+accidentally wipe over the message because they each have their own stack.
+Just about everything that you consider an OS to be is done in JOS via IPC.
+This includes file operations, such as opening and closing, reading and writing
+files. How does the filesystem driver know which file you want to access after
+you've opened it? It could include a connection number in the IO_READ and
+IO_WRITE messages (you guessed it, the message codes for reading and writing!).
+That's a little cumbersome, though. There is a better solution: connections.
+What is a connection? It's a kernel object which keeps track of the
+destination channel of the messages directed at it. It also has an ID associated
+with it, so server processes can tell which file, for example, it refers to.
+Each process has a so called "file descriptor list" associated with it. People
+who know much about UNIX programming will know about this. In JOS, this table is
+really just a connection table. This table is just an array of connection
+numbers, which the process can access. Each element in the array can point to
+any connection number, which means that two file descriptors can actually point
+to the same file, and in the case of the first three they usually do. The first
+three are STDIN, STDOUT & STDERR, and they usually point to the screen, but not
+always!
+An example File Descriptor list: (0 = no connection)
+      1      2     3     4     5     6     7     8     9 .... 32
+ ------------------------------------------------------------------
+|  1  |   1   |  1  |  2  |  3  |  0  |  0  |  0  |  0  |  0  .... |
+ ------------------------------------------------------------------
+E.G.
+Connection 1 is connected to the /dev/con/1 device (the screen). Thus STDIN,
+STDOUT and STDERR all point to this.
+Connection 2 is connected to a file "/blah.txt" which is on the 1541 filesystem.
+Connection 3 is a tcpip connection to altavista.com.
+Connections are global objects, and whenever a process is loaded, it inherits
+its file descriptor table from the parent, which is how it receives its STDIN,
+STDOUT and STDERR. File descriptors can also be explicitly redirected to other
+connections, or just not inherited at all. This is how JOS performs shell
+redirection.
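+Inheritance and redirection of the file descriptor table can be pictured with a
+small Python sketch. This is a model of the idea only: the function name is
+hypothetical, and the slots are numbered from zero here whereas the diagram
+above counts from one.

```python
STDIN, STDOUT, STDERR = 0, 1, 2      # zero-based slots, unlike the 1-based diagram
FD_SLOTS = 32
NO_CONNECTION = 0

def inherit(parent_fds, redirects=None):
    """Copy the parent's table for a new process, applying any redirections."""
    child_fds = list(parent_fds)
    for fd, connection in (redirects or {}).items():
        child_fds[fd] = connection
    return child_fds

# Table from the example above: connection 1 = console, 2 = a file, 3 = TCP/IP.
shell_fds = [1, 1, 1, 2, 3] + [NO_CONNECTION] * (FD_SLOTS - 5)

# Roughly what redirecting a program's output to the file would involve:
# the child's STDOUT slot is pointed at connection 2 instead of the console.
child_fds = inherit(shell_fds, {STDOUT: 2})
print(child_fds[:5])
```

+The parent's table is left untouched, so the shell keeps writing to the console
+while the child writes to the file.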
+I've discussed JOS's synchronous message passing, but what happens if you don't
+want to block and wait for a reply? You might just want to notify a server that
+an event has occurred, and don't need to know if it received it, nor what it
+thinks about it.
+In this case you can send a pulse. A pulse is a tiny message (just 4 bytes),
+which doesn't require a reply. Probably the best property of pulses is that
+they can be sent during an interrupt. A good example of doing this is the
+console driver, which implements virtual consoles. The console driver starts an
+interrupt routine which scans the keyboard and checks for CBM key plus 1-4 and
+then sends a pulse message to its channel telling it to switch consoles.
+By now you might be thinking "Microkernels must be real slow with all that
+process switching", but the switching code is pretty fast, particularly at
+20 MHz. There isn't as much switching as you would expect either, considering
+that IO_READ and IO_WRITE messages deal with buffers as large as 64k, so it's
+not as if every single character requires a switch.
+	    --------------------------------------------------
+	   |     Device Independence - Everything's a file!   |
+	    --------------------------------------------------
+One of the major things that people who are learning UNIX have to learn is that
+practically everything is a file. Devices such as the keyboard and screen (the
+console) are accessed using a file. Why, you may ask? Well, there isn't one
+compelling reason, but it is handy if you can access the console as a
+file, especially for debugging. Take, for example, the ability to redirect screen
+output to files: a program doesn't have to be explicitly designed for doing that
+if everything is a file, including the console. It's just a simple matter of
+changing the output file.
+Not only are devices files, but filesystems can be "mounted" on any directory,
+which gets rid of the need for device numbers. Navigating through different
+filesystems is just a simple matter of changing directories. It also means that
+applications don't concern themselves with what the actual filesystem and device
+is, just that it's there. So applications will work with any devices that have
+drivers.
+Ok so now you know some of the reasons behind "everything's a file", so how
+is it done in JOS? I mentioned that the process manager is in charge of "looking
+up" channels, but how does it perform this lookup?
+The process manager contains a table with entries for file-systems, devices and
+special processes. File-systems are names that end in a '/', device files
+usually start with '/dev/' and special processes start with '*'. So the table
+may look something like this:
+Name		Channel		Unit
+/		2		1		; file-system mounted at /
+/usr/		2		2		; file-system mounted at /usr/
+*digi		3		0		; digi driver
+*tcpip		4		0		; tcp/ip
+/net/		4		1		; tcp connections
+*cbmfsys	5		0		; the cbm file-system
+*packet		6		0		; the packet driver (ppp/slip)
+/dev/null	1		0		; the process manager handles
+						; this
+The name and channel fields are self-explanatory, but the Unit field allows a
+channel to determine which of its names was used.
+Whenever the process manager receives a request to look something up, depending
+on what type of request it is (special process requests don't), it will prepend
+the process's Current Working Directory to the filename (unless the name starts
+with a '/'), and then parse the name for '.' and '..' directories, which alter
+the string.
+So for example, if you ask for the file "./hello/./../afile.txt" and your CWD is
+"/usr/files/", it would be parsed as:
+"/usr/files/afile.txt"
+This string is then compared to the table to find the longest full match. In
+this case it would find "/usr/" and return channel 2, unit 2, plus the string
+"files/afile.txt", which is what is left over after subtracting the match.
+The great thing about this whole "pathname space" approach is that processes
+don't necessarily need to know what they're dealing with, and pieces of the OS
+can be loaded and unloaded at will for the ultimate in scalability and
+modularity.
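+The lookup described above (prepend the CWD, resolve '.' and '..', then take
+the longest matching table entry) can be sketched in Python. This is a model
+under assumptions, not the process manager's code: the function names are made
+up, the table holds only a few of the entries shown earlier, and special
+process names starting with '*' are left out.

```python
# (name, channel, unit) entries copied from the example table above.
NAME_TABLE = [
    ("/", 2, 1),
    ("/usr/", 2, 2),
    ("/net/", 4, 1),
    ("/dev/null", 1, 0),
]

def normalise(cwd, name):
    """Prepend the CWD to relative names, then resolve '.' and '..'."""
    path = name if name.startswith("/") else cwd + name
    parts = []
    for part in path.split("/"):
        if part in ("", "."):
            continue                 # empty and '.' components change nothing
        if part == "..":
            if parts:
                parts.pop()          # '..' removes the previous component
            continue
        parts.append(part)
    return "/" + "/".join(parts)

def lookup(cwd, name):
    """Return (channel, unit, remainder) for the longest matching entry."""
    path = normalise(cwd, name)
    best = None
    for prefix, channel, unit in NAME_TABLE:
        if prefix.endswith("/"):
            matches = path.startswith(prefix)   # file-system mount point
        else:
            matches = path == prefix            # device file: exact name
        if matches and (best is None or len(prefix) > len(best[0])):
            best = (prefix, channel, unit)
    if best is None:
        return None
    prefix, channel, unit = best
    return channel, unit, path[len(prefix):]

result = lookup("/usr/files/", "./hello/./../afile.txt")
print(result)
```

+For the example in the text this yields channel 2, unit 2 and the leftover
+string "files/afile.txt".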
+You might think that setting up the request and dealing with the responses,
+every time you want to open a file is a bit tiresome, but it's all handled for
+you with the "open" library call.
+		pea O_READ
+		pea ^devcon1
+		pea !devcon1
+		jsr @_open	; returns file number in x or -1 on failure
+		pla
+		pla
+		pla
+		...
+devcon1		.asc "/dev/con/1",0
+That's all for now. In the next article, I'll be writing about process
+loading + shared libraries, networking, terminal IO (console + modems) + some
+other things...
+Hopefully you will have learned something from this article, and can see the
+power that a real multitasking OS, such as JOS, can bring to the SuperCPU.
+Any feedback goes to jmaginni@postoffice.utas.edu.au. I'm particularly on the
+lookout for people who can help with hardware; docs, code etc...
+Also, check the JOS homepage at http://www.jolz64.cjb.net/ and join the JOS
+mailing list if you're interested in updates.
+.......
+....
+..
+.                                    C=H 19
+::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
+VIC KERNAL Disassembly Project - Part III
+Richard Cini
+September 1, 1999
+Introduction
+============
+	In the last installment of this series, we examined the two remaining
+hard-coded processor interrupt vectors, the IRQ and NMI vectors. Although we
+took a complete look at the routines, we did not examine some of the
+subroutines that IRQ and NMI call. We'll examine these routines first.
+	Having completed the main processor vectors, we'll continue this
+series by examining other Kernal routines.
+Remaining Subroutines
+=====================
+	The NMI and IRQ routines together call 11 subroutines, five of which
+we previously examined in Part I of this series, and two call the NMI vectors
+in the BASIC ROM and A0 Option ROM. So, let's examine the four remaining
+subroutines.
+UDTIM/IUDTIM
+------------
+	The IRQ vector calls the update time function UDTIM through the
+jump table at the end of the Kernal ROM, while the NMI function skips the
+intermediate call through the jump table and directly calls the time function.
+UDTIM:
+FFEA 4C 34 F7    JMP IUDTIM			;$F734
+F734   ;==========================================================
+F734   ; IUDTIM - Update Jiffy Clock (internal)
+F734   ;	Called by IRQ; no params; no return
+F734   ;
+F734          IUDTIM
+F734 A2 00    		LDX #$00
+F736 E6 A2     		INC CTIMR2	;bump timer tick
+F738 D0 06     		BNE UDTIM1	;not 0, move on (no roll)
+F73A E6 A1     		INC CTIMR1	;rolled-over, INC next reg
+F73C D0 02     		BNE UDTIM1	;not 0, move on (no roll)
+F73E E6 A0     		INC CTIMR0	;rolled-over, INC next reg
+F740
+F740          UDTIM1			;done updating registers,
+F740					; check for 24hr roll
+F740					; A0-A2 hold max of 4F1A00
+F740 38          	SEC		;set carry
+F741 A5 A2       	LDA CTIMR2	; get LSB
+F743 E9 01       	SBC #$01	; minus 1
+F745 A5 A1       	LDA CTIMR1	;
+F747 E9 1A       	SBC #$1A	; minus 1Ah
+F749 A5 A0       	LDA CTIMR0	;
+F74B E9 4F       	SBC #$4F	; minus 4Fh
+F74D 90 06       	BCC UDTIM2	; ok
+F74F
+F74F 86 A0       	STX CTIMR0	;24-hr roll-over, so reset
+F751 86 A1       	STX CTIMR1	; registers to zero
+F753 86 A2       	STX CTIMR2
+F755
+F755             UDTIM2			;no 24-hr rollover-continue
+F755 AD 2F 91    	LDA D2ORAH	;check for STOP key
+F758 CD 2F 91    	CMP D2ORAH
+F75B D0 F8       	BNE UDTIM2	;not same, check again
+F75D
+F75D 85 91       	STA STKEY	;same, save status and exit
+F75F 60          	RTS
+	UDTIM is called every 1/60th of a second by the IRQ routine, and
+begins execution by incrementing each of the time-keeping registers in
+the Zero Page locations $A0 to $A2. As each is incremented, it is checked
+for roll-over (i.e., for the count exceeding the maximum allowed for the
+register). Taken together, the three consecutive memory locations make up
+the "jiffy clock" (as the VIC's RTC is sometimes referred to; a "jiffy" being
+1/60 of one second).
+	At the label UDTIM1, the code checks for a 24hr roll-over. The three
+byte-sized registers (no pun intended) can store the 24-hour jiffy count
+of 5,184,000 decimal, or 4F1A00 hex. If the count exceeds this value, the
+registers are reset to zero.
+	The BASIC TI function accesses the jiffy clock, representing the
+count as a decimal number. Similarly, the TI$ function represents the jiffy
+clock as a 24-hour HH:MM:SS clock instead of a jiffy count.
+	UDTIM is also responsible for processing the STOP key on behalf of
+the IRQ and NMI routines, so if a user program handles either of these
+interrupts, the programmer must remember to call UDTIM in order to maintain
+the time clock and STOP key functionality.
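+The clock arithmetic in IUDTIM can be checked with a short Python model. This
+is a sketch of the logic in the listing above, not a replacement for it; the
+function names are made up, and the STOP key handling is left out.

```python
JIFFIES_PER_DAY = 24 * 60 * 60 * 60      # 5,184,000 = $4F1A00
ROLLOVER_LIMIT = 0x4F1A01                # value subtracted by the SBC chain

def udtim_step(jiffies):
    """Advance the 24-bit jiffy count by one tick, as IUDTIM does."""
    jiffies = (jiffies + 1) & 0xFFFFFF   # three INCs with carry between bytes
    if jiffies >= ROLLOVER_LIMIT:        # carry still set after the subtraction
        jiffies = 0                      # 24-hour roll-over
    return jiffies

def as_clock(jiffies):
    """Split a jiffy count into (hours, minutes, seconds)."""
    seconds = jiffies // 60
    return seconds // 3600, seconds // 60 % 60, seconds % 60

print(hex(udtim_step(0x4F19FF)), udtim_step(0x4F1A00),
      as_clock(JIFFIES_PER_DAY - 1))
```

+Because the subtraction uses $4F1A01, the count is allowed to reach $4F1A00
+itself and is reset on the following tick.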
+CCOLRAM
+-------
+	This short routine is responsible for determining the location of
+the color RAM. In the VIC, the screen and color memory locations change based
+on the amount of RAM installed, as follows:
+	Function	Unexpanded		Expanded
+	--------	----------		--------
+	User BASIC	$1000 00010000		$1200 00010010
+	Screen Memory	$1E00 00011110		$1000 00010000
+	Color RAM	$9600 10010110		$9400 10010100
+	The two least significant bits of the most-significant byte of each
+of the screen memory and color RAM pointer registers define the resulting
+location. If the bit pattern of the screen memory is "10", the code sets
+the color RAM base to page $96. If the bit pattern is "00", the code sets
+the color RAM base to page $94.
+	The two other possible bit patterns result from screen memory
+beginning at $1100 or $1F00, and produce color RAM locations of $9500
+and $9700, respectively. The $1100 starting location will actually work,
+but result in 256 bytes of wasted user RAM. The $1F00 starting location
+will not work since the color RAM locations overlap the I/O Block 2
+addresses, which have no RAM associated with them.
+EAB2   ;==========================================================
+EAB2   ; CCOLRAM - Calculate pointer to color RAM
+EAB2   ;
+EAB2        CCOLRAM
+EAB2 A5 D1       	LDA LINPTR	;get ptr to screen RAM LSB
+EAB4 85 F3       	STA COLRPT	;save it as color LSB
+EAB6 A5 D2       	LDA LINPTR+1	;get screen RAM MSB
+EAB8 29 03       	AND #%00000011	;mask bits 0-1
+EABA 09 94       	ORA #%10010100	;OR with $94 to get color
+EABA					; RAM pointer
+EABC 85 F4       	STA COLRPT+1	;save as color ptr MSB
+EABE 60          	RTS		;exit
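+The AND/ORA pair in CCOLRAM is easy to verify with a one-line Python model.
+This is only a sketch of the arithmetic on the pointer's high byte; the
+function name is made up for illustration.

```python
def color_ram_msb(screen_ptr_msb):
    """Mirror CCOLRAM: keep bits 0-1 of the screen pointer MSB, OR in $94."""
    return (screen_ptr_msb & 0b00000011) | 0b10010100

# High bytes for the four possible bit patterns discussed above.
for msb in (0x10, 0x11, 0x1E, 0x1F):
    print(hex(msb), "->", hex(color_ram_msb(msb)))
```

+The low byte is copied unchanged, so only the page number needs this mapping.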
+ISCNKY
+------
+	This is the low-level keyboard scan function which is called
+60 times per second by the IRQ routine. ISCNKY scans the keyboard matrix
+to retrieve a keypress, maps the key number to its ASCII equivalent, and
+places the ASCII value at the end of the keyboard buffer. If IRQs are
+disabled, the keyboard scanning is suspended. ISCNKY is accessible to user
+programs through the Kernal jump table, although calling it with interrupts
+enabled is not recommended.
+To retrieve a character from the keyboard, a user program would typically
+call GETIN ($FFE4), the buffered keyboard input routine. GETIN returns
+the ASCII value of the character at the head of the keyboard buffer, or
+zero if no character is available.
+VIA2 is directly connected to the keyboard. Port B is used as the column
+strobe and Port A is used as the row input. To read the keyboard matrix,
+the code brings all column strobe lines to 0 and reads the row inputs, in
+order, until a key is found (or not found). The code also begins decoding
+the ASCII using the "unshifted" decoding table. Three other decoding tables
+are for shifted, C= (Commodore) keys, and shift+C= keys.
+EB1E   ;===========================================================
+EB1E   ; ISCNKY - Scan keyboard
+EB1E   ;	Scans keyboard for character. Called by IRQ routine.
+EB1E   ;  ASCII value placed in keyboard buffer.
+EB1E             ISCNKY
+EB1E A9 00       	LDA #$00	; set shft/ctrl flag to 0
+EB20 8D 8D 02    	STA SHFTFL
+EB23 A0 40       	LDY #$40	; assume no keys pressed
+EB25 84 CB       	STY KEYDN	;  ($40=no keys)
+EB27 8D 20 91    	STA D2ORB	; bring all column bits low
+EB2A AE 21 91    	LDX D2ORA	; read row inputs
+EB2D E0 FF       	CPX #$FF	; any character keys pressed?
+EB2F F0 5E       	BEQ PROCK1A	; no, exit
+EB31 A9 FE       	LDA #%11111110	; begin testing at COL 0
+EB33 8D 20 91    	STA D2ORB	; output bit pattern
+EB36 A0 00       	LDY #$00	; zero character count reg
+					; set default translation
+					; table to Table 1
+EB38 A9 EA       	LDA #$EA 	;FIXUP2+2;#$5E
+EB3A 85 F5       	STA KEYTAB
+EB3C A9 EA       	LDA #$EA 	;FIXUP2+3;#$EC
+EB3E 85 F6       	STA KEYTAB+1
+EB40
+EB40             ISCKLP1		; begin testing loop
+EB40 A2 08       	LDX #$08	; 8 rows to test in column
+EB42 AD 21 91    	LDA D2ORA	; get column
+EB45 CD 21 91    	CMP D2ORA	; test again - debounce
+EB48 D0 F6       	BNE ISCKLP1	; not equal, retry
+EB4A
+EB4A             ISCKLP2		; got bit pattern
+EB4A 4A          	LSR A		; shift through carry flag
+EB4B B0 16       	BCS ISCNK1+3	; CY=1 for key not pressed
+EB4D
+EB4D 48          	PHA		; save column bit pattern
+EB4E B1 F5       	LDA (KEYTAB),Y	; .Y is index into ASCII
+EB4E					;  translation table
+EB50 C9 05       	CMP #$05	; ASCII >= 5, move on
+EB52 B0 0C       	BCS ISCNK1	;  (<5=shft, C=, STOP, CTRL)
+EB54
+EB54 C9 03       	CMP #$03	; ASCII=3 STOP key
+EB56 F0 08       	BEQ ISCNK1	; got STOP so skip flag updt
+EB58
+EB58 0D 8D 02    	ORA SHFTFL	; save SHFT, CTRL, C= flag
+EB5B 8D 8D 02    	STA SHFTFL
+EB5E 10 02       	BPL ISCNK1+2	; move on to next row in col
+EB60
+EB60             ISCNK1
+EB60 84 CB       	STY KEYDN	; save key#
+EB62 68          	PLA		; restore col bit pattern
+EB63 C8          	INY		; increment key count
+EB64 C0 41       	CPY #$41	; 64 keys scanned?
+EB66 B0 09       	BCS ISCNEXIT	; yes, return ASCII value
+EB68
+EB68 CA          	DEX		; go on to next row in col
+EB69 D0 DF       	BNE ISCKLP2	;  {loop}
+EB6B
+EB6B 38          	SEC		; done with first column, so
+EB6C 2E 20 91    	ROL D2ORB	;   move on to next column
+EB6F D0 CF       	BNE ISCKLP1	;  {loop}
+EB71
+EB71             ISCNEXIT		; function evaluation vector
+EB71 6C 8F 02    	JMP (FCEVAL)	; CINT1A points this to SHEVAL
+EB71					; the shift evaluation code
+EB74             ;
+EB74             ; Process key image
+EB74             ;
+EB74             PROCKY
+EB74 A4 CB       	LDY KEYDN	; get key number (as index)
+EB76 B1 F5       	LDA (KEYTAB),Y	; convert key# to ASCII code
+EB78 AA          	TAX		; copy ASCII code to .X
+EB79 C4 C5       	CPY CURKEY	; is it the same as the
+					;  current character?
+EB7B F0 07       	BEQ PROCK1	; yes, do repeat eval
+EB7D
+EB7D A0 10       	LDY #$10	; set repeat delay
+EB7F 8C 8C 02    	STY KRPTDL
+EB82 D0 36       	BNE PROCK4	; not same key, so exit
+EB84
+EB84             PROCK1
+EB84 29 7F       	AND #%01111111	; test for {REVERSE}
+EB86 2C 8A 02    	BIT KEYRPT	; do test
+EB89 30 16       	BMI PROCK2	;  BIT7 set? reverse only
+EB8B 70 49       	BVS PROCK5	;  BIT6 set? alpha or reverse
+EB8D
+EB8D C9 7F       	CMP #$7F	; last non-revs'd character
+EB8F
+EB8F             PROCK1A
+EB8F F0 29       	BEQ PROCK4
+EB91
+EB91 C9 14       	CMP #$14	; {DEL}?
+EB93 F0 0C       	BEQ PROCK2	;  process {DELETE}/INS
+EB95
+EB95 C9 20       	CMP #$20	; {SPACE}?
+EB97 F0 08       	BEQ PROCK2	;  process {SPACE}
+EB99
+EB99 C9 1D       	CMP #$1D	; {CRS RT}?
+EB9B F0 04       	BEQ PROCK2	;  process cursor right/L
+EB9D
+EB9D C9 11       	CMP #$11	; {CRS DN}?
+EB9F D0 35       	BNE PROCK5	;  process cursor down/U
+EBA1
+EBA1             PROCK2
+EBA1 AC 8C 02    	LDY KRPTDL	; get repeat delay
+EBA4 F0 05       	BEQ PROCK3	;  if 0, check repeat speed
+EBA6
+EBA6 CE 8C 02    	DEC KRPTDL	; not done delaying, so exit
+EBA9 D0 2B       	BNE PROCK5	;  {exit}
+EBAB
+EBAB             PROCK3
+EBAB CE 8B 02    	DEC KRPTSP	; decrement repeat speed cnt
+EBAE D0 26       	BNE PROCK5	; not done delaying, so exit
+EBB0
+EBB0 A0 04       	LDY #$04	; delay speed cnt reached 0,
+					;  so reset speed count
+EBB2 8C 8B 02    	STY KRPTSP	; save it
+EBB5 A4 C6       	LDY KEYCNT	; get count of keys in kbd
+					;  buffer
+EBB7 88          	DEY		; at least one, so exit
+EBB8 10 1C       	BPL PROCK5	;  {exit}
+EBBA
+EBBA             PROCK4
+EBBA A4 CB       	LDY KEYDN	; get current key number
+EBBC 84 C5       	STY CURKEY	; re-save as current
+EBBE AC 8D 02    	LDY SHFTFL	; get current shift pattern
+EBC1 8C 8E 02    	STY LSSHFT	; save as last shft pattern
+EBC4 E0 FF       	CPX #$FF	; re-check for any keys down
+EBC6 F0 0E       	BEQ PROCK5	; none, so exit
+EBC8
+EBC8 8A          	TXA		; restore ASCII code to .A
+EBC9 A6 C6       	LDX KEYCNT	; get count of keys in buffer
+EBCB EC 89 02    	CPX KBMAXL	; more than maximum allowed?
+EBCE B0 06       	BCS PROCK5	; yes, drop current key press
+EBD0
+EBD0 9D 77 02    	STA KBUFFR,X	; save ASCII code in buffer
+EBD3 E8          	INX		; increment buffer count and
+EBD4 86 C6       	STX KEYCNT	;   save it
+EBD6
+EBD6             PROCK5
+EBD6 A9 F7       	LDA #$F7	; clear bit for COL3 (STOP key
+EBD8 8D 20 91    	STA D2ORB	; is in COL3); save it to VIA
+EBDB 60          	RTS		; exit routine
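The repeat logic in PROCK2 and PROCK3 is easier to follow with the branches flattened out. What follows is a small Python model (not Kernal code) of one keyboard scan while a repeat-eligible key stays held down. The function name and the dictionary fields are made up for illustration, and the starting value of 4 for KRPTSP is an assumption, since the listing only shows the value it is reloaded with.

```python
def held_key_scan(state, keys_in_buffer):
    """Model one pass through PROCK2/PROCK3 for a key that is still held.

    state holds 'delay' (KRPTDL) and 'speed' (KRPTSP).
    Returns True when the key would be queued again (fall into PROCK4).
    """
    if state["delay"] != 0:                        # LDY KRPTDL / BEQ PROCK3
        state["delay"] -= 1                        # DEC KRPTDL
        if state["delay"] != 0:                    # BNE PROCK5
            return False
    state["speed"] = (state["speed"] - 1) & 0xFF   # DEC KRPTSP (8-bit wrap)
    if state["speed"] != 0:                        # BNE PROCK5
        return False
    state["speed"] = 4                             # LDY #$04 / STY KRPTSP
    return keys_in_buffer == 0                     # DEY / BPL PROCK5


# Delay as loaded at $EB7D; the initial speed count is assumed.
state = {"delay": 0x10, "speed": 4}
repeats = [scan for scan in range(1, 30) if held_key_scan(state, 0)]
print(repeats)
```

Under those assumptions, and with an empty keyboard buffer, the first repeat lands on the 19th scan and later repeats follow every fourth scan.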
+	Part of the keyboard scanning includes evaluating whether or not
+key modifier keys are pressed. Modifier keys include the SHIFT, Commodore,
+and CTRL keys. The ASCII decoding table is changed based on whether or not
+one of these keys is pressed. It also looks like the following code went
+through several revisions considering the multiple patch areas (filled with
+NOPs). Alternatively, these areas could support alternate decoding schemes
+for different languages.
+EBDC             ;
+EBDC             ; Evaluate for shift/CTRL/Commodore keys
+EBDC             ;
+EBDC             SHEVAL
+EBDC AD 8D 02    	LDA SHFTFL	; 1=SHFT; 2=C> 4=CTRL
+EBDF C9 03       	CMP #$03	; C> + shft?
+EBE1 D0 2C       	BNE PROCK6A	; no, select proper decode
+EBE3             			;  table
+EBE3 CD 8E 02    	CMP LSSHFT	; is the pattern the same as
+EBE6 F0 EE       	BEQ PROCK5	; last one? Yes, exit.
+EBE8
+EBE8 AD 91 02    	LDA SHMODE	; different pattern
+EBEB 30 56       	BMI PROCKEX	;  {exit}
+EBED
+EBED EAEAEAEAEAEA	.db $ea, $ea, $ea, $ea, $ea, $ea, $ea, $ea
+EBF3 EAEA
+EBF5 EAEAEAEAEAEA	.db $ea, $ea, $ea, $ea, $ea, $ea, $ea, $ea
+EBFB EAEA
+EBFD EA EA EA    	.db $ea, $ea, $ea
+EC00
+EC00 AD 05 90    	LDA VRSTRT	; get char ROM address
+EC03 49 02       	EOR #%00000010	; flip between L/C and U/C
+EC05 8D 05 90    	STA VRSTRT	;  ROMs
+EC08
+EC08 EA EA EA EA 	.db $ea, $ea, $ea, $ea
+EC0C
+EC0C             PROCK6			; proper ROM is set, so go
+EC0C 4C 43 EC    	JMP PROCKEX	;  on with key image process
+EC0F
+EC0F             PROCK6A		; define correct decode table
+EC0F 0A          	ASL A		; multiply index by 2
+EC10 C9 08       	CMP #$08	; >= 8 (5 entries)?
+EC12 90 04       	BCC $+6		; no, continue
+EC14
+EC14 A9 06       	LDA #$06	; yes, assume CTRL table
+EC16
+EC16 EAEAEAEAEAEA	.db $ea, $ea, $ea, $ea, $ea, $ea, $ea, $ea
+EC1C EAEA
+EC1E EAEAEAEAEAEA	.db $ea, $ea, $ea, $ea, $ea, $ea, $ea, $ea
+EC24 EAEA
+EC26 EAEAEAEAEAEA	.db $ea, $ea, $ea, $ea, $ea, $ea, $ea, $ea
+EC2C EAEA
+EC2E EAEAEAEAEAEA	.db $ea, $ea, $ea, $ea, $ea, $ea, $ea, $ea
+EC34 EAEA
+EC36 EA EA       	.db $ea, $ea
+EC38
+EC38 AA          	TAX		; reset pointer to point
+EC39 BD 46 EC    	LDA KDECOD,X	;  at right decoding table
+EC3C 85 F5       	STA KEYTAB	;  .A is table index
+EC3E BD 47 EC    	LDA KDECOD+1,X
+EC41 85 F6       	STA KEYTAB+1
+EC43
+EC43             PROCKEX
+EC43 4C 74 EB    	JMP PROCKY	; continue processing image
+EC46
+EC46		;========================================================
+EC46		; KDECOD - Pointers to keyboard decode tables
+EC46		;
+EC46             KDECOD
+EC46 5E EC       	.dw KDECD1		;$EC5E Unshifted
+EC48 9F EC       	.dw KDECD2		;$EC9F Shifted
+EC4A E0 EC       	.dw KDECD3		;$ECE0 Commodore
+EC4C A3 ED       	.dw KDECD5		;$EDA3 Control
+EC4E 5E EC       	.dw KDECD1		;$EC5E Unshifted
+EC50 9F EC       	.dw KDECD2		;$EC9F Shifted
+EC52 69 ED       	.dw KDECD4		;$ED69 Decode
+EC54 A3 ED       	.dw KDECD5		;$EDA3 Control
+EC56 21 ED       	.dw GRTXTF		;$ED21 Graphics/text control
+EC58 69 ED       	.dw KDECD4		;$ED69 Decode
+EC5A 69 ED       	.dw KDECD4		;$ED69 Decode
+EC5C A3 ED       	.dw KDECD5		;$EDA3 Control
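As a rough illustration of how PROCK6A turns the shift flags into a table pointer, here is the same arithmetic in Python. The function name is hypothetical, and only the first four KDECOD entries are modelled, since those are the only ones the clamped index can reach.

```python
# Table addresses copied from the first four KDECOD entries in the listing:
# unshifted, shifted, Commodore, control.
KDECOD = [0xEC5E, 0xEC9F, 0xECE0, 0xEDA3]


def select_decode_table(shftfl):
    """Return the decode-table address PROCK6A would store in KEYTAB."""
    if shftfl == 3:
        # SHIFT + Commodore never reaches PROCK6A; SHEVAL handles it first.
        raise ValueError("SHIFT+Commodore is handled before PROCK6A")
    offset = (shftfl << 1) & 0xFF    # ASL A: two bytes per table pointer
    if offset >= 8:                  # CMP #$08 / BCC
        offset = 6                   # LDA #$06: fall back to the control table
    return KDECOD[offset // 2]


for flags in (0, 1, 2, 4, 5):
    print(flags, hex(select_decode_table(flags)))
```

Any flag pattern with the CTRL bit set therefore ends up at the control table, whatever the other bits are.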
+	Now, let's look at a few very simple routines just so that we can
+check them off of the list:
+IIOBASE
+=======
+	IIOBASE is the internal label behind the Kernal IOBASE function.
+Calling IOBASE results in code execution being transferred to IIOBASE:
+IOBASE:
+FFF3 4C 00 E5    	JMP IIOBASE		;$E500 IOBASE
+	IOBASE returns the address of the beginning of the I/O region of
+the VIC memory map in the .X and .Y registers. Locations $9110 to $912F are
+the addresses reserved for the VIC's two 6522 VIAs. This is the first routine
+in the Kernal ROM.
+	The value of this function in the VIC is questionable since there
+is no way to change the address at which the VIAs appear, and interestingly,
+the Kernal code does not call IOBASE at all. The Kernal instead relies on
+hard-coded addresses.
+	However, one could conclude that the actual location of the VIAs
+in the VIC's address space changed during the Kernal development process,
+so IOBASE was somehow used to normalize the address. This also enabled code
+portability between the VIC and the C64.
+	The BASIC ROM appears to call IOBASE in the RND function. The
+existence of other calls is unknown at this time since the BASIC ROM has
+yet to be disassembled.
+E500	;==========================================================
+E500	; IIOBASE - Return I/O base address
+E500	;	Returns the IO Base address in .X(LSB) and .Y(MSB)
+E500           IIOBASE
+E500 A2 10       	LDX #$10	;return $9110 as IO Base
+E502 A0 91       	LDY #$91
+E504 60          	RTS
+ISCREN
+======
+	ISCREN is the internal label behind the Kernal SCREEN function.
+Calling SCREEN results in code execution being transferred to ISCREN:
+SCREEN:
+FFED 4C 05 E5    	JMP ISCREN	;$E505 SCREEN
+E505 ;==========================================================
+E505 ; ISCREN - Return screen organization
+E505 ;	Returns the screen organization .X(columns) and .Y(rows)
+E505 ;
+E505           ISCREN
+E505 A2 16       	LDX #$16       ;return 22 cols x 23 rows
+E507 A0 17       	LDY #$17
+E509 60          	RTS
+	This code returns the row and column organization of the screen in
+the .X and .Y registers. It doesn't appear that the Kernal calls this
+function to determine the screen size, instead relying on hard-coded
+values under the assumption that the screen is 22x23. So, this function's
+utility appears to be purely for the benefit of user code.
+IPLOT
+=====
+	IPLOT is the internal label behind the Kernal PLOT function.
+Calling PLOT results in code execution being transferred to IPLOT:
+PLOT:
+FFF0 4C 0A E5    	JMP IPLOT		;$E50A
+E50A	;===============================================================
+E50A	; IPLOT - Read/set cursor position
+E50A	; On entry:  SEC to read cursor position to .X(row) and .Y(col)
+E50A	;            CLC to save cursor position from .X(row) and .Y(col)
+E50A	;
+E50A           IPLOT
+E50A B0 07       	BCS READPL	;carry set? yes, read position
+E50C 86 D6       	STX CURROW	;save row...
+E50E 84 D3       	STY CSRIDX	;...and column
+E510 20 87 E5    	JSR SCNPTR	;update position
+E513
+E513           READPL
+E513 A6 D6       	LDX CURROW	;return row...
+E515 A4 D3       	LDY CSRIDX	;...and column
+E517 60          	RTS
+	The Kernal again does not call this function, instead managing cursor
+movement by changing the values of the current row and current cursor index
+(i.e., the cursor's position in the row). Upon storing the new cursor
+location, the code commits the changes by jumping to an internal routine
+in CINT1 which is responsible for moving the cursor block in screen memory.
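The carry-flag convention of PLOT can be summed up in a few lines of Python. This is only a toy model: the class and method names are invented, and the SCNPTR call that updates the screen pointers is reduced to a comment.

```python
class Cursor:
    """Toy stand-in for the CURROW ($D6) and CSRIDX ($D3) zero-page bytes."""

    def __init__(self):
        self.row = 0
        self.col = 0

    def plot(self, carry, x=0, y=0):
        """Mimic IPLOT: carry clear stores .X/.Y first; both paths return
        the current position as (row, column)."""
        if not carry:
            self.row, self.col = x, y    # STX CURROW / STY CSRIDX
            # the real routine calls SCNPTR here to update the screen pointers
        return self.row, self.col        # LDX CURROW / LDY CSRIDX


cursor = Cursor()
cursor.plot(False, 5, 10)                # CLC: set row 5, column 10
print(cursor.plot(True))                 # SEC: read the position back
```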
+Conclusion
+==========
+	In this installment, we examined several routines, two of which
+are integral to the operation of the VIC. The Jiffy clock routine also
+scans the STOP key, which is important to overall usability and the ability
+to halt a program. The second routine, SCNKEY, is responsible for scanning
+the keyboard matrix. That's pretty important, too.
+	Next time, we'll examine more routines in the VIC's KERNAL, including
+I/O routines.
+.......
+....
+..
+.                                    C=H 19
+::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
+JPEG: Decoding and Rendering on a C64
+------------------------------------- Stephen Judd
+						<sjudd@ffd2.com>
+				      Adrian Gonzalez
+						<adrianglz@globalpc.net>
+	In the C64 world there are a disturbing number of cases where
+people have said, "It can't be done on a C64."  This goes on for a while
+until someone actually takes a look at the task and its requirements,
+and says "Not only can it be done, but it can be done easily."  JPEG is
+one such case.
+	This article is divided into two parts.  In part 1, I discuss
+JPEGs and the decoding process.  The primary focus is on several important
+issues not covered well, if at all, in existing documentation, especially
+the IDCT; the article also covers the principles of decoding JPEGs and
+JFIF files.
+	In part 2, Adrian discusses Floyd-Steinberg dithering, and how it
+can be applied to various C64 graphics modes (and how it can be used to
+display jpegs!).  In both articles the actual C64 code and algorithms will
+of course be discussed, and the source code is available at
+	http://www.ffd2.com/fridge/jpeg
+for both the decoder and the renderer.
+	The decoder is about 4k of code, the renderer is around 2k, and
+there are about 9k of tables.  With the grayscale versions, there is
+ample memory left over.  With the color IFLI versions, memory is extremely
+tight -- there are 32k of graphics, six 24-bit image buffers.  The Huffman
+trees are stored in the screen RAM area.  The renderer crams all the data
+into the graphics area, which is why you see garbage while the image is
+rendering.  There are a few tens of bytes free in page 0, probably 100-200
+bytes free in page 1, and a few tens of bytes free in page 2, and that's it!
+Everything else just kind-of barely/exactly fits, and then only for
+'typical' jpegs.
+	Finally, Errol Smith deserves a special mention as the guy who first
+tracked down some decent JPEG documentation.  Errol pointed me in the right
+direction and within a few weeks we had JPEGs on a 64.
+------
+Part I: Decoding jpegs
+------
+	Decoding jpegs is a fairly straightforward process, and in
+recent years some free documentation has become available.  This
+article is meant to complement that documentation, by filling in
+some of the gaps and detailing some of the broader issues, not to
+mention some specific implementation issues.  The first part of this
+article covers general jpeg issues: encoding/decoding, Huffman tree
+storage, Fourier transforms, JFIF files, and so on.  The second part
+covers implementation issues more specific to the C64.
+	There are several sources of JPEG documentation online and in
+the library.  Out of all of them, I found three that were particularly
+useful:
+	Cryx's jpeg writeup at http://www.wotsit.org
+	ftp://ftp.uu.net/graphics/jpeg/wallace.ps.gz, an updated
+		article from one which appeared in the April 1991
+		"Communications of the ACM" (v34 no.4).
+	"JPEG Still Image Data Compression Standard" by William B. Pennebaker
+	 and Joan L. Mitchell, published by Van Nostrand Reinhold, 1993,
+	 ISBN 0-442-01272-1.
+The first, Cryx's writeup, is a programmer's description of JPEG files, so
+it has good, detailed descriptions of the encoding/decoding process and
+the file structure/organization, including a list of all the JFIF segments
+and markers.  The second reference is also excellent, and explains most of
+the basic principles of JPEGs, the how's and why's of the standard, and has
+some helpful examples.  The third reference (the book) is very comprehensive,
+but is written in a way which I feel tends to obscure the important points.
+Nevertheless, it has an entire chapter on the discrete cosine transform and
+several fast DCT algorithms, which is invaluable.  As an additional source
+of information, some people might find the IJG's cjpeg/djpeg source code
+helpful.
+JPEG Encoding/Decoding
+----------------------
+	It's really simple, folks.
+	Start with a grayscale image and divide it up into 8x8 pixel blocks
+(just like a C64 bitmap).  The first block is the upper-left corner of the
+image; the second block is to the right of the first block, and so on until
+the end of the row is reached, at which point the next row begins.
+	The next step is to take the two dimensional discrete cosine
+transform of each 8x8 component, and filter out the small-amplitude
+frequencies.  This will be explained in detail later, but the net result
+is that you are left with a lot of zeros in the 64-byte data block, and
+a few nonzero elements from which you can reconstruct the main features
+of the image.  This filtering process is called the "quantization" step.
+	The next step is to RLE-encode the resulting 8x8 block (since most
+of the components are zero), and finally to Huffman-encode the RLE-encoded
+data.  And that's it.  Done.  Finished.  Repeat Until Done.
+	Color pictures are similar, but now each pixel has an 8-bit R, G,
+and B value, for a total of 24 bits per pixel (not quite like a C64
+bitmap...), so there will be three 8x8 blocks.  The RGB values are
+converted to luminance/chrominance values (RGB -> YCrCb), but what's
+important is that for each 8x8 section of a color image there are three
+64-byte blocks of data, and each block is encoded as above.
+	So to summarize: transform the data, filter ("quantize") the
+transformed data, and RLE-encode and Huffman-encode the result.  Do this
+for each component, and then move on to the next 8x8 block.  Therefore,
+to decode the image data:
+	read in the bits,
+	find the Huffman code,
+	unpack the RLE,
+	de-quantize the data,
+	and perform the inverse transform,
+for each 8x8 block of image data to be plotted to the screen.  Repeat
+until done.
+	It turns out that there are other methods of JPEG compression
+in the standard, such as arithmetic compression, but this is rarely
+supported due to legal reasons (lame software patent owned by IBM, AT&T,
+and Mitsubishi), and it doesn't seem to offer substantial compression
+gains.  There are also different types of jpegs, most importantly
+"baseline" or sequential jpegs, and "progressive" jpegs.  In
+a progressive jpeg the image is stored in a series of "scans" which go
+from lower to higher resolution.  I'll be focusing on baseline jpegs
+(which are more common).
+	Finally, it turns out that an 8x8 block of image data doesn't
+have to correspond to an 8x8 block of pixels.  For example, each byte
+of data might represent an average of a 2x2 block of pixels, so an 8x8
+block of data might expand to a 16x16 block of pixels.  In a JPEG
+the "sampling factor" determines how to expand an 8x8 block of data.
+You can see that this can offer substantial compression gains, but will
+coarsen the data; on the other hand, if the data is already coarse, it's
+a way of getting a whole lot for nothing.  Most color jpegs use one-to-one
+pixel mapping for the luminance, and one-to-four (each data byte = 2x2 pixel
+block) mapping for the two chrominance components.  From an implementation
+standpoint, this means that a decoder typically decodes 16 scanlines at a
+time (16x16 pixel chunks).  For more details, see Cryx's document.
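To make the one-to-four mapping concrete, here is a minimal Python sketch that expands an 8x8 block of data into 16x16 pixels by plain replication. The function name is hypothetical, and replication is only the simplest possible choice; a decoder is free to interpolate instead.

```python
def upsample_2x2(block):
    """Expand an 8x8 block so each sample covers a 2x2 square of pixels."""
    pixels = []
    for row in block:
        wide = [sample for sample in row for _ in range(2)]  # double horizontally
        pixels.append(wide)
        pixels.append(list(wide))                            # double vertically
    return pixels


chroma = [[r * 8 + c for c in range(8)] for r in range(8)]
expanded = upsample_2x2(chroma)
print(len(expanded), len(expanded[0]))
```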
+	Before a JPEG can be decoded, though, the decoder needs a fair
+amount of information, such as the Huffman trees used, the quantization
+tables used, information about the image such as its size, whether it's
+a color or a grayscale image, and so on.  In a JPEG file, all information
+is stored in "segments".
+Segments
+--------
+A JPEG segment looks like the following:
+	[header]	Two bytes, starting with $FF
+	[length]	Two bytes, in hi/lo order (not usual 6502 lo/hi)
+	[data]		Segment data
+A list of JPEG (and JFIF) headers can be found in Cryx's document.
+Let's have a look at a hex dump of a jpeg file (from unix, use
+"od -tx1 file.jpg | more"):
+0000000  ff d8 ff e0 00 10 4a 46 49 46 00 01 01 01 00 48
+0000020  00 48 00 00 ff fe 00 17 43 72 65 61 74 65 64 20
+0000040  77 69 74 68 20 54 68 65 20 47 49 4d 50 ff db 00
+The first two bytes are $ff $d8 -- these two bytes identify the
+file as a jpeg.  All jpegs start with ff d8.
+	Next we encounter the header ff e0.  ff e0 is a special header
+which identifies this file as a JFIF file.  It turns out that in the
+original JPEG standard a specific file format is not given; this
+in turn led to different companies using their own formats, to try and
+establish the "standard".  The JFIF format was put forward to remedy
+this problem, and is the de-facto standard -- but more on this later.
+	In a JFIF file, the JFIF segment always follows the JPEG ID bytes.
+You can see that it is length 16, and that this length includes the two
+length bytes.  Immediately following the length bytes are the four letters
+J F I F and the number 0; following that are some bytes for revision numbers,
+the x/y densities, and some thumbnail info.
+	The next segment starts with the header ff fe.  This is the
+"comment" header; the length is $17 bytes.  Following the length bytes
+are the ascii codes for "Created with The GIMP", a popular image
+processing program.  The next header is ff db, which is the "Define
+Quantization Table" header.  And on it goes, until the actual image
+data -- a stream of Huffman-encoded bits -- is reached.
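The segment layout is simple enough to walk with a few lines of Python. The sketch below feeds the 48 bytes from the hex dump above into a hypothetical walk_segments helper. It only handles the length-prefixed header segments; the Huffman-encoded image data that follows them needs separate treatment.

```python
# The three lines of the hex dump above.
DUMP = bytes.fromhex(
    "ff d8 ff e0 00 10 4a 46 49 46 00 01 01 01 00 48"
    "00 48 00 00 ff fe 00 17 43 72 65 61 74 65 64 20"
    "77 69 74 68 20 54 68 65 20 47 49 4d 50 ff db 00"
)


def walk_segments(data):
    """Yield (marker, payload) for each complete segment after ff d8."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing ff d8")
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        length = (data[pos + 2] << 8) | data[pos + 3]   # hi/lo order
        end = pos + 2 + length          # the length counts its own two bytes
        if end > len(data):
            break                       # truncated segment: stop here
        yield marker, data[pos + 4:end]
        pos = end


for marker, payload in walk_segments(DUMP):
    print("ff %02x" % marker, len(payload), payload)
```

The walk stops at ff db because the dump ends before that segment's length is complete.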
+Huffman Decoding
+----------------
+	If you don't know anything about Huffman decoding, then I suggest
+you read Pasi's nice article in C=Hacking #16, which has a nice example.
+Briefly, a Huffman tree is a binary tree whose left and right branches
+correspond to bits 0 and 1 respectively; starting from the top of the
+tree, you read bits and move left or right accordingly until a leaf
+is reached, containing the Huffman code value.  Then you start over again
+at the top of the tree and decode the next Huffman code.
+	In a JPEG, Huffman trees are stored in "Define Huffman Table" (DHT)
+segments (header = ff c4):
+0000300                                ff c4 00 1c 00 00
+0000320  01 05 01 01 01 00 00 00 00 00 00 00 00 00 00 03
+0000340  01 02 04 05 06 00 07 08
+The first byte in the DHT segment (00) is an ID byte -- JPEGs can have up to
+eight Huffman trees.  This is then followed by 16 bytes, where each byte
+represents the number of Huffman codes of lengths 1, 2, 3, ..., up to
+length 16, followed by the Huffman code values. In the above example, there
+are 0 codes of length 1, 1 code of length 2, 5 codes of length 3, and so
+on.  Following these 16 bytes are the Huffman values: 3, 1, 2, 4, ..., 8.
+But what are the Huffman codes corresponding to those values?
+	It turns out that these trees are so-called "canonical Huffman trees",
+and work as follows: to get the next code, add 1 to the current code.
+When the length increases, add 1 and shift everything left.  The exception
+is that you don't increment until the first code is defined, so the first
+code is always zeroes.
+	For example, to decode the above DHT segment, start with Huffman
+code = 0.  There are no codes of length 1, so we shift it left to get
+code = 00 (and don't add 1 because the first code hasn't been defined yet).
+There is one code of length 2, so we read the first Huffman value and
+assign it to the current code
+	Code	Value
+	00	  3
+That's the only code of length two, so now we move to length 3 by incrementing
+and shifting: code = 010.  There are five values of length 3, and the next
+five Huffman values are 1, 2, 4, 5, 6, so the Huffman tree is now
+	Code	Value
+	00	  3
+	010	  1
+	011	  2
+	100	  4
+	101	  5
+	110	  6
+and the rest of the Huffman tree is given by
+	1110	  0
+	11110	  7
+	111110	  8
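The code assignment just described fits in a short loop. Here is a Python sketch of it, using the 16 length counts and the nine Huffman values from the DHT dump above; the function name is made up for illustration.

```python
def canonical_codes(counts, values):
    """Assign canonical Huffman codes.

    counts[i] is the number of codes of length i+1 (16 entries in a DHT
    segment); values lists the Huffman values in segment order.
    Returns a list of (bit string, value) pairs.
    """
    table = []
    code = 0
    remaining = iter(values)
    for length, count in enumerate(counts, start=1):
        for _ in range(count):
            bits = format(code, "0{}b".format(length))
            table.append((bits, next(remaining)))
            code += 1          # next code of the same length
        code <<= 1             # moving on to the next length: shift left
    return table


COUNTS = [0, 1, 5, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
VALUES = [3, 1, 2, 4, 5, 6, 0, 7, 8]
for bits, value in canonical_codes(COUNTS, VALUES):
    print(bits, value)
```

Incrementing after every assignment and shifting at every change of length is the same rule as "add 1, and shift when the length increases"; the first code simply stays at zero until something is assigned to it.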
+What's the best way to implement a Huffman tree?
+The most obvious way is to use five bytes per "node", i.e.
+	left pointer	(2 bytes)
+	right pointer	(2 bytes)
+	value		(1 byte)
+where the left and right pointers are just offsets to be added to the
+current pointer, and if left = right = $FFxx then this is a leaf.  If you
+fetch a bit that says "go left", and the left pointer = $FFxx (but right
+pointer is valid) then you've hit an invalid Huffman code -- i.e. decoding
+error.  This five-byte method is used in jpx (grayscale decoder).
+	But there is another rather cool method, first described to me by
+Errol Smith, which uses only two bytes per node.  Now, the five-byte method
+works fine in jpx, but in the full-color IFLI jpz code -- well, suddenly
+memory becomes extremely tight, and without this routine jpz probably
+wouldn't have happened on a stock machine.  The routine is also very
+efficient, especially if implemented using 16-bit 65816 code.
+	The trick is simply to organize the tree such that if the current
+node is at location NODE, then the left node is at NODE+2 and the right
+node is at NODE+(NODE).  Leaf nodes can be indicated by e.g. setting the
+high bit.  So the decoding process is:
+	get next bit
+	if 0 then pointer = pointer + 2
+	if 1 then pointer = pointer + node value
+	if high byte of node value < $80 then loop
+For example, the first part of the earlier Huffman tree
+	00	  3
+	010	  1
+	011	  2
+	100	  4
+would be encoded as
+d 00 04 00 03 80 04 00 01 80 02 80 00 00 00 00 04 80
+-----|-----|-----|-----|-----|-----|-----|-----|-----|
+Try decoding the Huffman values, using the above algorithm.
+Astute readers may ask the question: won't you decode incorrectly if
+there is no left node?  Even more astute readers can answer it: in a
+canonical Huffman tree, the only nodes without left-node pointers are
+leaves.
+	To see this, consider a counterexample: a tree that looks like
+		o
+	       /
+	      o
+               \
+		o
+This corresponds to Huffman code 01 -- one move left, one move right.
+In a canonical Huffman tree, the only way to generate the code 01 is to
+increment the code 00; since code 00 has already occurred, there must be
+a left-node.  In a canonical Huffman tree, you always create a left-node
+before creating a right-node.  So error checking this kind of tree amounts
+to checking the right-pointer; the only nodes without left-pointers are leaves.
+Moreover, since left-nodes are always created first, you can add nodes in
+the order they are created -- you never have to insert nodes between
+existing nodes.
+	Pretty nifty, eh?
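For anyone who wants to experiment with the two-byte layout, here is a Python sketch of building and walking such a tree. It is not the jpx/jpz code: the names are invented, nodes are kept as a list of 16-bit integers (so byte offset = 2 * index), and a node value of zero is assumed to mean "no right child yet".

```python
LEAF = 0x8000   # high bit marks a leaf; the low byte holds the Huffman value

TABLE = [("00", 3), ("010", 1), ("011", 2), ("100", 4), ("101", 5),
         ("110", 6), ("1110", 0), ("11110", 7), ("111110", 8)]


def build_tree(table):
    """Build the compact tree from (bit string, value) pairs in canonical order.

    The left child of nodes[i] is nodes[i + 1]; the right child is
    nodes[i + nodes[i] // 2], because nodes[i] is a byte offset.
    """
    nodes = [0]
    for bits, value in table:
        i = 0
        for bit in bits:
            if bit == "0":
                i += 1                           # left child: the next node
                if i == len(nodes):
                    nodes.append(0)
            else:
                if nodes[i] == 0:                # no right child yet: append one
                    nodes[i] = 2 * (len(nodes) - i)
                    nodes.append(0)
                i += nodes[i] // 2
        nodes[i] = LEAF | value
    return nodes


def decode_symbol(nodes, bits):
    """Follow bits (an iterable of 0/1) from the root until a leaf is hit."""
    i = 0
    for bit in bits:
        if bit == 0:
            i += 1
        elif nodes[i] == 0:
            raise ValueError("invalid Huffman code")
        else:
            i += nodes[i] // 2
        if nodes[i] & LEAF:
            return nodes[i] & 0xFF
    raise ValueError("ran out of bits")


tree = build_tree(TABLE)
print(decode_symbol(tree, [1, 1, 1, 0]))
```

Because new nodes are only ever appended, the builder never has to move existing entries, which is the property the canonical ordering guarantees.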
+Restart Markers
+---------------
+	The image data in a jpeg is a stream of Huffman-encoded bits.
+The jpeg standard allows for "restart markers" to be periodically inserted
+into the stream.  Thus a decoder needs to keep count of how far it is
+in the data stream, and periodically re-synchronize the bitstream.  So
+far so good -- this is explained in detail in Cryx's document.
+	What _isn't_ explained is that the restart markers do not merely
+re-synchronize the data stream, but when a restart marker is hit the DC
+coefficients need to be reset to zero.  That is, it really does "restart"
+the decoder.
+	What's a DC coefficient, you may ask?  It's the very first element
+in the 8x8 array, and instead of encoding the actual value a jpeg encodes
+the _offset_ from the previous value.  That is, the decoded DC element is
+added to the current DC value to get the new value.  That value needs
+to be reset to zero when a restart marker is hit.
+	Most jpegs do not use restart markers, but unless you reset the
+coefficient you're going to spend a few months wondering why Photoshop images
+don't decode correctly.
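The DC bookkeeping amounts to a running sum that is cleared at every restart marker. A minimal Python sketch, assuming a single image component and using a made-up RESTART token to stand in for a marker found in the stream:

```python
RESTART = "RST"   # stand-in for a restart marker seen in the data stream


def dc_values(events):
    """Turn decoded DC differences into absolute DC values.

    events is a list of integers (one decoded DC difference per block)
    with RESTART entries wherever a restart marker was found.
    """
    dc = 0
    out = []
    for event in events:
        if event == RESTART:
            dc = 0                 # the running DC value starts over
        else:
            dc += event            # each block stores the offset from the last DC
            out.append(dc)
    return out


print(dc_values([10, -2, 5, RESTART, 7, 1]))
```

A color decoder would keep one such running value per component and clear all of them at a restart.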
+	Why is it called the DC coefficient?  You'll have to read the section
+on Fourier transforms for the answer.
+	Note also that when the byte $FF is encountered in the data stream
+it must be skipped; the exception is if it is immediately followed by a 00,
+in which case $FF00 represents the value $FF.  Why do I bring this up?
+Because Cryx's document could be interpreted by naive people like myself
+as saying this is true throughout a jpeg file, and it's only true within
+the image data -- that in other segments, $FF is a perfectly valid byte.
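Here is one way the $FF rule might look in Python, applied to image data only. The function name is hypothetical, and pulling every "ff xx" pair other than ff 00 out of the data as a marker is my reading of the rule above, not code from the decoder.

```python
def unstuff(scan):
    """Split image-data bytes into data bytes and marker codes.

    ff 00 stands for a literal ff data byte; ff followed by anything else
    is treated here as a two-byte marker and kept out of the data.
    """
    data = bytearray()
    markers = []
    i = 0
    while i < len(scan):
        byte = scan[i]
        if byte != 0xFF:
            data.append(byte)
            i += 1
        elif i + 1 < len(scan) and scan[i + 1] == 0x00:
            data.append(0xFF)
            i += 2
        elif i + 1 < len(scan):
            markers.append(scan[i + 1])
            i += 2
        else:
            i += 1                 # lone trailing ff: nothing follows, drop it
    return bytes(data), markers


print(unstuff(bytes.fromhex("12 ff 00 34 ff d0 56")))
```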
+Unpacking the RLE
+-----------------
+	Once a Huffman code is retrieved and decoded, the resulting byte
+represents RLE-compressed data to be uncompressed.  This procedure is
+described quite well in Cryx's document, so I'll just refer you to it.
+This is repeated until you are left with a 64-byte chunk of data which
+needs to be re-ordered and dequantized.  This process is again described
+in Cryx's document; briefly, during the encoding process, the original
+8x8 data is re-ordered into a 64-byte vector as follows:
+	 0  1  5  6 ...
+	 2  4  7 13 ...
+	 3  8 12 17 ...
+	 9 11 18 24 ...
+	10 19 23 ...
+	20 22 ...
+	...
+That is, the first element in the vector is the (0,0) component of the
+8x8 array, the next element is the (1,0) component, the next element is
+the (0,1) component, and so on.  The reason for this "zig-zag" ordering
+is to enhance the RLE-compression, since it concentrates the lower
+frequencies at the beginning of the vector and the higher frequencies --
+most of which are typically zero-amplitude -- at the end of the vector
+(more on this later).  The decoder thus needs to "un-zigzag" the vector
+back into an 8x8 array.  All de-quantization amounts to is multiplying
+each element by a corresponding element in a quantization table:
+	data[i,j] = data[i,j]*quant[i,j]
+The final step is to take the resulting 64-byte chunk and apply the
+inverse discrete cosine transform (IDCT).
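The zig-zag order does not have to be typed in as a table; it can be generated by walking the anti-diagonals. The Python sketch below does that and then un-zigzags and de-quantizes in one pass. The function names are made up, cells are indexed as [row][column], and the quantization table is assumed to be already laid out as an 8x8 array.

```python
def zigzag_positions():
    """Return the 64 (row, col) pairs in the order the vector stores them."""
    cells = [(r, c) for r in range(8) for c in range(8)]
    # Walk the anti-diagonals; odd ones run downward, even ones upward.
    return sorted(cells, key=lambda rc: (rc[0] + rc[1],
                                         rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))


def unzigzag_and_dequantize(vector, quant):
    """Rebuild the 8x8 array from a 64-entry vector and scale it by quant."""
    block = [[0] * 8 for _ in range(8)]
    for k, (r, c) in enumerate(zigzag_positions()):
        block[r][c] = vector[k] * quant[r][c]
    return block


# With vector = 0..63 and a table of ones, the result is the index map itself.
ones = [[1] * 8 for _ in range(8)]
index_map = unzigzag_and_dequantize(list(range(64)), ones)
for row in index_map:
    print(" ".join("%2d" % k for k in row))
```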
+Fourier Transforms and the (I)DCT
+---------------------------------
+	Let's begin with the definition you'll see in any document on
+JPEGS (hear that?  That's the sound of one thousand eyes simultaneously
+glazing over).
+	OK, let's back up a moment.  In computers, grasping new ideas is
+usually straightforward: you read about it, play around with it a little,
+and ah, it makes sense.  Mathematics isn't like that.  These are ideas
+that took people decades and centuries to figure out.  College students
+spend multiple months, working hundreds of problems, to gain just a basic
+working knowledge of a subject.  There's simply a constant learning process.
+Fourier transforms represent a fundamentally different way of thinking,
+and the timescale for enlightenment in the subject is years, not minutes.
+So don't worry if you don't understand everything immediately; the purpose
+of this part isn't to make you an instant expert in Fourier transforms, but
+rather to give you a toehold into the subject that you can expand on over
+time.
+	So, let's begin with a definition that you'll see in any
+document on JPEGS.  The one-dimensional discrete cosine transform (DCT)
+of a function f(x) with eight points (x=0..7) may be written as
+	                 7             2*x+1
+	F(u) = c(u)/2 * sum f(x) * cos(-----*u*PI),	u = 0..7
+	                x=0             16
+where c(0) = 1/sqrt(2) and c=1 otherwise.  This may look very mysterious to
+you, and it should, because it is rather mysterious-looking.  For now,
+think of it as some sort of grinder: you insert f(x) into the grinder,
+turn the crank, and out pops a new function, F(u).  In other words, the
+original function f(x) has been _transformed_ into a new function F(u).
+	Notice that we need to perform a separate sum for each value of u:
+	F(0) = 1/(2*sqrt(2)) * sum f(x)
+	F(1) = 1/2 	     * sum f(x)*cos((2*x+1)*PI/16)
+	F(2) = 1/2	     * sum f(x)*cos((2*x+1)*2*PI/16)
+and so on.  So there are a total of eight summations, each of which
+involves eight summands, for a total of 64 operations to perform.
+	One of the important properties about this transform is that it
+is _invertible_.  That is, you can take a transformed function F(u),
+put it into the other end of the grinder, turn the crank backwards,
+and out pops the original function f(x).  Moreover it is _uniquely_
+invertible -- for every function f(x), there is one and only one transform
+F(u), and vice-versa (the functions f(x) and F(u) are often called
+a transform pair).  In this case, the inverse DCT (IDCT) is given by
+	              7                  2*x+1
+	f(x) = 1/2 * sum c(u)*F(u) * cos(-----*u*PI),	x = 0..7
+	             u=0                  16
+You'll notice that it is very similar to the forward transform, except
+now the sum is over u, and c(u) is inside of the summation; as before,
+there are 64 sums total to perform.  Expanding the sum gives
+	f(x) = 1/2 * ( 1/sqrt(2) F(0) + F(1) * cos((2*x+1)*PI/16) +
+				        F(2) * cos((2*x+1)*2*PI/16) + ...)
+For now, just note that the original function f(x) is given by a sum
+of the transformed function F(u) times different cosine components.
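Both definitions can be typed in almost verbatim, which gives a slow but trustworthy reference to test against. The Python sketch below does exactly that; the function names and the sample data are made up for illustration.

```python
import math


def c(u):
    """The scale factor from the definition: 1/sqrt(2) for u = 0, else 1."""
    return 1 / math.sqrt(2) if u == 0 else 1.0


def dct_1d(f):
    """Forward 8-point DCT, straight from the definition (64 multiplies)."""
    return [c(u) / 2 * sum(f[x] * math.cos((2 * x + 1) * u * math.pi / 16)
                           for x in range(8))
            for u in range(8)]


def idct_1d(F):
    """Inverse 8-point DCT, straight from the definition."""
    return [0.5 * sum(c(u) * F[u] * math.cos((2 * x + 1) * u * math.pi / 16)
                      for u in range(8))
            for x in range(8)]


samples = [52, 55, 61, 66, 70, 61, 64, 73]
restored = idct_1d(dct_1d(samples))
print([round(value, 6) for value in restored])
```

Putting the transform through the "grinder" and back returns the original eight samples, up to floating-point rounding.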
+	The transform of a two-dimensional function f(x,y) is done by
+first taking the transform in one direction (e.g. the x-direction)
+followed by the transform in the other direction (e.g. the y-direction).
+Thus the two-dimensional 8x8 discrete cosine transform of a function
+f(x,y) may be written as
+         c(u)c(v)     7   7               2*x+1             2*y+1
+F(u,v) = --------- * sum sum f(x,y) * cos(-----*u*PI) * cos(-----*v*PI)
+             4       x=0 y=0               16                16
+ u,v = 0,1,...,7
+where, as before, c(0) = 1/sqrt(2) and c=1 otherwise.  The IDCT is then
+given by
+          1     7   7                        2*x+1             2*y+1
+f(x,y) = --- * sum sum c(u)c(v)*F(u,v) * cos(-----*u*PI) * cos(-----*v*PI)
+          4    u=0 v=0                        16                16
+ x,y = 0,1,...,7
+Note that some documentation (e.g. Cryx's document) incorrectly gives c(u)
+and c(v) as c(u,v) = 1/2 for u=v=0 and c(u,v) = 1 otherwise.
+	This is an _extremely_ expensive computation to do, requiring
+64 multiplies of cosines (and computations of the arguments of the
+cosines) to calculate the value at a _single_ point (x,y), and there are
+64 points in each 8x8 block, so, even discounting the argument computation
+(i.e. u*pi*(2*x+1)/16) we're looking at 64*64 = 4096 multiplications for
+_every_ 8x8 block of pixels (where these are 16-bit multiplications).  On
+a C64, in such a case, the decoding time could be measured in hours if not
+days.
+	But if this were the only way to compute a DCT, then JPEGs would
+never have been DCT-based.  There are much faster methods of computing
+Fourier transforms, that take advantage of the symmetries of the transform.
+You may have heard of the Fast Fourier Transform, which is used in almost
+all spectral computing applications; well, there are also fast DCT algorithms.
+The one I used is actually an adaptation of the FFT.
+	So the first task is: where do we find a fast DCT algorithm?  One
+place to look is existing source code, like cjpeg/djpeg.  Unfortunately
+I found it pretty incomprehensible, and hence tough to translate to 65816;
+it is also pretty large.  And it's basically impossible to debug a routine
+that isn't understood (if something goes wrong, then where's the error?).
+	The next place to look is the literature -- many papers have
+been written on fast DCT routines.  Unfortunately, the ones I found were
+quite dense, very general (we only need an 8x8 routine, not an NxN routine),
+and again, fairly complicated.
+	What is needed is a _simple_, but fast, IDCT algorithm.  Salvation
+came in the book by Pennebaker and Mitchell, mentioned at the beginning of
+the article and available in the library.  This book has several 8x8 DCT
+routines in it, with detailed discussions of the algorithms, both one-
+dimensional and two-dimensional.  The 2D one is again fairly lengthy, but
+the 1D ones are pretty fast and straightforward -- something like 29 adds
+and 13 multiplies to compute 8 components of a 1D DCT.  Moreover the 13
+"multiplies" are multiplies by constants, which means table lookups, not
+full multiplications.  Compare with at least 1024 full multiplies and adds
+using the DCT definition, and you can see that the fast routine is
+*hundreds* -- and possibly thousands -- of times faster.  To put this in
+perspective, it's the difference between taking 30 seconds to decode a picture
+and taking 1-2 hours -- maybe even 10 hours or more -- to decode the same
+picture!
+	As mentioned earlier, we can do a 2D IDCT by doing a 1D transform
+of the rows of some 2D array followed by a 1D transform of the columns (or
+vice-versa).  Thus a 1D routine is all that is needed.  Although there
+are specialized 2D routines, they are quite large and significantly more
+complicated than a 1D routine.  Small and straightforward Good; large
+and complicated Bad.  And cjpeg/djpeg makes the observation that they don't
+seem to give much speed gain in practice.
+	There's just one problem -- the book chapter discusses lots of
+_forwards_ DCT routines, but devotes just one paragraph to _inverse_ DCT
+routines!  "Just reverse the flowgraph" is the advice given, with a few
+hints on reversing flowgraphs.
+	To make a long story short, IF you reverse the flowgraph correctly,
+AND you overcome the errors/misleading notation in the book, AND you
+prepare the coefficients correctly before performing the transformation,
+then yes, by golly, it works!  And working code is awfully sweet after days
+of intense frustration!  I have included an easy-to-read Java version of
+the 1D IDCT routine at the end of this article.
+	At this point, the more experienced programmers are asking, how
+do you _know_ it works?  With so many possible 8x8 arrays, how do you test
+and debug such a routine?  To answer these questions, it is important to
+understand a few things about Fourier transforms.  In the process, we shall
+also see why JPEG is based on the DCT, and why it is so effective at
+compressing images.
+Fourier Transforms for dummies
+------------------
+	There are several ways of thinking about a Fourier transform.
+One way to think about it is that you can expand any function in a series
+of sines and cosines:
+	f(x) = a0 + a1*cos(wx) + a2*cos(2wx) + a3*cos(3wx) + ...
+		  + b1*sin(wx) + b2*sin(2wx) + b3*sin(3wx) + ...
+where the a0 a1 etc. are constant coefficients (amplitudes) and w 2w etc.
+are the frequencies.  In the discrete cosine transform, the function is
+expanded solely in terms of cosines:
+	f(x) = a0 + a1*cos(wx) + a2*cos(2wx) + a3*cos(3wx) + ...
+"Taking the transform" amounts to computing the coefficients a0, a1, a2, etc.
+Once you know them, you can reconstruct the original function by adding up
+the cosines.
+	Now, let's forget about computing the coefficients, and stand back
+for a moment and look at that expression.  Each coefficient tells "how much"
+of f(x) is in each cosine component -- for example, the value of a2 says
+"how much" of f(x) is in the cos(2wx) component.  Conversely, each
+coefficient tells us how much of each "frequency" there is in f(x) --
+a2 says how much frequency=2w there is, a0 says how much frequency=0 there
+is, and so on.
+	So another way of thinking about a Fourier transform is that it
+transforms a function from the space (or time) domain into the _frequency_
+domain -- instead of thinking about how much the function varies with x
+(how it varies in space), we can see how it varies with _w_, the frequency;
+instead of looking at "how much f" is at a given point in space or time,
+we can look at "how much f" is at a given frequency.
+	So, imagine measuring something simple, like the voltage coming
+out of a wall socket.  A plot of the signal will be a sinusoidal function --
+this is a graph of how the signal varies with time.  The Fourier transform
+of this signal, however, will have a large spike at 60Hz (or 50Hz if you're
+in Europe or .au).  Small amplitudes of other frequencies will probably be
+seen, too, indicating noise in the signal.  So a graph of how the signal
+varies with _frequency_ might look something like this:
+			|
+			|
+			|
+			|
+	--^^--^----^----+-^---^----
+		      60Hz
+That is, lots of zero or very small amplitude frequencies, and a large
+frequency amplitude at around 60/50Hz.
+	If you've ever seen an equalizer display on a stereo, you've seen
+a Fourier transform -- the lights measure how much of the audio signal there
+is in a given frequency range.  When the bass is heavy, the lower frequencies
+will have large amplitudes.  When there's some high instrument playing (or
+lots of distortion), then the high frequencies will have large amplitudes.
+	Now we can take this a few steps further.  The frequencies convey
+a lot of information.  For example, cos(wx) wiggles very slowly if w is
+small, and wiggles very rapidly if w is large (and it doesn't wiggle at
+all if w=0).  (If you don't see this, just think of x as an angle which
+goes around a circle: if x goes around the circle once, then 7x goes
+around the circle seven times).  Therefore, a function which changes slowly
+will have a lot of low-frequencies in the transform; a function which changes
+rapidly will have large high-frequency components (rapid wiggles give rapid
+changes).
+	The zero frequency is special.  A constant function will have
+only the zero-frequency component (since cos(0x) is a constant).  Moreover,
+the zero-frequency represents the average value of the function over a
+period of cosine -- this is easy to see because the average value of cos(x),
+cos(2x), etc. is 0 over a full period: it is above zero half of the time,
+and below zero the other half of the time, and the two halves cancel.
+	Now consider an image.  A typical photograph changes fairly
+smoothly -- there aren't many sudden sharp changes from black to white.
+This means that the transform of some small area of the picture will have
+fairly large-amplitude low-frequencies, but not much in the way of high
+frequencies.  If those small-amplitude high-frequencies are simply thrown
+away, then the image won't change much at all -- the high frequencies
+represent super-fine details of the picture.  And that's why JPEG is a
+"lossy" algorithm, and why it gets such high compressions -- the idea is
+to throw away the fine details and the unnecessary components, and keep
+just the major features of the picture.  It's also why JPEG isn't so great
+for things like line-art, where the image can change rapidly -- you may
+have noticed that things like slanted lines tend to get jagged in a jpeg.
+	The important point to remember is that high frequencies correspond
+to rapid changes in the image, low frequencies correspond to smooth changes,
+and the zero frequency is the "average" value.  Because there were obviously
+electrical engineers on the JPEG committee, the zero frequency is referred
+to as the "DC component" of the transform, and the nonzero frequencies are
+referred to as the "AC components" (for Direct Current and Alternating
+Current).
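+	To put some numbers on this, here is a minimal Java sketch (not taken
+from jpx; the class and method names are made up for illustration).  It
+computes an 8-point forward DCT straight from the textbook definition,
+assuming the usual JPEG normalization c(0)=1/sqrt(2), c(u)=1 otherwise, and
+prints the coefficients of a constant, a smooth ramp, and a rapidly
+alternating signal.

```java
// Hypothetical helper: slow 8-point forward DCT coded from the definition.
class DctDemo {
    // F(u) = 1/2 * c(u) * sum_{x=0..7} f(x) * cos((2x+1)*u*PI/16)
    static double[] dct1d(double[] f) {
        double[] F = new double[8];
        for (int u = 0; u < 8; u++) {
            double c = (u == 0) ? 1.0 / Math.sqrt(2.0) : 1.0;
            double sum = 0.0;
            for (int x = 0; x < 8; x++)
                sum += f[x] * Math.cos((2 * x + 1) * u * Math.PI / 16.0);
            F[u] = 0.5 * c * sum;
        }
        return F;
    }

    // Print the eight coefficients of one test signal on a single line.
    static void show(String name, double[] f) {
        StringBuilder sb = new StringBuilder(name + ":");
        for (double v : dct1d(f)) sb.append(String.format(" %8.2f", v));
        System.out.println(sb);
    }

    public static void main(String[] args) {
        show("constant   ", new double[] {50, 50, 50, 50, 50, 50, 50, 50});
        show("smooth ramp", new double[] {10, 20, 30, 40, 50, 60, 70, 80});
        show("alternating", new double[] {0, 99, 0, 99, 0, 99, 0, 99});
    }
}
```

+With this normalization the constant input should leave only the DC term
+(2*sqrt(2) times the average value), the ramp should have F(1) as its
+largest AC term, and the alternating signal should have F(7) as its largest.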
+	Finally, for completeness, note that there is a difference between
+a discrete Fourier transform and a continuous Fourier transform, namely
+that one gives the transform in terms of discrete frequencies (w, 2w, 3w,
+etc.) and the other gives the transform as a continuous function of
+frequency.  When dealing with discrete data -- like an 8x8 set of values --
+we necessarily use a discrete transform.
+	Now, how can you test a Fourier transform routine?
+Fourier Transforms for smarties
+------------------
+	The basic question is: how do we know if the IDCT is working
+correctly?  Quite simply, by feeding it a problem we already know the
+answer to.
+	Remember that we are working with transformed data; each element
+represents the amplitude of a specific frequency.  Imagine a transformed
+vector with a single nonzero element, for example, let a3=10 and all the
+other coeffs equal zero.  What will the inverse transform of this vector
+be?  Since a3 is the amplitude of cos(3x), the transform will simply be...
+a3*cos(3x)!  Similarly, if a1 is the only nonzero coefficient, the transform
+will be a1*cos(x).
+	The above explanation actually isn't _quite_ right, because of
+the form of the IDCT used:
+	        1    7                   2*x+1
+	f(x) = --- * sum c(u)*F(u) * cos(-----*u*PI)
+	        2   u=0                   16
+Now it should be easy to see that if, say F(3)=10, and all the other F's
+are zero, then the result of the transform -- whatever transform algorithm
+is used -- must be
+	f(x) = 5*cos(3*PI*(2x+1)/16)
+So, for a one-dimensional IDCT, it is easy to test each component separately
+and compare the result with the actual answer.  But what about a 2D IDCT
+that has many nonzero components?
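+	One way to get the "actual answer" is a slow reference routine coded
+straight from the formula above.  The sketch below is only that -- a
+hypothetical helper, not the fast routine from this article -- and it
+assumes the standard JPEG convention c(0)=1/sqrt(2), c(u)=1 for u>0.

```java
// Reference 1D IDCT straight from the definition: slow, but easy to trust.
class IdctReference {
    // f(x) = 1/2 * sum_{u=0..7} c(u) * F(u) * cos((2x+1)*u*PI/16)
    static double[] idct1d(double[] F) {
        double[] f = new double[8];
        for (int x = 0; x < 8; x++) {
            double sum = 0.0;
            for (int u = 0; u < 8; u++) {
                double c = (u == 0) ? 1.0 / Math.sqrt(2.0) : 1.0;
                sum += c * F[u] * Math.cos((2 * x + 1) * u * Math.PI / 16.0);
            }
            f[x] = 0.5 * sum;
        }
        return f;
    }

    public static void main(String[] args) {
        double[] F = new double[8];
        F[3] = 10.0;                          // single nonzero component
        double[] f = idct1d(F);
        for (int x = 0; x < 8; x++) {
            // closed form for this input: 5*cos(3*PI*(2x+1)/16)
            double known = 5.0 * Math.cos(3 * Math.PI * (2 * x + 1) / 16.0);
            System.out.println("x=" + x + "  idct=" + f[x] + "  known=" + known);
        }
    }
}
```

+A fast routine under test can then be compared against idct1d() one
+component at a time.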
+	There are two important properties of Fourier transforms which
+come into play here.  The first is that Fourier transforms are _linear_;
+a linear operator L satisfies
+	L(c*f1) = c*L(f1), where c = constant
+	L(f1 + f2) = L(f1) + L(f2)
+That is, constants factor out of the operator, and operating on the sum
+of two functions is the same as operating on each function separately
+and adding them together.  As a simple example, consider the operators
+L1(x) = x and L2(x) = x^2.  The first one satisfies the conditions above;
+the second one does not.  Some other linear transforms you may be familiar
+with are rotations, and taking the derivative.  You can test for yourself
+that the Fourier transform satisfies the above conditions; you can also
+look at the fast DCT algorithm and see that it only involves additions and
+multiplications by constants, which are all linear operations.
+	This property is enormously important here.  It first says that we
+can multiply the transformed data by a constant, and the constant will
+just multiply the final answer; said another way, if F(3)=10 with all other
+F's zero works, then we know that F(3)=const*10 will work too, no matter
+what the constant is!  So in testing one component at a time, you can
+pretty confidently say "F(3) works" (as opposed to "F(3)=10 works, and
+F(3)=11 works").  The _only_ thing that can cause problems is overflows
+and other _computer_ issues; the basic algorithm _cannot_.
+	Even more important, however, is that the transform of the sum
+of two functions is equal to the sum of the transforms.  If we know that
+	F1 = (0,0,10,0,0,0,0,0)
+works, and we know that the transform of
+	F2 = (0,0,0,10,0,0,0,0)
+works, then we _know_ that the transform of
+	F1 + F2 = (0,0,10,10,0,0,0,0)
+works!  Moreover, since we can multiply each function by arbitrary constants,
+we know that the transform of
+	(0,0,a,b,0,0,0,0)
+works, _no matter what a and b are_.  So we can _completely_ test a 1D DCT
+simply by testing each component _separately_.  The _only_ things that can
+cause problems are things like overflow, erroneous multiplications, etc.
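+	The linearity argument can also be checked numerically.  This is a
+minimal sketch using a slow definition-based IDCT as a stand-in for whatever
+routine is being tested (the names are hypothetical, and c(0)=1/sqrt(2) is
+the usual JPEG convention, assumed here): it measures the gap between the
+transform of a*F1 + b*F2 and a times the transform of F1 plus b times the
+transform of F2.

```java
class IdctLinearity {
    // Slow 1D IDCT from the definition; stand-in for the routine under test.
    static double[] idct1d(double[] F) {
        double[] f = new double[8];
        for (int x = 0; x < 8; x++) {
            double sum = 0.0;
            for (int u = 0; u < 8; u++) {
                double c = (u == 0) ? 1.0 / Math.sqrt(2.0) : 1.0;
                sum += c * F[u] * Math.cos((2 * x + 1) * u * Math.PI / 16.0);
            }
            f[x] = 0.5 * sum;
        }
        return f;
    }

    // Largest difference between IDCT(a*F1 + b*F2) and a*IDCT(F1) + b*IDCT(F2).
    static double linearityGap(double[] F1, double[] F2, double a, double b) {
        double[] mixed = new double[8];
        for (int u = 0; u < 8; u++) mixed[u] = a * F1[u] + b * F2[u];
        double[] lhs = idct1d(mixed);
        double[] f1 = idct1d(F1);
        double[] f2 = idct1d(F2);
        double gap = 0.0;
        for (int x = 0; x < 8; x++)
            gap = Math.max(gap, Math.abs(lhs[x] - (a * f1[x] + b * f2[x])));
        return gap;
    }

    public static void main(String[] args) {
        double[] F1 = {0, 0, 10, 0, 0, 0, 0, 0};
        double[] F2 = {0, 0, 0, 10, 0, 0, 0, 0};
        System.out.println("gap = " + linearityGap(F1, F2, 3.0, -7.0));
    }
}
```

+In floating point the gap should be down at the rounding-error level; in a
+fixed-point routine a larger gap points at overflow or a bad table entry
+rather than at the algorithm.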
+	Now, what about a 2D IDCT?  The way a 2D IDCT is computed is by
+first transforming in one direction (e.g. the x-direction), then transforming
+in the other direction (e.g. the y-direction).  Therefore, we can compute
+the 2D IDCT by first transforming each row, then transforming each column
+(or vice versa).
+	Therefore, once the 1D IDCT works, so does the 2D IDCT.
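+	The row-then-column procedure can be sketched as follows.  This is
+not the jpx code: it reuses a slow definition-based 1D IDCT (hypothetical
+names, standard c(0)=1/sqrt(2) convention assumed) purely to show how the
+two passes fit together.

```java
class Idct2dSeparable {
    // Slow 1D IDCT from the definition; any working 1D routine could go here.
    static double[] idct1d(double[] F) {
        double[] f = new double[8];
        for (int x = 0; x < 8; x++) {
            double sum = 0.0;
            for (int u = 0; u < 8; u++) {
                double c = (u == 0) ? 1.0 / Math.sqrt(2.0) : 1.0;
                sum += c * F[u] * Math.cos((2 * x + 1) * u * Math.PI / 16.0);
            }
            f[x] = 0.5 * sum;
        }
        return f;
    }

    // 2D IDCT of an 8x8 block: transform every row, then every column.
    static double[][] idct2d(double[][] block) {
        double[][] out = new double[8][];
        for (int r = 0; r < 8; r++) out[r] = idct1d(block[r]);   // row pass
        double[] col = new double[8];
        for (int c = 0; c < 8; c++) {                            // column pass
            for (int r = 0; r < 8; r++) col[r] = out[r][c];
            double[] t = idct1d(col);
            for (int r = 0; r < 8; r++) out[r][c] = t[r];
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] block = new double[8][8];
        block[0][0] = 240;                    // DC only: result is a flat block
        double[][] pixels = idct2d(block);
        System.out.println("every pixel = " + pixels[0][0]);
    }
}
```

+With this normalization a block holding only a DC term of 240 should come
+out as a flat block of 240/8 = 30 (before any level shift is added).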
+	So, to summarize: to test the routine completely we simply need
+to test each component of a 1D IDCT separately, and compare the result
+with the known answer.
+	And if you really want to test it on a 2D set of data, there is
+an example DCT array given in the Wallace paper (and the result of the
+inverse transform).
+Quantization revisited
+----------------------
+	The quantization step filters out all the small-amplitude frequencies.
+A JPEG can have up to four quantization tables; each table is a 64-byte (8x8)
+set of integers.  When encoding a JPEG, taking the DCT of an 8x8 block of data
+leaves an 8x8 block of amplitudes.  Each amplitude is divided by the
+corresponding entry in the quantization table, thus filtering out the small
+amplitudes in a weighted fashion.  The quantized amplitudes are then
+re-ordered into a 64-byte vector which concentrates the lower frequencies
+(the ones more likely to be nonzero) at the beginning of the vector, and
+the higher frequencies (more likely to be zero) at the end of the vector.
+This last step (zig-zag reordering) clearly increases the efficiency of the
+RLE encoding of the amplitude vector.
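+	The encoder and decoder steps just described can be sketched in a few
+lines of Java.  The names are hypothetical, and for simplicity the sketch
+assumes the quantization table is held in natural row-major order (a real
+decoder has to deal with however the table is stored in the file).

```java
class QuantZigZag {
    // order[k] = row*8 + col of the k-th coefficient along the zig-zag path.
    static int[] zigZagOrder() {
        int[] order = new int[64];
        int k = 0;
        for (int s = 0; s < 15; s++) {            // s = row + col: one anti-diagonal
            int lo = Math.max(0, s - 7), hi = Math.min(s, 7);
            for (int i = lo; i <= hi; i++) {
                // even diagonals run from bottom-left up, odd ones top-right down
                int row = (s % 2 == 0) ? hi - (i - lo) : i;
                order[k++] = row * 8 + (s - row);
            }
        }
        return order;
    }

    // Encoder side: divide by the table entry, round, emit in zig-zag order.
    static int[] quantize(double[][] F, int[] qtable) {
        int[] order = zigZagOrder();
        int[] out = new int[64];
        for (int k = 0; k < 64; k++) {
            int pos = order[k];
            out[k] = (int) Math.round(F[pos / 8][pos % 8] / qtable[pos]);
        }
        return out;
    }

    // Decoder side: undo the reordering and multiply by the table entries.
    static int[][] dequantize(int[] zz, int[] qtable) {
        int[] order = zigZagOrder();
        int[][] F = new int[8][8];
        for (int k = 0; k < 64; k++) {
            int pos = order[k];
            F[pos / 8][pos % 8] = zz[k] * qtable[pos];
        }
        return F;
    }
}
```

+Small amplitudes round to zero in quantize() and stay zero after
+dequantize(), which is exactly where the "lossy" part of JPEG happens.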
+	The decoder just reverses these steps -- it dequantizes the data
+(i.e. multiplies by the quantization coefficients) and re-orders the data,
+before performing the IDCT.  Now, you may have noticed that the IDCT routine
+has to prepare the coefficients by multiplying (dividing) by a set of
+constants:
+    for (int i=0; i<8; i++)
+        F[i] = S[i]/(2.0*Math.cos(i*ang/2));
+(This is done because the algorithm is actually an adapted FFT routine).
+In principle, this step can be incorporated into the de-quantization step,
+since dequantization is also just multiplying by constants.  In a 1D
+transform this is very straightforward, but I see no way to extend it
+to the 2D transform.  That is, it is possible to incorporate the above
+into the quantization such that, say, the row transforms will not need
+preparation, but the column transforms will still need the preparation.
+I did not feel that this was a very useful "optimization", and simply
+mention it here for completeness.
+	Note also that a wise programmer would replace the Math.cos
+calls above with constants, if the code were to be actually used in
+a decoder.
+Miscellaneous
+-------------
+	You may recall that all JPEGs begin with FFD8, and JFIF files
+immediately follow this with the FFE0 JFIF segment.  Although most jpegs
+have the JFIF segment, some don't!  For example, some digital cameras do
+not include a JFIF header.  But the files decode just fine if you don't
+worry about it.
+	Moreover, be sure to skip unknown segments using the two-byte
+segment length -- as opposed to, say, moving forwards in the file until
+another valid segment header is found.
+	When reading some of the other jpeg documentation, you'll read
+that the byte $FF is a special byte, to be skipped (unless followed by $00).
+Just to be clear, this only applies to the image data -- $FF is a normal
+data byte within other segments.  Similarly, restart markers only appear
+within the image data.
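+	A minimal sketch of segment skipping, in Java for readability.  The
+class and method names are made up for illustration.  It relies on the
+standard JPEG layout: each segment is $FF, a marker byte, then a two-byte
+big-endian length that counts itself but not the marker.

```java
class JpegSegments {
    // Returns the offset of the first segment with the given marker byte
    // (e.g. 0xDB), or -1 if Start Of Scan (0xDA) or the end of the data
    // is reached first.
    static int findSegment(byte[] data, int wanted) {
        if (data.length < 2 || (data[0] & 0xFF) != 0xFF || (data[1] & 0xFF) != 0xD8)
            return -1;                                   // no $FFD8 at the start
        int pos = 2;
        while (pos + 4 <= data.length) {
            if ((data[pos] & 0xFF) != 0xFF) return -1;   // lost sync: give up
            int marker = data[pos + 1] & 0xFF;
            int length = ((data[pos + 2] & 0xFF) << 8) | (data[pos + 3] & 0xFF);
            if (marker == wanted) return pos;
            if (marker == 0xDA) return -1;               // image data follows
            pos += 2 + length;                           // hop over the whole segment
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = {
            (byte) 0xFF, (byte) 0xD8,                             // start of image
            (byte) 0xFF, (byte) 0xE1, 0x00, 0x04,                 // unknown segment, length 4
            (byte) 0xFF, (byte) 0xDB,                             // ...whose payload looks like a marker
            (byte) 0xFF, (byte) 0xDB, 0x00, 0x03, 0x00,           // the real $FFDB segment
            (byte) 0xFF, (byte) 0xDA, 0x00, 0x02                  // start of scan
        };
        System.out.println("found at offset " + findSegment(data, 0xDB));
    }
}
```

+The test data shows why scanning for the next $FF is a bad idea: the payload
+of the unknown segment contains the bytes $FF $DB, which a naive scan would
+mistake for a segment header.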
+C64 Implementation
+------------------
+	As you probably understand by now, and as we shall see below,
+jpegs on a C64 are far from being an impossible task.  So to wrap up,
+this section will cover the main issues in implementing a jpeg decoder
+on a C64, and examine some of the comments regarding jpegs being
+"impossible" on a C64.
+	One frequently-heard comment was that a C64 doesn't have enough
+memory to decode a jpeg, so let's look at the numbers.  From the preceding
+discussion, jpegs require memory for
+- Quantization tables
+- Huffman trees
+- Image data
+The quantization tables are 64 bytes each, and there are a maximum of
+four -- so, no big deal.  Using the two-byte storage method, the Huffman
+trees typically take up around 1.5k, and using the 5-byte method they take
+on the order of 4k.  The image data is stored in a jpeg on a row-by-row
+basis, where each row is some multiple of 8 lines large.  The normal C64
+display is 320 pixels wide, so that means an image buffer size of
+320x8 = $0A00 bytes per 8 scanlines.
+	So, a few K for the Huffman tables, and a few K for the image
+buffers.  I think you'll agree that these are hardly massive amounts of
+memory.
+	Now, as you may recall, the data decoded from a JPEG file is
+luma/chroma data -- Y (intensity) CrCb (chroma).  For a grayscale picture,
+all that is needed is the intensity -- there's no need to convert to RGB.
+You may also recall that, because of sampling factors, a jpeg might decode
+to 16x16 blocks of data (or more), which means several 320x8 image buffers
+need to be available -- at $0A00 bytes/buffer, there's plenty of buffer
+space available.
+	For a full-color picture, however, all three components need to
+be kept, which means three buffers for each 320x8 row of data, which means
+$1E00 bytes per row.  So there's still plenty of room for multiple buffers.
+Until, that is, you throw IFLI into the mix -- but more on this later.
+The bottom line is that jpegs really don't require much memory.
+	Another common comment was that the C64 was far too slow to do
+the necessary calculations, especially the discrete cosine transforms.
+As was stated earlier, the IDCT routine used in this program needs some
+29 adds and 13 multiplies to do a 1D transform.  More importantly, the
+"multiplies" are always multiplies by a constant -- which means they
+can be implemented using tables.  So, we're talking 29 16-bit adds and
+13 16-bit table-lookups for the IDCT, which is really pretty trivial.
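+	How the tables are laid out in the actual decoder is not spelled out
+here, so the following is only one plausible arrangement, written in Java
+with made-up names: a signed 16-bit value is split into its high and low
+bytes, each byte indexes a 256-entry table of pre-multiplied values, and the
+two results are added.

```java
// Hypothetical table-driven multiply of a signed 16-bit value by a constant c.
class ConstMultiply {
    final int[] lo = new int[256];   // round(c * low byte), low byte as 0..255
    final int[] hi = new int[256];   // round(c * 256 * high byte), high byte as -128..127

    ConstMultiply(double c) {
        for (int i = 0; i < 256; i++) {
            lo[i] = (int) Math.round(c * i);
            hi[i] = (int) Math.round(c * 256 * (byte) i);
        }
    }

    // Approximately c*x for -32768 <= x <= 32767: two lookups and one add.
    int times(int x) {
        return hi[(x >> 8) & 0xFF] + lo[x & 0xFF];
    }

    public static void main(String[] args) {
        ConstMultiply a1 = new ConstMultiply(Math.cos(Math.PI / 4));
        System.out.println("table: " + a1.times(-300)
                + "  exact: " + Math.cos(Math.PI / 4) * -300);
    }
}
```

+Each table entry is rounded separately, so the result can be off by up to
+one unit from the exact product; whether that matters depends on how much
+precision the surrounding fixed-point code carries.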
+	Another important calculation is the dequantization, which means
+doing 64 integer multiplications per 8x8 data block.  Each integer is 8-bits
+large (and the result can be 16-bits), and the multiplications are done
+using the usual fast multiply routine (let f(x)=x^2/4, then
+a*b = f(a+b)-f(a-b)), as described in all the C=Hacking 3D articles.
+Again, not a big deal.
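+	For readers who have not seen the earlier articles, here is the fast
+multiply idea as a minimal Java sketch (the names are made up; a 6502
+version would keep the table as low/high byte pages).  The table holds
+floor(n^2/4); because a+b and a-b are always both even or both odd, the
+fractions dropped by the floor cancel and the result is exact.

```java
class QuarterSquare {
    // SQ[n] = floor(n*n/4) for n = 0..510, enough for two 8-bit operands.
    static final int[] SQ = new int[511];
    static {
        for (int n = 0; n < SQ.length; n++) SQ[n] = (n * n) / 4;
    }

    // Product of two unsigned 8-bit values (0..255): two lookups, one subtract.
    static int multiply(int a, int b) {
        return SQ[a + b] - SQ[Math.abs(a - b)];
    }

    public static void main(String[] args) {
        System.out.println("13 * 200 = " + multiply(13, 200));
    }
}
```

+Using |a-b| as the index works because f(x) = x^2/4 is an even function.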
+	So, in summary, the mathematical calculations are well within
+the grasp of the 64.
+	In fact, all the routines are quite straightforward -- only the
+IDCT routine is special.
+	One important issue is grayscale versus color.  The first program
+released, jpx, is grayscale, and for several very good reasons.  Grayscale
+is much faster to compute, since no RGB conversion needs to be done (the
+intensity Y is exactly the grayscale levels).  It is more memory-efficient,
+since the color components may be thrown away, and the bitmap requirements
+are modest.  And it is easier and faster to render.
+	With some pretty solid fundamental routines and a reasonable
+grasp of the important issues, color was a reasonably straightforward
+addition to the code, with just one problem: memory.  IFLI requires
+32k for the bitmaps.  The IDCT routine uses some 6k of tables.  At least
+two image buffers are needed, for almost 16k.  The RGB conversion code
+uses table lookups.  The renderer needs memory for image buffers and tables.
+The decoder needs memory for Huffman and quantization tables.  When we added
+it all up, there just wasn't room.
+	With a little more thought and planning, though, a few things
+became clear: first, IFLI doesn't use the first three columns, which
+means the image buffers only need to be 296x8 x 3 components = $1BC0 bytes
+(instead of 320x8 x 3).  Typical jpegs use a maximum sampling factor of 2,
+so using just two buffers requires $3780 bytes -- a savings of $0480
+bytes over a 320-pixel wide bitmap.  Moreover, the needs of the renderer
+came out to almost exactly 16K per bitmap, which means that all the data can
+be squished into the two IFLI bitmaps and sorted out later.  So by scrimping
+here and saving there, and economizing on tables and rearranging memory, we
+were able to cram everything into 64k, with just a few hundred bytes to
+spare -- pretty neat.
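+	The buffer sizes quoted in this section are easy to double-check.
+A small Java sketch (hypothetical names), assuming one byte per sample and
+8 scanlines per buffer:

```java
class BufferSizes {
    // Bytes needed for one 8-scanline buffer of the given width and component count.
    static int rowBufferBytes(int widthPixels, int components) {
        return widthPixels * 8 * components;
    }

    public static void main(String[] args) {
        int gray  = rowBufferBytes(320, 1);
        int color = rowBufferBytes(320, 3);
        int ifli  = rowBufferBytes(296, 3);
        System.out.printf("320x8, grayscale     : $%04X%n", gray);       // $0A00
        System.out.printf("320x8, 3 components  : $%04X%n", color);      // $1E00
        System.out.printf("296x8, 3 components  : $%04X%n", ifli);       // $1BC0
        System.out.printf("two 296-wide buffers : $%04X%n", 2 * ifli);   // $3780
        System.out.printf("saved vs. 320 wide   : $%04X%n", 2 * (color - ifli)); // $0480
    }
}
```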
+	And that, I think, sums up JPEG decoding on a C64.
+/*
+ * idct.java -- Attempts to implement the IDCT by reversing the flowgraph
+ * as given in Pennebaker & Mitchell, page 52.
+ *
+ * Almost there!
+ *
+ * SLJ 9/15/99
+ */
+import java.lang.Math.*;
+import java.io.*;
+import java.util.*;
+public class idct2d {
+    // a1=cos(2u), a2=cos(u)-cos(3u), a3=cos(2u), a4=cos(u)+cos(3u), a5=cos(3u)
+    // where u = pi/8
+    static double ang = Math.PI/8;
+//    static double a1=0.7071, a2= 0.541, a3=0.7071, a4=1.307, a5=0.383;
+    static double	a1 = Math.cos(2.0*ang),
+			a2 = Math.cos(ang)-Math.cos(3.0*ang),
+			a3 = Math.cos(2.0*ang),
+			a4 = Math.cos(ang)+Math.cos(3.0*ang),
+			a5 = Math.cos(3.0*ang);
+//    static double[] f = {31, 41, 52, 65, 83, 15, 34, 117},
+    static double[] f = {10, 9.24, 7.07, 3.826, 0, -3.826, -7.07, -9.24},
+	     F = {0, 0, 0, 0, 0, 0, 0, 256},
+	     S = {0, 0, 0, 0, 0, 0, 0, 256};
+    static double[][] trans = new double[8][8];
+    void idct2d() {}
+  void calcIdct() {
+    double t1, t2, t3, t4;
+    // Stage 1
+    for (int i=0; i<8; i++)
+	F[i] = S[i]/(2.0*Math.cos(i*ang/2));
+    F[0] = F[0]*2/Math.sqrt(2.0);
+    t1 = F[5] - F[3];
+    t2 = F[1] + F[7];
+    t3 = F[1] - F[7];
+    t4 = F[5] + F[3];
+    F[5] = t1;
+    F[1] = t2;
+    F[7] = t3;
+    F[3] = t4;
+    //printF();
+    // Stage 2
+    t1 = F[2] - F[6];
+    t2 = F[2] + F[6];
+    F[2] = t1;
+    F[6] = t2;
+    t1 = F[1] - F[3];
+    t2 = F[1] + F[3];
+    F[1] = t1;
+    F[3] = t2;
+    //printF();
+    // Stage 3
+    F[2] = a1*F[2];
+    t1 = -a5*(F[5] + F[7]);
+    F[5] = -a2*F[5] + t1;
+    F[1] = a3*F[1];
+    F[7] = a4*F[7] + t1;
+    //printF();
+    // Stage 4
+    t1 = F[0] + F[4];
+    t2 = F[0] - F[4];
+    F[0] = t1;
+    F[4] = t2;
+    F[6] = F[2] + F[6];
+    //printF();
+    // Stage 5
+    t1 = F[0] + F[6];
+    t2 = F[2] + F[4];
+    t3 = F[4] - F[2];
+    t4 = F[0] - F[6];
+    F[0] = t1;
+    F[4] = t2;
+    F[2] = t3;
+    F[6] = t4;
+    F[3] = F[3] + F[7];
+    F[7] = F[7] + F[1];
+    F[1] = F[1] - F[5];
+    F[5] = -F[5];
+    //printF();
+    // Final stage
+    f[0] = (F[0] + F[3]);
+    f[1] = (F[4] + F[7]);
+    f[2] = (F[2] + F[1]);
+    f[3] = (F[6] + F[5]);
+    f[4] = (F[6] - F[5]);
+    f[5] = (F[2] - F[1]);
+    f[6] = (F[4] - F[7]);
+    f[7] = (F[0] - F[3]);
+  }
+    static public void main(String s[]) {
+	idct2d test = new idct2d();
+	int i,j;
+	// Init to test transform in Wallace paper
+	for (i=0; i<8; i++)
+	  for (j=0; j<8; j++) trans[i][j]=0;
+	trans[0][0] = 240;
+	trans[0][2] = -10;
+	trans[1][0] = -24;
+	trans[1][1] = -12;
+	trans[2][0] = -14;
+	trans[2][1] = -13;
+	//First the row transforms
+	for (i=0; i<8; i++) {
+	  for (j=0; j<8; j++) S[j] = trans[i][j];
+	  test.calcIdct();
+	  for (j=0; j<8; j++) trans[i][j] = f[j];
+	}
+	for (i=0; i<8; i++) {
+	  System.out.println();
+	  for (j=0; j<8; j++) System.out.print((int) trans[i][j]+" ");
+	}
+	System.out.println();
+	System.out.println("Columns:");
+	//Now the column transforms
+	for (i=0; i<8; i++) {
+	  for (j=0; j<8; j++) S[j] = trans[j][i];
+	  test.calcIdct();
+	  for (j=0; j<8; j++) trans[j][i] = f[j]/4 + 128;
+	}
+	//Print it out!
+	for (i=0; i<8; i++) {
+	  System.out.println();
+	  for (j=0; j<8; j++) System.out.print((int) (trans[i][j]+0.5)+" ");
+	}
+	System.out.println();
+    }
+}
+.......
+....
+..
+.                                    C=H 19
+::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
+-------
+Part II: Bringing "True Color" images to the 64
+-------
+         by Adrian Gonzalez <adrianglz@globalpc.net>
+The Commodore 64 has a somewhat limited resolution, 16 predefined colors,
+and some very peculiar restrictions as to the number of different colors
+that can be placed next to each other.  These restrictions make drawing
+colorful pictures on the 64 a difficult task, and displaying full color
+photographic images almost impossible.
+I've been fascinated with bringing full color images to the c64 for a long
+time now.  My first image conversion project was a C program that could
+convert 16 color IFF pictures to koalapaint format.  I started work on this
+project somewhere back in 1992 or so.  It ran on the Amiga, and it was one
+of my first 'serious' C projects, so I was basically refining my C skills
+while doing it.  After some time I rewrote the converter completely and
+added support for Doodle, charsets and a few other things.
+Shortly after and with the help of a few friends on the net, I learned about
+a "magical" graphic mode called FLI.  Before I could do a FLI converter,
+however, somebody on irc #c-64 pointed me to a couple of 'amazing' images
+available on an ftp site that were supposedly in some new, colorful vic
+mode.  I was reluctant because I thought I had seen the best graphics a c64
+could do.  Boy was I wrong.  I was absolutely amazed by this 'new' VIC mode
+called IFLI.  Shortly thereafter the thought of doing an IFLI converter grew
+stronger and stronger in my head and the idea of a FLI converter practically
+vanished.  After several weeks of hard work I came up with my first attempt
+at IFLI conversion.  Several years passed until there was a reason to port
+this converter/renderer to the c64.  The reason, of course, was Steve Judd's
+JPEG decoder.
+My involvement with the JPEG project kind of started before Steve even
+started to work on it.  About two years ago, Nate Dannenberg asked me to
+do a renderer for his QuickCam interface.  I first came up with a 160x100
+renderer in 4 grays.  After that I came up with the 2 gray 320x200 hires
+renderer that was used first for Nate's Quick cam, and later modified to
+work with the first version of Steve's JPEG decoder.  This same renderer
+was later hacked into rendering drazlace grayscale images.
+The big challenge, of course, was porting the full color IFLI renderer to
+the c64.  I don't think I would've ever bothered if it wasn't for jpx.
+We faced the obvious restriction of the c64's limited RAM (The IFLI image
+itself takes up half the c64's memory!).  Things were tight, but in the end,
+it worked out just fine.  But how exactly does the renderer do its magic?
+What's all that garbage on the screen while it's rendering?  Well, I'd
+like to start off by giving a quick explanation of what dithering is, and
+how the renderer uses a particular kind called Floyd-Steinberg dithering.
+Floyd-Steinberg Dithering
+-------------------------
+Dithering is the process of using patterns of two or more colors to
+trick the eye into seeing a different color.  Let's say that you want to
+display 3 shades of gray with just two colors; you could have dither
+patterns such as:
+. . . .  * . * .  * * * *
+. . . .  . * . *  * * * *
+. . . .  * . * .  * * * *
+. . . .  . * . *  * * * *
+. . . .  * . * .  * * * *
+. . . .  . * . *  * * * *
+Where the dots (.) are black pixels and the asterisks (*) are white
+pixels.  If the pixels are small enough, the eye will see the middle
+pattern as a shade of gray.  This is the basic concept behind dithering.
+Floyd-Steinberg dithering is an 'error diffusion' dither algorithm.
+Basically that means that when drawing an image, if a color in the
+source image can't be matched with the available colors we have to use
+the closest available color.  After that we have to figure out the
+difference between the color we wanted to use (source image color) and the
+closest one we had available.  That difference, or error, has to be
+distributed (diffused) amongst adjacent pixels.
+For example, imagine we have a video chip that can only display black and
+white pixels.  Black pixels would be 0% brightness and white pixels 100%
+brightness.  Let's say we want to use this chip to display an image with 100
+shades of gray.  We can store the image as an array of numbers from 0 to 99,
+where 0 represents 0% brightness and 99 represents 100% brightness.  A small
+part of our test image could look something like this (5 x 2 pixel chunk of
+the image):
+  00  25  45  75  99
+      50  80  30  10
+Without dithering, the best we could do is pick the color closest to the one
+we want to display, so we'd end up with something like:
+  00  00  00  99  99
+      99  99  00  00
+Where 00 is black and 99 is white.  Basically, any pixels with brightness
+greater than or equal to 50 were converted to white (99) and the rest were
+converted to black (00), since those are the only two colors our hypothetical
+video chip can display.
+With Floyd-Steinberg error diffusion dithering we also plot the closest
+color we have, but instead of just moving on to the next pixel we calculate
+by how much we were off (error) and diffuse that amount among adjacent pixels.
+Going back to our test image, the first pixel is completely black so we can
+display it right away without incurring any error, because we matched the
+color exactly.  The second pixel (25) is dark gray so we plot it with the
+closest color we can, in this case, black (00).  We then proceed to compute
+the error, which is equal to the color we wanted (25) minus the color we
+had available (00), so for this pixel, the error is +25.  We then diffuse
+the error (+25) to the adjacent pixels.  F-S dithering uses the following
+distribution:
+       C.Pix  7E/16
+E/16  5E/16  3E/16
+Where C.Pix is the current pixel, and E is the error.  Basically that
+means, add seven sixteenths of the error to the pixel to the right of the
+current pixel, five sixteenths of the error to the pixel below the current
+pixel, etc.
+So in our example, we wanted to plot a dark gray pixel (25) but we only
+had black available (00), so the error is +25.  So then we have (rounded
+off)
+(7/16)E = 11
+(5/16)E = 8
+(3/16)E = 5
+(1/16)E	= 2
+When we add this to the original image buffer, we get:
+(Original)
+  00  CP >45< 75  99
+      50  80  30  10
+(Diffused)
+  00  CP >56< 75  99
+      58  85  30  10
+Again, CP stands for 'Current pixel'.  After doing these calculations, we're
+ready to move on to the next pixel.  You'll notice that the third pixel
+(originally 45) would have been plotted as black but now, because of the
+error diffusion, the new value is 56 so we'll plot it as white, and the
+error will be 56-99 = -43.  We then repeat the procedure:
+(7/16)E = -19
+(5/16)E = -13
+etc
+And adjust the buffer accordingly.  Repeat this procedure for each pixel,
+processing each scanline from left to right and scanlines from top to
+bottom and the result is a nice looking dithered image.  Note that errors
+can be positive or negative, so we should prepare for a case such as this:
+ 55  00  00
+ 00  00  00
+Get the 55, plot it as white, and we have an error of -44, so that means
+that our buffer needs to be able to handle negative values as well.  After
+diffusing, the buffer would look like:
+ CP -19  00
+-14  -8  00
+Note also that the 1E/16 was discarded because we're at the left edge of
+the screen.  The same overflow condition applies to the opposite case:
+  44  99  99
+  99  99  99
+The error +44 will make the values of adjacent pixels greater than 99,
+which is the maximum that can be displayed.  The buffer needs to be able to
+hold values large enough to accommodate this.
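+The whole black-and-white procedure fits in a short routine.  What follows is
+a minimal Java sketch, not the renderer's actual code: the names are made up,
+the errors are rounded to the nearest integer, and the buffer is an int array
+so that values can drift below 0 or above 99 as described above.

```java
class FloydSteinberg {
    // Dithers a grayscale image (nominally 0..99) down to black (0) and
    // white (99), in place, left to right and top to bottom.
    static void dither(int[][] img) {
        int h = img.length, w = img[0].length;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int want = img[y][x];
                int have = (want >= 50) ? 99 : 0;   // closest available color
                int err = want - have;              // color we want minus color we have
                img[y][x] = have;
                if (x + 1 < w) img[y][x + 1] += share(err, 7);
                if (y + 1 < h) {
                    if (x > 0) img[y + 1][x - 1] += share(err, 1);
                    img[y + 1][x] += share(err, 5);
                    if (x + 1 < w) img[y + 1][x + 1] += share(err, 3);
                }
            }
        }
    }

    // n/16 of the error, rounded to the nearest integer.
    static int share(int err, int n) {
        return (int) Math.round(err * n / 16.0);
    }

    public static void main(String[] args) {
        // First row as in the example above; the second row's first value
        // (20) is an arbitrary stand-in chosen for this sketch.
        int[][] img = { {0, 25, 45, 75, 99}, {20, 50, 80, 30, 10} };
        dither(img);
        for (int[] row : img) {
            StringBuilder sb = new StringBuilder();
            for (int v : row) sb.append(String.format("%3d", v));
            System.out.println(sb);
        }
    }
}
```

+Shares that would land outside the image (left edge, right edge, bottom row)
+are simply dropped, as in the left-edge example above.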
+Now let's assume our hypothetical video chip manufacturer came up with a new
+video chip that can display 4 grays: black (0), dark gray (33), light gray
+(66), and white (99).  If we want to plot an image with 100 shades of gray
+we will still always plot the closest color we can, i.e. 0-16 will be
+plotted as 0 (black), 17-49 as 33 (dark gray), etc.  The error will be
+positive or negative depending on whether we're under or over the color we
+wanted to plot.  For example, the color 15 would be plotted as 0 (black),
+with an error of +15, while the color 20 would be plotted as 33 (dark gray)
+with an error of -13.  And I think I've managed to confuse everybody
+including myself, but if you read this paragraph over, it should make at
+least some sense.  Always remember the error is computed as the color we
+want minus the color we have.
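+In code, the four-gray case is just a search for the closest level followed
+by a subtraction.  A minimal Java sketch with hypothetical names:

```java
class GrayLevels {
    static final int[] LEVELS = {0, 33, 66, 99};   // black, dark gray, light gray, white

    // Closest displayable level to the wanted value.
    static int nearest(int want) {
        int best = LEVELS[0];
        for (int level : LEVELS)
            if (Math.abs(want - level) < Math.abs(want - best)) best = level;
        return best;
    }

    // Error = color we want minus color we have.
    static int error(int want) {
        return want - nearest(want);
    }

    public static void main(String[] args) {
        System.out.println("15 -> " + nearest(15) + ", error " + error(15));
        System.out.println("20 -> " + nearest(20) + ", error " + error(20));
    }
}
```

+The same routine keeps working when diffused errors push a buffer value
+below 0 or above 99: it just picks black or white and reports the leftover.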
+As if things weren't fun enough, we can also apply this to a full color
+(RGB) display where we have 3 buffers, one for each primary color (red green
+and blue).  Each buffer contains the corresponding levels of each primary
+color for a given pixel.  Everything works exactly the same, except now
+colors are specified as triplets, for example:
+   R   G   B
+(  0,  0,  0) black
+( 99,  0,  0) bright red
+( 99, 99,  0) bright yellow
+( 99, 99, 99) white
+When we plot a color we now have to compute three errors, one for each
+primary color component.  Each component is used to figure out the error for
+its corresponding buffer.  For example, let's say we want to draw a red
+pixel (80, 0, 0) but our video chip can only display bright red (99, 20, 0).
+The error would still be computed as the color we want minus the color we
+can display:
+We want:
+r1=80, g1= 0, b1=0
+We have:
+r2=99, g2=20, b2=0
+The error would be: (r1-r2, g1-g2, b1-b2) = (-19, -20,  0).  After computing
+the error we proceed to distribute it in the same fashion as before, except
+that we now have three image buffers, each with its own error to be
+distributed among its adjacent pixels.  The best way to visualize this is to
+imagine you're displaying 3 independent images, each with its own error.
+In the previous example, we would diffuse the -19 in the red buffer, the -20
+in the green buffer and the 0 in the blue buffer.
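+As a rough sketch (the helper name rgb_error is hypothetical), the per-buffer
+error is just a component-wise subtraction of the two triplets:

```python
def rgb_error(wanted, displayed):
    """Return the (r, g, b) error: wanted minus displayed, per component."""
    return tuple(w - d for w, d in zip(wanted, displayed))

# The red pixel from the example: we want (80, 0, 0), we can show (99, 20, 0).
print(rgb_error((80, 0, 0), (99, 20, 0)))  # (-19, -20, 0)
```

+Each of the three numbers is then diffused in its own buffer, exactly as in
+the grayscale case.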
+With grayscale images, finding which shade of gray was the closest to the
+one we wanted to display was pretty straightforward.  With full color
+images, the way to figure out the closest color changes a little bit.  In
+order to find which of our available colors is the closest match for the
+color we want to display, we need to calculate the 'distance' from the color
+we want to each of the colors we have available and use the one with the
+shortest distance.  To do this you can imagine the RGB color space as a
+cube, with R, G, and B as the 3 axes.  The origin (0,0,0) is
+black, and the corner opposite to the origin (99,99,99) is white, so
+figuring out the distance between two colors is as simple as figuring out
+the distance between two points in space:
+color1 = (r1, g1, b1)
+color2 = (r2, g2, b2)
+d = sqrt( (r1-r2)^2 + (g1-g2)^2 + (b1-b2)^2 )
+Let's say that our video chip can display 5 colors:  black, red, green, blue
+and white.  The RGB triplets for these colors would be:
+( 0, 0, 0): Black
+(99, 0, 0): Red
+( 0,99, 0): Green
+( 0, 0,99): Blue
+(99,99,99): White
+Let's also say we want to find out which of these is the closest match for
+the color (50,80,10).  We have to compute the distance between this color
+and all of our 5 available colors and see which one is the closest.  The
+calculations would be as follows:
+Black:
+sqrt( ( 0-50)^2 + ( 0-80)^2 + ( 0-10)^2 ) = 94.87
+Red:
+sqrt( (99-50)^2 + ( 0-80)^2 + ( 0-10)^2 ) = 94.35
+Green:
+sqrt( ( 0-50)^2 + (99-80)^2 + ( 0-10)^2 ) = 54.42
+Blue:
+sqrt( ( 0-50)^2 + ( 0-80)^2 + (99-10)^2 ) = 129.70
+White:
+sqrt( (99-50)^2 + (99-80)^2 + (99-10)^2 ) = 103.36
+In this case, the color with the shortest distance is Green (54.42).  Note
+that we're not interested in knowing the exact distance, just knowing which
+color has the smallest distance, so it's safe to toss out the square root
+in order to make things faster.  If we don't calculate the square root we
+end up with the following squared distances:
+Black:  9000
+Red:    8901
+Green:  2961
+Blue:  16821
+White: 10683
+Of course, Green still has the smallest distance^2, and we're saved from
+performing a potentially troublesome (and slow) calculation.
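+The comparison can be sketched in a few lines of Python.  The names PALETTE,
+dist2 and closest are illustrative assumptions, and the 5-color palette is
+the hypothetical one from this example, not the real C64 palette.

```python
# Hypothetical 5-color palette from the example (components on a 0..99 scale).
PALETTE = {
    "black": (0, 0, 0),
    "red":   (99, 0, 0),
    "green": (0, 99, 0),
    "blue":  (0, 0, 99),
    "white": (99, 99, 99),
}

def dist2(c1, c2):
    """Squared distance between two RGB triplets (no square root needed)."""
    return sum((a - b) ** 2 for a, b in zip(c1, c2))

def closest(wanted, palette=PALETTE):
    """Return the name of the palette color with the smallest squared distance."""
    return min(palette, key=lambda name: dist2(palette[name], wanted))

print(closest((50, 80, 10)))                  # green
print(dist2(PALETTE["green"], (50, 80, 10)))  # 2961
```

+Skipping the square root is safe because it never changes which distance is
+the smallest.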
+Based on the previous explanation, we're ready to move on to implementing
+Floyd-Steinberg dithering on the C64.  We will need to have the RGB values
+for each C64 color handy in order to be able to compute the error and the
+closest colors for each pixel we want to draw.
+This article would probably end at this point if the C64 would let us
+choose any of the 16 colors for any pixel on the screen, but we're not quite
+that lucky.
+Multicolor Bitmap Mode
+----------------------
+The VIC-II video chip on the C64 has somewhat strict color limitations.  In
+multicolor bitmap mode, the screen has a resolution of 160x200 and it's
+divided into 4x8 pixel 'cells'.  Each of these cells can have up to 3
+different colors out of the C64's 16 colors plus one background color common
+to all cells on the screen.  If we wanted to display a 4x8 cell like this:
+  4  4  3
+  4  3  3
+  3  3  3
+  3  3  0
+  3  3  0
+  3  3  3
+  1  3  3
+  1  1  3
+We could choose color 3 as the background color common to all cells, and the
+colors 0, 1 and 4 as the colors available to this particular cell (called
+foreground, multicolor 0, and multicolor 1).  We can't display any
+additional colors on this cell.  This makes multicolor bitmap mode a very
+tough choice for displaying true color images.
+FLI Mode
+--------
+Flexible Line Interpretation (FLI) mode is a software graphics mode in which
+the video chip is tricked by software in order to achieve higher color
+placement freedom.  It is basically the same as multicolor bitmap mode,
+except that each 4x8 cell is further divided into eight 4x1 cells.  Each
+4x1 cell can have 2 completely independent colors, 1 color common to the
+entire 4x8 cell and one background color common to the entire image (some
+implementations of FLI change the background color on every scanline as well).
+One small downside of FLI mode is that the leftmost 3 columns of cells are
+lost due to the trickery used to get the video chip to fetch color data on
+every scanline.  This means that the effective display area is reduced from
+160x200 to 148x200.
+IFLI Mode
+---------
+IFLI mode or "Interlaced" FLI mode is basically two FLI images alternating
+rapidly.  The C64 has a fixed vertical refresh rate of roughly 60 frames per
+second on NTSC models and 50 frames per second on PAL models.  IFLI
+alternates between two FLI images, displaying
+each for 1/60th of a second (1/50th for PAL), giving the illusion of a
+single blended image with more than 16 colors.  One of the biggest
+advantages of IFLI mode is that one of the FLI images is shifted one hires
+pixel (1/2 of a multicolor pixel) to the right to give a pseudo 320x200
+hires effect.
+For example, let's say a little part of the images looks like this:
+(11 = one multicolor white pixel, 33 = one multicolor cyan pixel, etc)
+Image1
+11335577
+Image2
+ 22446688
+Alternating these two would give an effect that looks like:
+12345678
+Except that the colors would also mix and blur slightly, giving the illusion
+of more colors than the VIC-II can actually display.  Of course, some color
+combinations work better than others.  Don't expect to mix black and white
+and get a nice looking shade of gray (you'll get a very flickery shade of
+gray because of the alternation).
+The renderer in jpz doesn't attempt to mix colors, mainly because I was
+never happy with the results I got by doing that.  Instead, it treats the
+IFLI display as a 'true' 296x200 display capable of displaying any single
+one of the c64's 16 colors in any position.  Note that the 3 column 'bug'
+also applies to IFLI, so the resolution is 296x200 instead of 320x200.
+The color restrictions are somewhat more complex in IFLI mode.  The renderer
+in jpz treats the display as if it were made up of 8x8 cells, with each cell
+divided into eight 8x1 cells, and each of those divided into two 4x1 cells
+(fun, huh?).  To illustrate this better, look at the following 8x8 cell
+sample:
+A I A I A I A I
+B J B J B J B J
+C K C K C K C K
+D L D L D L D L
+E M E M E M E M
+F N F N F N F N
+G O G O G O G O
+H P H P H P H P
+The odd columns belong to a 4x8 cell in the first FLI image and the even
+columns belong to a 4x8 cell in the second FLI image like this:
+Image 1    Image 2
+AAAA       IIII
+BBBB       JJJJ
+CCCC       KKKK
+DDDD       LLLL
+EEEE       MMMM
+FFFF       NNNN
+GGGG       OOOO
+HHHH       PPPP
+Remember the two images are offset by half a multicolor pixel to give the
+pseudo-hires effect.  As for the color restrictions, each 4x1 cell on each
+image has 2 completely independent colors, but each 8x8 cell (the
+combination of the 4x8 cells from the two images) shares one color, and the
+entire image shares one background color.
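+Here is a small sketch of how an 8x8 block might be split into its two 4x8
+cells, assuming the block is held as 8 rows of 8 values.  The name
+split_block is made up, and a real renderer would work on packed C64 memory
+rather than Python lists.

```python
def split_block(block):
    """Split rows of 8 values into (image 1 cell, image 2 cell).

    Odd columns (1st, 3rd, 5th, 7th) go to the first FLI image,
    even columns (2nd, 4th, 6th, 8th) go to the second.
    """
    image1 = [row[0::2] for row in block]
    image2 = [row[1::2] for row in block]
    return image1, image2

block = [list("AIAIAIAI"), list("BJBJBJBJ")]  # first two rows of the sample
image1, image2 = split_block(block)
print(image1[0], image2[0])  # ['A', 'A', 'A', 'A'] ['I', 'I', 'I', 'I']
```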
+The renderer in jpz is divided into two parts.  The first part takes the
+source RGB image and remaps it to the c64's colors, using Floyd-Steinberg
+dithering as described in the first part of this article.  This part outputs
+an array of numbers, each of which corresponds to a c64 color.  The second
+part of the renderer takes this array of c64 colors and displays it in IFLI
+mode as best as it can, taking into consideration the color placement
+limitations mentioned above.
+The second part of the renderer works with blocks of 8x8 pixels and follows
+these steps:
+1) Choose one color as common to the entire 8x8 cell
+2) Choose two colors for each 4x1 cell
+3) Render the 8x8 block (as two 4x8 cells, one on each FLI image)
+In step one the renderer has to determine which one of the C64's 16 colors
+would be the most helpful when chosen as common to the 8x8 block.  This
+means that the common block color should be chosen to aid in 4x1 cells with
+more than 2 different colors (remember that 4x1 cells only have 2 completely
+independent colors for them).  If we wanted to display a 4x1 cell like
+1 15 12 12
+We have two independent colors for the cell, which could be chosen as 1 and
+15.  We need either the common 8x8 block color or the background color to be
+12 so we can correctly display this 4x1 cell.  So how do we decide?  We
+create a histogram!
+A histogram is nothing more than a count of how many pixels of each color we
+have in a particular area (in this case an 8x8 block).  Note that we only
+want to count the cases in which the common block color would actually be
+helpful for displaying a particular 4x1 cell.  This is easier to explain
+with an example 8x8 block:
+1 1 1 1 1 1 1 1
+1 1 1 1 1 1 1 1
+1 1 1 1 1 1 1 1
+1 1 1 1 1 1 1 1
+1 1 1 1 1 1 1 1
+1 1 1 1 1 1 1 1
+1 1 1 1 1 1 1 1
+2 1 3 1 3 1 4 1
+If we count all the colors in this block we would find 60 ones, one 2, two
+3's, and one 4, and we would decide that 1 is the best choice as a common
+color for the 8x8 block because it's the most 'popular' color.  A closer
+look reveals that this block will be rendered as the following 4x8 blocks:
+Image1  Image2
+1111    1111
+1111    1111
+1111    1111
+1111    1111
+1111    1111
+1111    1111
+1111    1111
+2334    1111
+Note that in the last 4x1 cell of image 1 we have 3 different colors.  We
+have the ability to choose only two individual colors for this 4x1 cell, so
+if we choose 2 and 3, we won't be able to display 4 and our common 8x8 block
+color can't help us either.  The best solution in this case is to _not_
+count 4x1 cells with 2 or fewer different colors.  This means that the only
+cell we would count in our histogram is the last 4x1 cell in image 1.  So
+the new histogram would be one 2, two 3's, and one 4.  We would proceed to
+choose 3 as the common 8x8 block color and this allows us to render the
+entire 8x8 block without a single problem!
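+A minimal sketch of this selective histogram, using a block with the counts
+from the example (60 ones, one 2, two 3's, and one 4).  The function name and
+the list-of-rows layout are assumptions for illustration; ties, if any, simply
+go to whichever color was counted first.

```python
from collections import Counter

def common_color_histogram(block):
    """Count colors only in 4x1 cells holding more than 2 different colors."""
    counts = Counter()
    for row in block:
        # Odd columns form the image 1 cell, even columns the image 2 cell.
        for cell in (row[0::2], row[1::2]):
            if len(set(cell)) > 2:
                counts.update(cell)
    return counts

# Seven rows of ones, then a last row holding the 2, the two 3's and the 4.
block = [[1] * 8 for _ in range(7)] + [[2, 1, 3, 1, 3, 1, 4, 1]]
hist = common_color_histogram(block)
print(sorted(hist.items()))       # [(2, 1), (3, 2), (4, 1)]
print(hist.most_common(1)[0][0])  # 3
```

+Only the last image 1 cell (2 3 3 4) is counted, so color 3 wins even though
+color 1 fills almost the whole block.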
+In theory, the same should be done for the background color, in order to
+choose the best background color for the picture we're rendering, but that
+would mean that we have to do a histogram for the entire image before
+starting to render it.  In practice, we don't have enough memory on the C64
+to do this while reserving enough memory for an IFLI display (and decoding a
+JPEG), so we choose black as the default background color.
+The second step in the process is to choose two colors for each 4x1 cell.
+This is done with the same histogram technique described earlier, except we
+have to take into consideration the color we picked as common to the entire
+8x8 block so we don't repeat any colors and have the best chances of
+representing the original image as closely as possible.  Basically, a
+histogram is made for each 4x1 cell, and the top two most popular colors are
+picked, assuming they're not the same as the background color (black) or the
+common 8x8 block color.  For example, let's say the common 8x8 color is
+white (1) and we have a 4x1 cell that looks like this:
+The histogram would be:  two pixels of color 2 (red), one pixel of color 1
+(white) and one pixel of color 3 (cyan).  In this case, since white is
+already our common 8x8 block color, we skip it and pick colors 2 and 3 as
+our 4x1 cell colors.  The same skipping is done with black pixels because
+black is already available as the background color.
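+A rough sketch of this second step follows.  The name pick_cell_colors is
+hypothetical, and the pixel order of the sample cell is an assumption; only
+the counts (two reds, one white, one cyan) come from the example.

```python
from collections import Counter

BACKGROUND = 0  # black, the default background color

def pick_cell_colors(cell, common):
    """Pick up to two colors for a 4x1 cell, skipping background and common."""
    counts = Counter(p for p in cell if p not in (BACKGROUND, common))
    return [color for color, _ in counts.most_common(2)]

# Two red (2) pixels, one white (1), one cyan (3); white is the common color.
print(pick_cell_colors([2, 1, 2, 3], common=1))  # [2, 3]
```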
+The third and last step is to render the actual image with the correct
+bitpairs.  As you may know, multicolor images sacrifice half the horizontal
+resolution in favor of more colors.  Basically, bits are paired up to have 4
+possible combinations:
+00: Background color (black in our case)
+01: Upper nybble of screen memory   (4x1 cell color #1)
+10: Lower nybble of screen memory   (4x1 cell color #2)
+11: Video matrix color nybble (Common 8x8 block color)
+All that's left to do is to output the corresponding bit pairs in each 4x1
+cell to match the colors in the source (remapped) image as closely as
+possible.
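+As a sketch, packing one 4x1 cell into a bitmap byte could look like this.
+It assumes the usual VIC-II multicolor bit pair values (00, 01, 10 and 11 for
+the four sources listed, in that order), with the leftmost pixel in the top
+two bits.  The name encode_cell is made up, and the sketch assumes every
+pixel has already been mapped to one of the four available colors.

```python
def encode_cell(pixels, color1, color2, common, background=0):
    """Pack four multicolor pixels into one byte, leftmost pixel first."""
    pairs = {background: 0b00, color1: 0b01, color2: 0b10, common: 0b11}
    byte = 0
    for p in pixels:
        byte = (byte << 2) | pairs[p]  # KeyError if p is not available
    return byte

# Cell colors 2 and 3, common block color 1, black background.
print(format(encode_cell([2, 1, 2, 3], 2, 3, 1), "08b"))  # 01110110
```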
+Depending on the complexity of the source image, there can be a few or a
+lot of 4x1 cells where we can't match all the colors.  Remember we only have
+2 completely independent colors for each 4x1 cell, and a cell can
+potentially have each pixel be a different color.  When this happens, the
+best we can do is approximate the colors we can't match with the ones we
+have available.  The renderer does this with a color closeness lookup table
+to avoid having to compute the color distances in realtime.
+The table is basically a list of what colors are most similar to any
+particular c64 color, ordered from the most similar to the least.  Let's say
+we want to plot the color white (1) but none of our bitpairs for the current
+cell can represent it.  We have to look up white in our table and get the
+first color closest to it.  If that color isn't available either, we will
+fetch the next closest color from the table and try again until we find a
+match.
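+The lookup might be sketched as follows.  The table entry shown here is an
+invented placeholder, not the real table used by the renderer, and the names
+CLOSENESS and best_available are hypothetical.

```python
# Illustrative excerpt only: for each color, the other colors ordered from
# most to least similar.  These values are made up for the sketch.
CLOSENESS = {
    1: [15, 7, 13, 3],  # white -> light gray, yellow, light green, cyan
}

def best_available(wanted, available, table=CLOSENESS):
    """Return wanted if available, else the first similar color that is."""
    if wanted in available:
        return wanted
    for candidate in table[wanted]:
        if candidate in available:
            return candidate
    return None  # nothing in the table matched

# White is wanted, but the cell only offers black, red, cyan and yellow.
print(best_available(1, {0, 2, 3, 7}))  # 7
```

+Because the table is precomputed, no color distances need to be calculated
+while the image is being rendered.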
+It is worth mentioning that due to the memory limitations of the C64 the
+bitmaps are stored in memory in 'packed' form while rendering.  If you go
+back to the brief description of FLI mode, you'll remember that the leftmost
+3 char columns were lost due to VIC chip limitations.  When rendering, the
+bitmaps are stored contiguously in memory, without these 3 char block gaps
+in order to have enough room to render the entire image.  After the entire
+image is rendered, it is 'unwound' by a small routine and then finally
+displayed in its full IFLI glory.  In the stock version of the renderer you
+can see this 'unwinding' take place right before the image is displayed.
+Also, the colorful blocks on the screen while the image is being rendered
+are the actual buffers where the Floyd-Steinberg dithering is taking place
+(note that all of this is invisible in the SCPU version due to the memory
+mirroring optimizations provided by the hardware).
+Well, that basically wraps up this article.  I hope that it will give the
+reader an idea of the enormous number of calculations that have to take place
+in order to be able to convert the images to a format suitable for viewing
+on our beloved C64.  I also hope it explains the basic principles behind the
+rendering of these images, and why it takes so long for a stock system to
+display them.
+.......
+....
+..
+.                                    - fin -
+</code>