The GTE Blitter

The blitter code within GTE is the heart of the engine. Without a fast blitter, there is no hope of the library becoming useful for writing any sort of game which requires fluid animation. Since the Apple IIgs computer does not contain any sort of hardware support for scrolling or sprites, we are required to be a bit more innovative.

Stack Remapping and the PEA Instruction

A very good initial source of inspiration can be found in IIgs TN #70: Fast Graphics Hints. This Tech Note describes techniques for performing fast animation on the IIgs. The note gets straight to the point -- mapping the stack onto the graphics screen is the fastest way to achieve animation. This is effected in two steps.

  1. Turn on shadowing of Bank $01 -> $E1 by setting bit 3 of the shadow register ($C035) to zero.
  2. Select auxilliary RAM card write by writing to $C005 or setting the appropriate bit at $C068.
This little bit 'o magic forces any writes which would have been written to Bank $00 to be written to Bank $01 instead. This, combined with the two facts that
  1. The Stack lives in Bank $00, and
  2. The graphics screen lives in Bank $01
mean that we can now write to the graphics screen using the stack!

The Field of Dreams

The promise of being able to pump pixels at a rate of 1.25 cycles per pixel is too great a siren song to turn away. However, there are several practical issue which must be dealt with before we can adapt the technique to be flexible enough for the purposes of GTE.

First, consider one of the primary goals of GTE -- to produce full-screen scrolling. Generally speaking, we may need to write a unique value to every word on the graphics screen. Since the super hi-res screen takes up 32000 bytes, we need to use a total of 16000 PEA instructions. This works out to needing 48000 bytes of RAM; well withing the 64K bank limit of the IIgs. However, life is never as simple as this.

Since GTE is tile-based, we assume that we can only draw full tiles into this large firld of PEA instructions. Because the screen may be scrolled smoothly, a partial tile can be seen at any edge of the graphics screen (top, left, bottom and right). Thus, if we need W tiles to cover the width of the physical screen and H tiles to cover the height, our PEA field must be able to hold W+1 and H+1 tiles. The largest tile size GTE supports is 16x16 (we'll see why in a bit), so the effective storage needed for the PEA fields grows to 54432 bytes. There is still one small correction to make. Since the height of the super hi-res screen (200 lines) is not evenly divisiblee by 16, we actuall have to pad the PEA field with an additional 8 lines. This brings the total space for the PEA field to 56448 bytes.

So, our PEA field with a large number of PEA instructions will fit comfortably into one bank of memory. One small problem remains. As it stands, there is no way to draw only a range of values. If we execut to code, it will fill up the entire screen buffer (and then some!) with data, which will inevitably lead to a crash. We need some way of controlling the buffer on a line-by-line basis. This is accomplished by adding a JMP instruction after every 84 PEA instructions which jumps to the beginning of its 84 instruction range. This affords us two benefits -- first, we have limited the data which can be blitted to one physical line of the screen and, second, we have created a circular code buffer which we will leverage to great effect.

As it stands, if we jump into the circular code buffer that blits the data for a single line, we will repeat is ad infinitum. There needs to be a way to gracefully exit after a fixed number of words have been written to the graphics screen. The solution to this problem is self-modifying code. The GTE engine computes both the address to enter the code buffer and the address at which it should be existed. Then, a branch instruction replaces the PEA at the appropriate address. This branch jumps to a point immediately before or after the code buffer, where a Indirect Jump Long instruction returns control to the blitter dispatch loop. the code for the first line of the blit field is shown below. This final version of the blit field takes up a total of 224 * (84*3 + 6) + 7 = 57799 bytes.

000: 4-byte return address
004: JML (0000)
007: PEA $A5A5
010: PEA $5A5A
.
.
.
085: PEA $A5A5
088: PEA $5A5A
091: JMP $0007
094: JML (0000)

From the standpoint of efficiency, it is desirable to be able to patch the blit field with a BRA instruction since, at only 2 bytes, it can be copied in via a single 16-bit STA instruction. One important detail needs to be checked, though. The BRA instruction can only branch to reltive addresses in the range [-128, +127] bytes from address of the following instruction. Since each line of the PEA fields contains 3*84 = 252 bytes, and we need to branch outside this range, things are looking a bit tight.

The furthest offset which is able to branch to the JML instruction preceeding the line can be computed by counting up the bytes until we reach the value $80. First, we have to cover 2 bytes for the BRA instruction itself and 3 bytes to get to the opcode of JML which accounts for a total of 5 bytes. This gives us a remaining range of -123 bytes -- conveniently a multiple of 3 -- which can cover 41 PEA instructions. For the forward case, we need to clear one extra byte to get to the next PEA and the JMP instruction at the end of the line, so we can clear up to 127 - 3 - 1= 123 bytes which covers 41 PEA instructions.

Since there are 84 PEA instruction in the field it appears that we are in trouble, however two subtle facts saves us. First, the BRA instruction itself occupies a PEA instruction, so we really only have to cover a range of 83 instructions. Second, we will not try to brancj both forwards and backwards from the same location. The furthest forward jump will be 1 PEA advanced from the furthest backward jump. Subtracting this off our count and we find our effective range is only 82 PEA instructions, which is exactly what the BRA instruction can cover! The code from above is repeated with the appripriate BRA instructions writted on the side. In addition the extreme cases analyzed above are shown.

000: 4-byte return address
004: JML (0000)
007: PEA $A5A5    BRA -5
010: PEA $5A5A    BRA -8
.
.
131: PEA $5A5A    BRA -128
134: PEA $A5A5    BRA +127
.
.
256: PEA $A5A5    BRA +7
259: PEA $5A5A    BRA +4
262: JMP $0007
265: JML (0000)

Incidentally, this is why GTE only supports tiles up to 16x16 pixels. A larger tile size would necessitate a wider PEA field which would make the entire field larger than 64K in size and force each line to be too large to span with BRA instruction. Since 16x16 is a reasonable size for the reolution of the IIgs, that's what GTE supports.

The Look is Layered

The blit field descriped in the previous section is a very nice, flexible construct. It is able to move data quickly to the graphics screen and allows arbitrary widths to be blitted at any given time. However, GTE leverages a unique aspect of the IIgs memory layout in order to support a second layer with very little performance degredation.

Recall from the first section that we turned on the Auxilliary Memeory Write flag in order to push data onto the graphics screen. We did not force the Read flag to take any particular value. In fact, if we turn off the Aux. Mem. Read flag, we achieve the efffect of reading from Bank $00 while writing to Bank $01. This is a useful trick indeed! Luckily, the 65816 CPU provides an instruction tailor made to take advantage of this situation -- the PEI instruction. PEI stands for Push Effective Indexed and has the effect of reading a 16-bit value from the Direct Page and pushing it on the stack. This instruction is usually used for pushing local varables on the stack when calling functions, but we are going to use it a bit differently.

Assuming we have some graphics data in Bank $00, by setting the Direct Page to point to that data we can instersperse PEI instructions into our blit field in order to copy both the sataic data from the PEA instruction, but also data from bank $00 on the graphics screen. A bit of code might look like this:

004: JML (0000)
007: PEA $A5A5    
010: PEA $5A5A    
013: PEI 00
015: NOP
016: PEI 02
019: NOP
022: PEA $1234
.
.

Notice that the PEI instructions need to be padded with a NOP in order to maintain the 3-bytes-per-instruction requirement. There is one tricky point about this technique that can be a bear to deal with. Imagine that the blitter begins by jumping into the blit field at address 007, the first PEA instruction. The data on the graphics screen (from right-to-left) is:

 ....  34    12   [02L] [02H] [00L] [00H]  5A    5A    A5    A5
where [--L] and [--H] represent the low and high byte of the contents of a direct page location. Now pretend that the screen is scrolled to the right by one word. This means that the blitter code will be entered at address 007 instead of 004 and the data on the graphics screen becomes
 ....  --    --    34    12   [02L] [02H] [00L] [00H]  5A    5A
Notice that the data contained in Bank $00 via the direct page has moves as well. Since we want to create the illusion of two independent layers this is not desirable at all. The way out of this problem is to utilize the fact that the direct page can be set to any addess and, when the playing field is scrolled, to move the direct page an equal amount in the opposite direction. In this case, we would subtract 2 from the direct page which would give effective values of $00 and -$02 to the direct page offsets.
 ....  --    --    34    12   [00L] [00H] [-2L] [-2H]  5A    5A
This is the correct effect since it exposes new data over the "holes" in the foreground layer and keeps existing data in place.

Keeping the Music Alive

GTE makes the trade-off of increasing the per-line overhoead in order to provide low-latency interrupt service. Interrupts are only every disabled long enough to draw a single line of the blitter field to the graphics screen. Even for a full-screen transfer, the interrupts are only disabled for around 500 microseconds, which easily keeps up with the 16.66 millisecond timer used by most music trackers.

Looking at the core of the aligned, no-sprite blitter

NoSpriteSetup anop
              lda    #ReturnAddress
              sta    BlitRtnAddr
              clc

NoSpriteLoop  anop
              lda    [PEA_Ptr],y           ; address of left edge
              adc     #offset              ; add offset for right edge
              sta    BlitEntryAddr         ; the is where to begin

              lda    #endaddr              ; (BasePtr),Width_In_Bytes
              adc    [PEA_Ptr],y
              sta    InstPtr               ; complete the pointer
              lda    [InstPtr]             ; load the PEA instruction
              tax                          ; cache in X-register
              lda    #BRAInst              ; patch with a BRA
	      sta    [InstPtr]

	      sei                          ; disable interrupts
	      lda    StackPtr,y            ; get the starting screen addr
	      tcs

	      lda    [DirectPagePtr],y     ; adjust the direct page
	      sec
	      sbc    directPageAdjust
	      tcd

	      lda    #R0W1                 ; Read Bank 0, Write Bank 1
	      sta    >$E0C068
	      jml    (BlitEntryAdd)        ; Draw the line

ReturnAddress anop
              lda    #SavedDP              ; restore the Direct Page
	      tcd

	      lda    #SavedFlags           ; return to normal RAM config
              sta    >$E0C068
	      
	      lda    #SavedSP              ; restore the Stack Pointer
	      tcd
	      cli                          ; enable interrupts

	      txa                          ; re-patch the PEA instruction
	      sta    [InstPtr]
	      
	      iny                          ; advance to the next line
	      iny

	      ldx    JumpTable,y
	      jmp    (NoSprite,x)

	      dc     i2'BlitterDone'
NoSprite      dc     i2'NoSpriteLoop, SpriteSetup'
The interrupts are disabled just before the stack is mapped to the graphics screen and immediately aftter restoring the stack to its correct location in Bank $00. The only overhead iin this code is the addition of the cli/sei pair, which is very small.

When sprites are added to the mix there are two changes to the blitter code shown above. First, we need to synchronize with the VBL signal to avoid flicker caused by the graphics screen being displayed after drawing the background but before the sprites are overlaid. Second, we modify the code to draw the spries immediately after the background has been drawn. The code which synchronizes the the VBL looks like

SyncVBL       anop
              lda   >$E0C02E               ; load the counter value
	      and   #$80FF                 ; mask out the VBL bits
	      asl   a                      ; shift the word around
	      adc   #0			   ; move MSB -> LSB
	      sec
	      sbc   #$00FA                 ; translate to range [$0 - $105]
	      cmp   mintbl,y
	      bcc   SyncOut
	      cmp   maxtbl,y
	      bcc   SyncVBL
SyncOut	      anop
All the code before the first CMP instruction simply extract the current line the the VBL gun is scanning. The actual work is done by comparing the current line to a point of table values which define a range for which we cannot draw the current line. GTE is currently set to have an 11-line delay, which means if the screen is currently painting any line that is within an 11 line range preceeding the current line, then we need to wait until it passes.

Playing Hot Potato

Sprites can be given a priority which will insert them into the front or rear of the sprite queue when they are encountered. Thus, we have specialized list manipulation routines depending on if the sprite is inserted into an empty list, the front or the rear. The Head and Tail insertion routines both take exactly the same amount of time.

HeadInsertion anop

              ldy    cold_list,x             ; fetch the slot index
              ldx    sprite_slots+ADDR,y     ; get the DP location in Bank $01

	      lda    #$FFFE
	      sta    >$010000+PREV,x         ; curr->prev = nil

	      lda    active_head             ; curr->next = active_head
	      sty    active_head
	      sta    >$010000+NEXT,x

	      tay
	      ldx    sprite_slots+ADDR,y     ; get address of old slot

	      lda    active_head
	      sta    >$010000+PREV,x         ; curr->next->prev = curr
	      tay

	      rts
TailInsertion anop

              ldy    cold_list,x             ; get the new tail ...
	      ldx    sprite_slits+ADDR,y     ; .. and addreess of new slot
	      lda    active_tail             ; save the old tail ...
	      sty    active_tail	     ; .. and set the new tail

	      sta    >$010000+PREV,x	     ; curr->prev = old tail

	      tay
	      lda    #$FFFE
	      sta    >$010000+NEXT,x         ; curr->next = nil

	      ldx    sprite_slots+ADDR,y     ; get the slot address of old slot
	      lda    active_tail             ; get the current slot
	      sta    >$010000+NEXT,x         ; curr->prev->next = curr
	      tay

	      rts
FirstInsertion anop
              tya                            ; Y-reg = $FFFE on entry
	      ldy    cold_list,x             ; fetch the slot index and get
	      ldx    sprite_slots+ADDR,y     ; the DP location in Bank $01

	      sta    >$010000+PREV,x         ; curr->prev = nil
	      sta    >$010000+NEXT,x         ; curr->next = nil

	      tya
	      sta    active_head             ; head = curr
	      sta    active_tail             ; tail = curr

	      rts

Wrapping It Up

Future Directions

Overall, the core of GTE is a highly optimized, flexible and expandible piece of code. That does not mean it's perfect. There are two ascpects of the blitter which would be nice to fix. If a technique could be developed to work around one or both of these issues, please let me know, as I would consider this library 100% complete at that point.

The first, most tractable issue related to the sprites. Quite a bit of work is done in order to synchronize the sprite code the VBL hardware and there is a fair bit of overhead dispatching the sprites to the screen. Linked list manipulations, Bank $01 memory blocks, dispatch tables. All of these are overhead. Ideally I would like to have a zero-overdraw sprite system which would allow each line of the screen to be drawn without synchrinizing with the VBL. Any byte of data whitten to the graphics buffer is written in its final form. I don't believe this is feasible, simply because of how multiple layers are implemented.

A slightly less ambitious fix is to find a way to efficiently merge the PEA/PEI blitter fields with a parallel sprite field. The core idea here is that all the sprites that will be drawn on-screen are drawn into an off-screen buffer. Then, each line of this off-screen buffer is blitted on top of the graphics screen once. This has an advantage of moving code out of the inner loop of the blitter, plus overlapping sprites will only write their data to the graphics screen once. A drawback is that we will effectively be drawing each sprite twice, however I believe the the savings in overhead will more than make up for this penalty.

The real implementation detail, is that the "sprite field" must be sparse so that only those bytes which need to be modified are processed, plus it must be structued as executable code, as well. This is indeed a tricky problem.

The second issue with the blitters is that the parallax layer has only word-sized granularity. By that, I mean every instruction in the PEA/PEI field must be one of those two instructions. And since each instruction writes 4 pixels to the screen, we cannot support pixel-perfect parallax scrolling. I would welcome a solution which could even provide byte-sized granularity.