The GTE Blitter
- Stack Remapping
- Field of Dreams
- The Look is Layered
- Keeping the Music Alive
- Playing Hot Potato
- Wrapping It Up
- Future Directions
The blitter code within GTE is the heart of the engine. Without a fast blitter, there is no hope of the library becoming useful for writing any sort of game which requires fluid animation. Since the Apple IIgs computer does not contain any sort of hardware support for scrolling or sprites, we are required to be a bit more innovative.
Stack Remapping and the PEA Instruction
A very good initial source of inspiration can be found in IIgs TN #70: Fast Graphics Hints. This Tech Note describes techniques for performing fast animation on the IIgs. The note gets straight to the point -- mapping the stack onto the graphics screen is the fastest way to achieve animation. This is effected in two steps.
- Turn on shadowing of Bank $01 -> $E1 by setting bit 3 of the shadow register ($C035) to zero.
- Select auxilliary RAM card write by writing to $C005 or setting the appropriate bit at $C068.
- The Stack lives in Bank $00, and
- The graphics screen lives in Bank $01
The Field of Dreams
The promise of being able to pump pixels at a rate of 1.25 cycles per pixel is too great a siren song to turn away. However, there are several practical issue which must be dealt with before we can adapt the technique to be flexible enough for the purposes of GTE.
First, consider one of the primary goals of GTE -- to produce full-screen scrolling. Generally speaking, we may need to write a unique value to every word on the graphics screen. Since the super hi-res screen takes up 32000 bytes, we need to use a total of 16000 PEA instructions. This works out to needing 48000 bytes of RAM; well withing the 64K bank limit of the IIgs. However, life is never as simple as this.
Since GTE is tile-based, we assume that we can only draw full tiles into this large firld of PEA instructions. Because the screen may be scrolled smoothly, a partial tile can be seen at any edge of the graphics screen (top, left, bottom and right). Thus, if we need W tiles to cover the width of the physical screen and H tiles to cover the height, our PEA field must be able to hold W+1 and H+1 tiles. The largest tile size GTE supports is 16x16 (we'll see why in a bit), so the effective storage needed for the PEA fields grows to 54432 bytes. There is still one small correction to make. Since the height of the super hi-res screen (200 lines) is not evenly divisiblee by 16, we actuall have to pad the PEA field with an additional 8 lines. This brings the total space for the PEA field to 56448 bytes.
So, our PEA field with a large number of PEA instructions will fit comfortably into one bank of memory. One small problem remains. As it stands, there is no way to draw only a range of values. If we execut to code, it will fill up the entire screen buffer (and then some!) with data, which will inevitably lead to a crash. We need some way of controlling the buffer on a line-by-line basis. This is accomplished by adding a JMP instruction after every 84 PEA instructions which jumps to the beginning of its 84 instruction range. This affords us two benefits -- first, we have limited the data which can be blitted to one physical line of the screen and, second, we have created a circular code buffer which we will leverage to great effect.
As it stands, if we jump into the circular code buffer that blits the data for a single line, we will repeat is ad infinitum. There needs to be a way to gracefully exit after a fixed number of words have been written to the graphics screen. The solution to this problem is self-modifying code. The GTE engine computes both the address to enter the code buffer and the address at which it should be existed. Then, a branch instruction replaces the PEA at the appropriate address. This branch jumps to a point immediately before or after the code buffer, where a Indirect Jump Long instruction returns control to the blitter dispatch loop. the code for the first line of the blit field is shown below. This final version of the blit field takes up a total of 224 * (84*3 + 6) + 7 = 57799 bytes.
000: 4-byte return address 004: JML (0000) 007: PEA $A5A5 010: PEA $5A5A . . . 085: PEA $A5A5 088: PEA $5A5A 091: JMP $0007 094: JML (0000)
From the standpoint of efficiency, it is desirable to be able to patch the blit field with a BRA instruction since, at only 2 bytes, it can be copied in via a single 16-bit STA instruction. One important detail needs to be checked, though. The BRA instruction can only branch to reltive addresses in the range [-128, +127] bytes from address of the following instruction. Since each line of the PEA fields contains 3*84 = 252 bytes, and we need to branch outside this range, things are looking a bit tight.
The furthest offset which is able to branch to the JML instruction preceeding the line can be computed by counting up the bytes until we reach the value $80. First, we have to cover 2 bytes for the BRA instruction itself and 3 bytes to get to the opcode of JML which accounts for a total of 5 bytes. This gives us a remaining range of -123 bytes -- conveniently a multiple of 3 -- which can cover 41 PEA instructions. For the forward case, we need to clear one extra byte to get to the next PEA and the JMP instruction at the end of the line, so we can clear up to 127 - 3 - 1= 123 bytes which covers 41 PEA instructions.
Since there are 84 PEA instruction in the field it appears that we are in trouble, however two subtle facts saves us. First, the BRA instruction itself occupies a PEA instruction, so we really only have to cover a range of 83 instructions. Second, we will not try to brancj both forwards and backwards from the same location. The furthest forward jump will be 1 PEA advanced from the furthest backward jump. Subtracting this off our count and we find our effective range is only 82 PEA instructions, which is exactly what the BRA instruction can cover! The code from above is repeated with the appripriate BRA instructions writted on the side. In addition the extreme cases analyzed above are shown.
000: 4-byte return address 004: JML (0000) 007: PEA $A5A5 BRA -5 010: PEA $5A5A BRA -8 . . 131: PEA $5A5A BRA -128 134: PEA $A5A5 BRA +127 . . 256: PEA $A5A5 BRA +7 259: PEA $5A5A BRA +4 262: JMP $0007 265: JML (0000)
Incidentally, this is why GTE only supports tiles up to 16x16 pixels. A larger tile size would necessitate a wider PEA field which would make the entire field larger than 64K in size and force each line to be too large to span with BRA instruction. Since 16x16 is a reasonable size for the reolution of the IIgs, that's what GTE supports.
The Look is Layered
The blit field descriped in the previous section is a very nice, flexible construct. It is able to move data quickly to the graphics screen and allows arbitrary widths to be blitted at any given time. However, GTE leverages a unique aspect of the IIgs memory layout in order to support a second layer with very little performance degredation.
Recall from the first section that we turned on the Auxilliary Memeory Write flag in order to push data onto the graphics screen. We did not force the Read flag to take any particular value. In fact, if we turn off the Aux. Mem. Read flag, we achieve the efffect of reading from Bank $00 while writing to Bank $01. This is a useful trick indeed! Luckily, the 65816 CPU provides an instruction tailor made to take advantage of this situation -- the PEI instruction. PEI stands for Push Effective Indexed and has the effect of reading a 16-bit value from the Direct Page and pushing it on the stack. This instruction is usually used for pushing local varables on the stack when calling functions, but we are going to use it a bit differently.
Assuming we have some graphics data in Bank $00, by setting the Direct Page to point to that data we can instersperse PEI instructions into our blit field in order to copy both the sataic data from the PEA instruction, but also data from bank $00 on the graphics screen. A bit of code might look like this:
004: JML (0000) 007: PEA $A5A5 010: PEA $5A5A 013: PEI 00 015: NOP 016: PEI 02 019: NOP 022: PEA $1234 . .
Notice that the PEI instructions need to be padded with a NOP in order to maintain the 3-bytes-per-instruction requirement. There is one tricky point about this technique that can be a bear to deal with. Imagine that the blitter begins by jumping into the blit field at address 007, the first PEA instruction. The data on the graphics screen (from right-to-left) is:
.... 34 12 [02L] [02H] [00L] [00H] 5A 5A A5 A5where [--L] and [--H] represent the low and high byte of the contents of a direct page location. Now pretend that the screen is scrolled to the right by one word. This means that the blitter code will be entered at address 007 instead of 004 and the data on the graphics screen becomes
.... -- -- 34 12 [02L] [02H] [00L] [00H] 5A 5ANotice that the data contained in Bank $00 via the direct page has moves as well. Since we want to create the illusion of two independent layers this is not desirable at all. The way out of this problem is to utilize the fact that the direct page can be set to any addess and, when the playing field is scrolled, to move the direct page an equal amount in the opposite direction. In this case, we would subtract 2 from the direct page which would give effective values of $00 and -$02 to the direct page offsets.
.... -- -- 34 12 [00L] [00H] [-2L] [-2H] 5A 5AThis is the correct effect since it exposes new data over the "holes" in the foreground layer and keeps existing data in place.
Keeping the Music Alive
GTE makes the trade-off of increasing the per-line overhoead in order to provide low-latency interrupt service. Interrupts are only every disabled long enough to draw a single line of the blitter field to the graphics screen. Even for a full-screen transfer, the interrupts are only disabled for around 500 microseconds, which easily keeps up with the 16.66 millisecond timer used by most music trackers.
Looking at the core of the aligned, no-sprite blitter
NoSpriteSetup anop lda #ReturnAddress sta BlitRtnAddr clc NoSpriteLoop anop lda [PEA_Ptr],y ; address of left edge adc #offset ; add offset for right edge sta BlitEntryAddr ; the is where to begin lda #endaddr ; (BasePtr),Width_In_Bytes adc [PEA_Ptr],y sta InstPtr ; complete the pointer lda [InstPtr] ; load the PEA instruction tax ; cache in X-register lda #BRAInst ; patch with a BRA sta [InstPtr] sei ; disable interrupts lda StackPtr,y ; get the starting screen addr tcs lda [DirectPagePtr],y ; adjust the direct page sec sbc directPageAdjust tcd lda #R0W1 ; Read Bank 0, Write Bank 1 sta >$E0C068 jml (BlitEntryAdd) ; Draw the line ReturnAddress anop lda #SavedDP ; restore the Direct Page tcd lda #SavedFlags ; return to normal RAM config sta >$E0C068 lda #SavedSP ; restore the Stack Pointer tcd cli ; enable interrupts txa ; re-patch the PEA instruction sta [InstPtr] iny ; advance to the next line iny ldx JumpTable,y jmp (NoSprite,x) dc i2'BlitterDone' NoSprite dc i2'NoSpriteLoop, SpriteSetup'The interrupts are disabled just before the stack is mapped to the graphics screen and immediately aftter restoring the stack to its correct location in Bank $00. The only overhead iin this code is the addition of the cli/sei pair, which is very small.
When sprites are added to the mix there are two changes to the blitter code shown above. First, we need to synchronize with the VBL signal to avoid flicker caused by the graphics screen being displayed after drawing the background but before the sprites are overlaid. Second, we modify the code to draw the spries immediately after the background has been drawn. The code which synchronizes the the VBL looks like
SyncVBL anop lda >$E0C02E ; load the counter value and #$80FF ; mask out the VBL bits asl a ; shift the word around adc #0 ; move MSB -> LSB sec sbc #$00FA ; translate to range [$0 - $105] cmp mintbl,y bcc SyncOut cmp maxtbl,y bcc SyncVBL SyncOut anopAll the code before the first CMP instruction simply extract the current line the the VBL gun is scanning. The actual work is done by comparing the current line to a point of table values which define a range for which we cannot draw the current line. GTE is currently set to have an 11-line delay, which means if the screen is currently painting any line that is within an 11 line range preceeding the current line, then we need to wait until it passes.
Playing Hot Potato
Sprites can be given a priority which will insert them into the front or rear of the sprite queue when they are encountered. Thus, we have specialized list manipulation routines depending on if the sprite is inserted into an empty list, the front or the rear. The Head and Tail insertion routines both take exactly the same amount of time.
HeadInsertion anop ldy cold_list,x ; fetch the slot index ldx sprite_slots+ADDR,y ; get the DP location in Bank $01 lda #$FFFE sta >$010000+PREV,x ; curr->prev = nil lda active_head ; curr->next = active_head sty active_head sta >$010000+NEXT,x tay ldx sprite_slots+ADDR,y ; get address of old slot lda active_head sta >$010000+PREV,x ; curr->next->prev = curr tay rts
TailInsertion anop ldy cold_list,x ; get the new tail ... ldx sprite_slits+ADDR,y ; .. and addreess of new slot lda active_tail ; save the old tail ... sty active_tail ; .. and set the new tail sta >$010000+PREV,x ; curr->prev = old tail tay lda #$FFFE sta >$010000+NEXT,x ; curr->next = nil ldx sprite_slots+ADDR,y ; get the slot address of old slot lda active_tail ; get the current slot sta >$010000+NEXT,x ; curr->prev->next = curr tay rts
FirstInsertion anop tya ; Y-reg = $FFFE on entry ldy cold_list,x ; fetch the slot index and get ldx sprite_slots+ADDR,y ; the DP location in Bank $01 sta >$010000+PREV,x ; curr->prev = nil sta >$010000+NEXT,x ; curr->next = nil tya sta active_head ; head = curr sta active_tail ; tail = curr rts
Wrapping It Up
Future Directions
Overall, the core of GTE is a highly optimized, flexible and expandible piece of code. That does not mean it's perfect. There are two ascpects of the blitter which would be nice to fix. If a technique could be developed to work around one or both of these issues, please let me know, as I would consider this library 100% complete at that point.
The first, most tractable issue related to the sprites. Quite a bit of work is done in order to synchronize the sprite code the VBL hardware and there is a fair bit of overhead dispatching the sprites to the screen. Linked list manipulations, Bank $01 memory blocks, dispatch tables. All of these are overhead. Ideally I would like to have a zero-overdraw sprite system which would allow each line of the screen to be drawn without synchrinizing with the VBL. Any byte of data whitten to the graphics buffer is written in its final form. I don't believe this is feasible, simply because of how multiple layers are implemented.
A slightly less ambitious fix is to find a way to efficiently merge the PEA/PEI blitter fields with a parallel sprite field. The core idea here is that all the sprites that will be drawn on-screen are drawn into an off-screen buffer. Then, each line of this off-screen buffer is blitted on top of the graphics screen once. This has an advantage of moving code out of the inner loop of the blitter, plus overlapping sprites will only write their data to the graphics screen once. A drawback is that we will effectively be drawing each sprite twice, however I believe the the savings in overhead will more than make up for this penalty.
The real implementation detail, is that the "sprite field" must be sparse so that only those bytes which need to be modified are processed, plus it must be structued as executable code, as well. This is indeed a tricky problem.
The second issue with the blitters is that the parallax layer has only word-sized granularity. By that, I mean every instruction in the PEA/PEI field must be one of those two instructions. And since each instruction writes 4 pixels to the screen, we cannot support pixel-perfect parallax scrolling. I would welcome a solution which could even provide byte-sized granularity.