Compiled Sprites

For those reader who are technically inclined, I thought I'd write about how compiled sprites are used in GTE. There are several articles on the Internet about compiled sprites, mostly related to the Allegro game programming library, but all of them seem to say pretty much the same thing about compiled sprites:

  1. They are fast, but
  2. They take up a lot of space
  3. They are inflexible
  4. You can't clip the sprite to the screen edges
  5. You can't manipulate the sprite
  6. Don't use them!!

My experience has been different. Certainly the items listed are valid concerns and they do cause problems if compiled sprites are created in a naive fashion. So, let me take you through the process from the straightforward method to what GTE uses internally.

First Attempts

There are several IIGS-specific sprite compilers out there such as Mr. Sprite, and ???. The basic idea it to take a rectangular chunk of an image, with a mask, and produce a block of assembly code which copies the data onto the graphics screen as quickly as possible. A typical bit of code that is produced looks like

; screen address stored in the x-register
   sei                ; disable interrupts
   phd                ; save the Direct Page
   lda  >$E0C068      ; turn on Bank 1 R/W
   ora  #$0030
   sta  >$E0C068  

   txa
   tcd                ; set the direct page to point to the graphics screen

   lda  <00
   and  #MASK
   ora  #DATA
   sta  <00

   .
   .
   .

   tdc
   clc
   adc  #160          ; go to the next line
   tcd

   lda  <00           ; draw a word that has masking
   and  #MASK
   ora  #DATA
   sta  <00

   lda #DATA          ; draw a word that needs no masking
   sta <02
   .
   .
   .

   lda  >$E0C068      ; turn on Bank 0 R/W
   and  #$FFCF
   sta  >$E0C068
   pld                ; restore Direct Page
   cli                ; enable interrupts
   rtl
A slight modification to the above code is that, if the sprite is less than 192 pixels wide, then two lines can be drawn before increasing the DP (by 320 instead of 160). This makes the code slightly faster since it avoids every other addition; a savings of 4.5 cycles/line.

Clipping to the screen

At first glance, the code presented above appears very inflexible. How could we possible introduce clipping without at least having to keep a running count of the current (x, y) position and making a comparison at each write. This would eliminate any speed advantage that the compiled sprites possessed as well as increase the amount of space needed to store them. Even if that route were taken, one would still have to deal with the case of words which partially overlap the edge of the screen, introducing yet more complexity.

Fortunately, there is a much simpler solution. Look at what we are doing to the data already. We are using a mask to mark which pixels of the image are background and which need to be overwritten by the sprite data. Perhaps there is a way to have a mask which tells the sprite not to draw if it goes past an edge? In fact, we can do exactly this.

So how can we code such an operation? We want the screen mask to be a flag which effectively says 'if this bit is set, don't change the underlying bit, if it is clear, set the underlying bit to the sprite's value'. This can be explicitly represented in a truth table or K-map, but the actual sequence of operations to implement it efficiently are not obvious.

Truth Table                     K-Map
 d  = sprite data
 m  = sprite mask
 s  = screen data
 m' = screen mask

d m s m' | output               \ d m
---------+-------           s m' \ 00  01  11  10
0 0 0 0  | 0                      +---+---+---+---+
0 0 0 1  | 0                   00 | 0 | 0 | X | 0 | 
0 0 1 0  | 1                      +---+---+---+---+
0 0 1 1  | 0                   01 | 0 | 0 | X | 1 | 
0 1 0 0  | 0                      +---+---+---+---+
0 1 0 1  | 0                   11 | 0 | 1 | X | 1 | 
0 1 1 0  | 1                      +---+---+---+---+
0 1 1 1  | 1                   10 | 1 | 1 | X | 1 | 
1 0 0 0  | 0                      +---+---+---+---+
1 0 0 1  | 1
1 0 1 0  | 1
1 0 1 1  | 1
1 1 0 0  | X  (this case doesn't happen,
1 1 0 1  | X     if the pixel is transparent
1 1 1 0  | X     the the sprite data is set
1 1 1 1  | X     to zero )

Solution
  f(d, m, s, m') = d ^ s & ~m & m' ^ s


    \ d m                    \ d m                   \ d m
s m' \ 00  01  11  10    s m' \ 00  01  11  10   s m' \ 00  01  11  10
      +---+---+---+---+        +---+---+---+---+       +---+---+---+---+
   00 | 0 | 0 | 1 | 1 |     00 | 0 | 0 | 0 | 0 |    00 | 0 | 0 | 1 | 1 |
      +---+---+---+---+        +---+---+---+---+       +---+---+---+---+
   01 | 0 | 0 | 1 | 1 |     01 | 0 | 0 | 0 | 0 |    01 | 0 | 0 | 1 | 1 |
      +---+---+---+---+ XOR    +---+---+---+---+ =     +---+---+---+---+ 
   11 | 0 | 0 | 1 | 1 |     11 | 1 | 1 | 1 | 1 |    11 | 1 | 1 | 0 | 0 |
      +---+---+---+---+        +---+---+---+---+       +---+---+---+---+
   10 | 0 | 0 | 1 | 1 |     10 | 1 | 1 | 1 | 1 |    10 | 1 | 1 | 0 | 0 |
      +---+---+---+---+        +---+---+---+---+       +---+---+---+---+

              d          ^             s         =           t1 

    \ d m                    \ d m                   \ d m
s m' \ 00  01  11  10    s m' \ 00  01  11  10   s m' \ 00  01  11  10
      +---+---+---+---+        +---+---+---+---+       +---+---+---+---+
   00 | 0 | 0 | 1 | 1 |     00 | 1 | 0 | 0 | 1 |    00 | 0 | 0 | 0 | 1 |
      +---+---+---+---+        +---+---+---+---+       +---+---+---+---+
   01 | 0 | 0 | 1 | 1 |     01 | 1 | 0 | 0 | 1 |    01 | 0 | 0 | 0 | 1 |
      +---+---+---+---+ AND    +---+---+---+---+ =     +---+---+---+---+ 
   11 | 1 | 1 | 0 | 0 |     11 | 1 | 0 | 0 | 1 |    11 | 1 | 0 | 0 | 0 |
      +---+---+---+---+        +---+---+---+---+       +---+---+---+---+
   10 | 1 | 1 | 0 | 0 |     10 | 1 | 0 | 0 | 1 |    10 | 1 | 0 | 0 | 0 |
      +---+---+---+---+        +---+---+---+---+       +---+---+---+---+

              t1         &             ~m        =           t2 

    \ d m                    \ d m                   \ d m
s m' \ 00  01  11  10    s m' \ 00  01  11  10   s m' \ 00  01  11  10
      +---+---+---+---+        +---+---+---+---+       +---+---+---+---+
   00 | 0 | 0 | 0 | 1 |     00 | 0 | 0 | 0 | 0 |    00 | 0 | 0 | 0 | 0 |
      +---+---+---+---+        +---+---+---+---+       +---+---+---+---+
   01 | 0 | 0 | 0 | 1 |     01 | 1 | 1 | 1 | 1 |    01 | 0 | 0 | 0 | 1 |
      +---+---+---+---+ AND    +---+---+---+---+ =     +---+---+---+---+ 
   11 | 1 | 0 | 0 | 0 |     11 | 1 | 1 | 1 | 1 |    11 | 1 | 0 | 0 | 0 |
      +---+---+---+---+        +---+---+---+---+       +---+---+---+---+
   10 | 1 | 0 | 0 | 0 |     10 | 0 | 0 | 0 | 0 |    10 | 0 | 0 | 0 | 0 |
      +---+---+---+---+        +---+---+---+---+       +---+---+---+---+

              t2         &             m'        =           t3 

    \ d m                    \ d m                   \ d m
s m' \ 00  01  11  10    s m' \ 00  01  11  10   s m' \ 00  01  11  10
      +---+---+---+---+        +---+---+---+---+       +---+---+---+---+
   00 | 0 | 0 | 0 | 0 |     00 | 0 | 0 | 0 | 0 |    00 | 0 | 0 | 0 | 0 |
      +---+---+---+---+        +---+---+---+---+       +---+---+---+---+
   01 | 0 | 0 | 0 | 1 |     01 | 0 | 0 | 0 | 0 |    01 | 0 | 0 | 0 | 1 |
      +---+---+---+---+ XOR    +---+---+---+---+ =     +---+---+---+---+ 
   11 | 1 | 0 | 0 | 0 |     11 | 1 | 1 | 1 | 1 |    11 | 0 | 1 | 1 | 1 |
      +---+---+---+---+        +---+---+---+---+       +---+---+---+---+
   10 | 0 | 0 | 0 | 0 |     10 | 1 | 1 | 1 | 1 |    10 | 1 | 1 | 1 | 1 |
      +---+---+---+---+        +---+---+---+---+       +---+---+---+---+

              t3         ^             s        =         f(d, m, s, m') 

; given screen data = $ABCD, sprite data = $1200, sprite mask = $00FF, screen mask = $F0F0

  lda <00             ; load the screen data     $ABCD
  eor #DATA           ; XOR with sprite data     $ABCD ^ $1204 = $B9C8
  and #~MASK          ; and with the inv. mask   $B9C8 & $FF00 = $B900
  and screen,y        ; and with the scn. mask   $B908 & $F0F0 = $B000
  eor <00             ; XOR back with the data   $B000 ^ $ABCD = $1BCD
  sta <00             ; store the final data     $1BCD

The trick to understanding the code snippet above are the two identities

We XOR the two pieces of data together and then use the masks to set bits equal to zero where appropriate and then use a final XOR to either get back the screen or sprite data.

So, using this technique, the code generated for each sprite becomes something like

; screen address stored in the x-register
; mask address stored in the y-register, bank register points to the mask data

   txa
   tcd                ; set the direct page to point to the graphics screen

   lda  <00           ; draw a word that has masking
   eor	#DATA
   and  #~MASK
   and  |0000,y
   eor	<00
   sta  <00

   tdc
   clc
   adc  #160          ; go to the next line
   tcd

   tya
   clc
   adc  #320
   tay

   lda <02            ; draw a word that needs no masking
   eor	#DATA
   and  |0000,y
   eor	<00
   sta <02

As you can see, we do incur a penalty that even if a word needs no masking, we still have to do most of the work to make sure it is clipped properly.

Scanline Decomposition

So, now that we've cleverly solved the clipping problem, we're done, right? Not yet!! While the clipping solution above takes care of the left and right sides of the screen, we still must deal with clipping to the top and bottom. If the sprite is left as one contiguous code block, then the screen mask would have to extend above and below the physical screen by one full screen height, which would triple the memory required. This is not a good thing.

The get around this problem, we need to be able to draw each line of the sprite independently. That way, if the sprite is above the top of the screen, we just start drawing at the first visible line; and if the sprite extends below the screen, we stop drawing after the last visible line.

To do this, we make a very small and simple dispatch mechanism utilizing the JMP (0000,x) instruction. Aside from being able to use a jump table directly, this instruction loads the address based on the value in the K register, not the B register like every other absolute addressing mode. One detail is that we add guard entries at line positions -1 and height+1 in the jump table to signal when there are no more lines to draw. So, we decompose our sprite as follows

; screen address stored in the accumulate
; mask address stored in the y-register, bank register points to the mask data
; current line * 2 stored in the x-register

;
; Generic sprite dispatch code
;
draw:
   tcd                ; set the direct page to point to the graphics screen
   jmp  sprite

nextline:
   tdc
   clc
   adc  #160          ; go to the next line
   tcd

   tya
   clc
   adc  #320
   tay
   tcd                ; set the direct page to point to the graphics screen

   txa
   clc
   adc  #STEP         ; can by +2 or -2
   tax

   jmp  sprite

alldone:
   rtl

;
; Sprite Code
;
sprite:
   jmp  (jmptbl, x)

   dc   a2'done'
jmptbl:
   dc   a2'line1'
   dc   a2'line2'
   .
   .
   dc   a2'lineN'
   dc   a2'done'

done:
   jmp  alldone

line1:
   lda  <00           ; draw a word that has masking
   eor	#DATA
   and  #~MASK
   and  |0000,y
   eor	<00
   sta  <00
   jmp  nextline
   
line2:
   lda <02            ; draw a word that needs no masking
   eor	#DATA
   and  |0000,y
   eor	<00
   sta <02
   jmp nextline

So, using the structure above, we can take care of vertically clipping the sprite and, as a bonus, we can flip the sprite vertically just by stepping through the scan lines in reverse order. It's always good to have some benefit for adding extra complexity.

Scanline Merging

The details described so far are what GTE currently uses for its compiled sprite implementation. However, there is still room for improvement. Consider a sprite that is just a simple, solid square. Since every scan line of the sprite is identical, we could just generate code to draw a simple line and then set every address in jmptbl to point to that line. This would effectively compress the memory needed to store the sprite without slowing down the execution.

The biggest sticking point in implementing scanline merging is how to make it fast enough to be practical. A first attempt might be to simply compare each line to those before it and look for a match. This is undesirable for two reasons: First, this is what is known as an O(n2) algorithm, which means its execution time grows quadratically with the size of the input. Thus, a sprite which is 16 lines high would take four times as long to compile as one which is only 8 lines high. Second, it is very time intensive to compare the entire width of two sprites. As the sprites get bigger, this leads to a linear slow down.

The technique GTE uses to mitigate these issues is to extract a line of data from the sprite and then compute a 16-bit CRC value. We can then use this single value to compare for equality. Of course, there are different byte sequences which yield the same CRC, but this will only happen, on average, once every 65536 lines. In practice this is even less of an issue since the lines of a single sprite are spatially related and thus not independent samples.

Once the CRCs for each line are computed, they are paired with the line index, (CRC[i], i), and then sorted by CRC value. Now it is a simple matter to scan though the list in CRC order and compile a line once for each CRC value and set the dispatch pointers for each line i to point to the generated code.

In limited testing, I've seen savings between 0% (all lines different) and 95% (all lines the same). On average, there are about 2 or 3 pairs of identical scanlines which represents a savings of about 10-20% for a typical sprite of height 16-24 pixels. Once GTE is able to save compiled sprites out to disk, the extra time needed for this type of compilation will be mitigated and this should be the default choice for most users. Remember, it cannot increase the size of the sprite, only decrease it.

Sprite Packs (Not Implemented)

Building on the idea of Scanline Merging, there are often multiple frames of animation for a single sprite. Sometimes the frames differ in subtle ways and most of the data is the same between the two. Instead of limiting scanline merging to only one sprite, we designate a set of sprites and share scan lines between as many as possible.

This process has to be bounded since the addresses in the jump table limit the code size of a sprite pack to 64K, but, in practice, this would be a valuable optimization.

Conclusion

So, let's revisit the warnings given to not use compiled sprites.

  1. They take up a lot of space
    Not true! As demonstrated with scanline merging, the compiled sprite can actually be smaller than the raw pixel data.
  2. They are inflexible
    False! By using different code generators, many different effects can be achieved. Some which are very difficult to do otherwise
  3. You can't clip the sprite to the screen edges
    Asolutely False! This was the first hurdle we overcame. In addition our solution required only 1 extra instruction per word and allows for arbitrary, full-screen masking.
  4. You can't manipulate the sprite
    OK, I'll conceed this point. Once compiled it is very difficult to change the data being draw. But how often do you need to do this?
  5. Don't use them!!
    Do use them....where appropriate. Compiled sprites are a great solution for me and this project. The IIGS has a unique combination of a lot of memory, but a slow processor. in addition, memory accesses are relatively fast and there is no cache to worry about. On any modern PowerPC or x86 processor, especially one with a modern graphics card, compiled sprites are probably not a good solution.