Parallelogram

Parallelogram is a demo running on the Commodore One extender board, which contains an Altera Cyclone III FPGA and an SDRAM chip. The logic design was made from scratch, including a homebrew CPU, FM synth and blitter with pixel shader support. The demo won the wild compo at Revision 2012.

Download

lft_parallelogram_core (C-One core file and icon, zip archive, 331.9 kB)
lft_parallelogram (Presentation video with captured demo, avi, 517.1 MB)
Linus Akesson - Parallelogram (Soundtrack, MP3, 5.1 MB)

The demo also has a pouët page, of course.

Custom logic

The system is coded in Verilog and compiled using Altera's free toolset (Quartus Web edition). PLLs, multipliers and memory blocks are instantiated from within Quartus using so-called megafunctions, but the rest of the project consists of plain Verilog files edited with Vim. I used gtkwave to simulate parts of the system when things didn't work, and sometimes that was very helpful.

The overall architecture is illustrated in the presentation video around the 1 minute mark: The CPU is in control of execution, and accesses the external memory through a 16 KB cache. Since I have no control over the initial contents of the SDRAM chip, the demo must be stored somewhere on the FPGA. I opted for a solution where the cache is preloaded with the demo binary at boot, marked as dirty. As other memory gets accessed, the demo gets written "back" into the SDRAM. This limits the demo to 16 KB.
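
The boot trick can be illustrated with a toy model. The sketch below is in Python rather than Verilog, and the class name, the four-word cache size and the one-word line size are all assumptions made for illustration; the real cache is 16 KB and its line size is not stated here.

```python
class ToyCache:
    """Direct-mapped write-back cache with one word per line (assumed)."""

    def __init__(self, image, sdram):
        self.size = len(image)
        self.sdram = sdram
        # Preload: line i holds address i and is marked dirty, so that
        # evicting it writes the word out to the uninitialised SDRAM.
        self.tag = [0] * self.size
        self.data = list(image)
        self.dirty = [True] * self.size

    def _line(self, addr):
        index, tag = addr % self.size, addr // self.size
        if self.tag[index] != tag:
            if self.dirty[index]:
                # Write the old contents "back" before reusing the line.
                old_addr = self.tag[index] * self.size + index
                self.sdram[old_addr] = self.data[index]
            self.tag[index] = tag
            self.data[index] = self.sdram[addr]
            self.dirty[index] = False
        return index

    def read(self, addr):
        return self.data[self._line(addr)]

    def write(self, addr, value):
        index = self._line(addr)
        self.data[index] = value
        self.dirty[index] = True


sdram = [None] * 16                    # unknown contents at power-on
cache = ToyCache([10, 11, 12, 13], sdram)
cache.write(6, 99)                     # address 6 shares a line with address 2
```

After the write to address 6, the preloaded word at address 2 has been pushed out to `sdram[2]`, while the words that were never evicted still exist only in the cache.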

Memory

The SDRAM has a 16-bit bus width, and this property permeates the entire design. Pixels are stored as a0rrrr0gggg0bbbb, where the a bit is a generic alpha bit that can be used freely by software. It conveniently coincides with the sign bit. The point of having zeroes between the fields is that it simplifies saturated addition of colours.
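
To see why the zero bits help, here is a minimal Python sketch of one way to add two pixels with saturation. How the hardware actually does it is not described in this text, so the function names and the exact bit trick are assumptions. The idea is that a carry out of a 4-bit field lands in the zero bit above it instead of spilling into the next colour.

```python
FIELDS = 0x3def   # 0 0 1111 0 1111 0 1111: the r, g and b fields
GUARDS = 0x4210   # the zero bits directly above each field

def pack(r, g, b):
    """Build a pixel from three 4-bit components (alpha left at 0)."""
    return (r << 10) | (g << 5) | b

def add_saturated(p, q):
    """Add two pixels per component, clamping each component at 15."""
    total = (p & FIELDS) + (q & FIELDS)     # carries end up in the guard bits
    overflow = total & GUARDS               # one flag per overflowed field
    saturate = overflow - (overflow >> 4)   # each flag becomes 1111 in its field
    return (total | saturate) & FIELDS      # clamp, then clear the guard bits

result = add_saturated(pack(12, 3, 8), pack(9, 4, 8))   # red and blue overflow
```

This sketch ignores the alpha bit; a real implementation would have to decide what to do with it.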

There's an embarrassing error in the text at the beginning of the demo, where it says that only 128 KB of external memory is used. In fact, the system uses 2 MB (1 megaword) of the SDRAM, which requires 20 address bits, but the CPU only has direct access to the first 128 KB because addresses are stored in 16-bit registers. Memory is treated as a rectangular grid of words, 2048 rows by 512 columns. The blitter uses row/column addressing, and has access to the entire 2 MB. Frame buffers are 320 by 240 pixels, and are stored as sub-rectangles occupying columns 0 through 319.
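
The relationship between the two addressing schemes can be sketched in a few lines of Python. The row-major mapping below is inferred from the memory map (where row $010 corresponds to CPU address $2000), and the function names are made up for this example.

```python
ROWS, COLS = 2048, 512

def word_address(row, col):
    """Linear word address of a grid position (row-major, assumed)."""
    return row * COLS + col

def row_col(addr):
    """Grid position of a linear word address."""
    return divmod(addr, COLS)

cpu_rows = (1 << 16) // COLS   # rows reachable through a 16-bit word address
```

With these numbers, a 16-bit address covers 128 rows (rows $000 through $07f), which is the 128 KB directly visible to the CPU, and the last word of the grid has address 2^20 - 1.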

Memory map

(Feel free to skip ahead if you're not interested in this much detail...)

char in map = 8x16 pixels (words)

C = cpu memory with preloaded contents
c = unpacked executable
f = upper half is 64-character 8x8 font
s =
        $70 sine table
        $71 freq table
        $72 channel data
        $73 synth register copy
        $74 constant random table
        $75 raster bar table
        $7e stack
        $7f stack
1 = video frame buffer 1
2 = video frame buffer 2
3 = video frame buffer 3
w = workspace frame buffer (for post fx)
d, u, v = free memory for effect data
        e.g. smoke (density, x-vel, y-vel), front and back, 256x242
m = 32x32 texture map
e = echo buffer
. = kept zero at all times

 0                                       320     384          511  Row  CPU Address
 ----------------------------------------------------------------
|CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC| 000  0000
|                                                                | 010  2000
|cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc| 020  4000
|cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc| 030  6000
|cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc| 040  8000
|                                                                | 050  a000
|ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff| 060  c000
|ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss| 070  e000
|                                                                | 080
|                                                                | 090
|                                                            mmmm| 0a0
|                                                            mmmm| 0b0
|                                                                | 0c0
|                                                                | 0d0
|                                                                | 0e0
|.........................................                      .| 0f0
|1111111111111111111111111111111111111111.       eeeeeeee       .| 100
|1111111111111111111111111111111111111111.       eeeeeeee       .| 110
|1111111111111111111111111111111111111111.       eeeeeeee       .| 120
|1111111111111111111111111111111111111111.       eeeeeeee       .| 130
|1111111111111111111111111111111111111111.       eeeeeeee       .| 140
|1111111111111111111111111111111111111111.       eeeeeeee       .| 150
|1111111111111111111111111111111111111111.       eeeeeeee       .| 160
|1111111111111111111111111111111111111111.       eeeeeeee       .| 170
|1111111111111111111111111111111111111111.       eeeeeeee       .| 180
|1111111111111111111111111111111111111111.       eeeeeeee       .| 190
|1111111111111111111111111111111111111111.       eeeeeeee       .| 1a0
|1111111111111111111111111111111111111111.       eeeeeeee       .| 1b0
|1111111111111111111111111111111111111111.       eeeeeeee       .| 1c0
|1111111111111111111111111111111111111111.       eeeeeeee       .| 1d0
|.........................................       eeeeeeee       .| 1e0
|.........................................       eeeeeeee       .| 1f0
|2222222222222222222222222222222222222222.       eeeeeeee       .| 200
|2222222222222222222222222222222222222222.       eeeeeeee       .| 210
|2222222222222222222222222222222222222222.       eeeeeeee       .| 220
|2222222222222222222222222222222222222222.       eeeeeeee       .| 230
|2222222222222222222222222222222222222222.       eeeeeeee       .| 240
|2222222222222222222222222222222222222222.       eeeeeeee       .| 250
|2222222222222222222222222222222222222222.       eeeeeeee       .| 260
|2222222222222222222222222222222222222222.       eeeeeeee       .| 270
|2222222222222222222222222222222222222222.       eeeeeeee       .| 280
|2222222222222222222222222222222222222222.       eeeeeeee       .| 290
|2222222222222222222222222222222222222222.       eeeeeeee       .| 2a0
|2222222222222222222222222222222222222222.       eeeeeeee       .| 2b0
|2222222222222222222222222222222222222222.       eeeeeeee       .| 2c0
|2222222222222222222222222222222222222222.       eeeeeeee       .| 2d0
|.........................................       eeeeeeee       .| 2e0
|.........................................       eeeeeeee       .| 2f0
|3333333333333333333333333333333333333333.       eeeeeeee       .| 300
|3333333333333333333333333333333333333333.       eeeeeeee       .| 310
|3333333333333333333333333333333333333333.       eeeeeeee       .| 320
|3333333333333333333333333333333333333333.       eeeeeeee       .| 330
|3333333333333333333333333333333333333333.       eeeeeeee       .| 340
|3333333333333333333333333333333333333333.       eeeeeeee       .| 350
|3333333333333333333333333333333333333333.       eeeeeeee       .| 360
|3333333333333333333333333333333333333333.       eeeeeeee       .| 370
|3333333333333333333333333333333333333333.       eeeeeeee       .| 380
|3333333333333333333333333333333333333333.       eeeeeeee       .| 390
|3333333333333333333333333333333333333333.       eeeeeeee       .| 3a0
|3333333333333333333333333333333333333333.       eeeeeeee       .| 3b0
|3333333333333333333333333333333333333333.       eeeeeeee       .| 3c0
|3333333333333333333333333333333333333333.       eeeeeeee       .| 3d0
|.........................................       eeeeeeee       .| 3e0
|.........................................       eeeeeeee       .| 3f0
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 400
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 410
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 420
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 430
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 440
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 450
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 460
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 470
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 480
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 490
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 4a0
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 4b0
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 4c0
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 4d0
|.........................................       eeeeeeee       .| 4e0
|                                                eeeeeeee        | 4f0
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 500
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 510
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 520
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 530
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 540
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 550
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 560
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 570
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 580
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 590
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 5a0
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 5b0
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 5c0
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 5d0
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 5e0
|                                                                | 5f0
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 600
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 610
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 620
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 630
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 640
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 650
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 660
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 670
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 680
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 690
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 6a0
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 6b0
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 6c0
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 6d0
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 6e0
|                                                                | 6f0
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 700
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 710
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 720
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 730
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 740
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 750
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 760
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 770
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 780
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 790
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 7a0
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 7b0
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 7c0
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 7d0
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 7e0
|                                                                | 7f0
 ----------------------------------------------------------------

The cache is direct-mapped, which means that memory addresses where the low bits are identical will compete for the same cache entry. By placing data (e.g. textures) in columns 320 through 511, it will remain in the cache even when the frame buffer is accessed.
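
A quick way to check this is to compute the cache entries touched by a frame buffer and by the texture map. The Python sketch below assumes one-word cache lines and an index taken from the low 13 bits of the word address; neither detail is confirmed here, but the argument holds for any line size that divides 64 words.

```python
COLS = 512
CACHE_WORDS = 16 * 1024 // 2   # 16 KB of 16-bit words

def cache_index(row, col):
    """Cache entry used by a given memory word (assumed mapping)."""
    return (row * COLS + col) % CACHE_WORDS

# Frame buffer 1: 240 rows starting at row $100, columns 0 through 319.
framebuffer = {cache_index(r, c)
               for r in range(0x100, 0x100 + 240) for c in range(320)}

# Texture map: 32x32 words at rows $0a0-$0bf, columns 480 through 511.
texture = {cache_index(r, c)
           for r in range(0xa0, 0xc0) for c in range(480, 512)}

conflict_free = framebuffer.isdisjoint(texture)
```

Since the cache holds a whole number of rows (16), the column of an address survives in its cache index, so data in columns 320 and up can never collide with frame buffer pixels. Two rows that are 16 apart do collide when they use the same column.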

VGA

The VGA generator consists of a frontend and a backend. The frontend reads pixels directly from SDRAM and writes them to a FIFO. Since each rasterline is stored in a single SDRAM row, the entire rasterline can be read in one burst. Between the lines, the frontend backs off so other parts of the system can access the memory.

The backend runs in a separate clock domain. At vertical blanking, it sends an asynchronous signal back to the frontend to trigger a new frame, and then it reads 320*240 pixels from the FIFO. Each row is stored in a buffer and emitted twice, since the VGA signal has 480 rows.

The address of the frame buffer is CPU-controlled, and Parallelogram uses triple buffering.

CPU

The CPU was written from scratch. I considered using an existing design, but it was more fun to do it myself, and I was able to take advantage of the added flexibility. For instance, at one point the demo was slightly larger than 16 KB, but I could fix this by adding some new instructions and a new addressing mode in order to make the code compress better.

The CPU is not particularly fast, because most of the work is done by the pixel shaders. Hence, it is implemented without pipelining. There are eight general purpose 16-bit registers. Other registers include a program counter, a stack pointer, a 32-bit product register (accessed as a high and a low half) and status bits (zero and carry). These are accessed using special instructions.

Starting at address 0, there are three vector instructions, which are typically relative jumps: Boot, UART and timer. The boot instruction is executed at boot. The UART instruction is executed (after pushing the program counter) whenever a byte appears on the debug UART; this was used to load new code into the running system during development. The timer instruction gets executed (after pushing the program counter) every 10 ms, and controls music playback.

This is what the instruction set looks like:

Instructions

Move immediate high (d <- c * 32)
00 ccc ddd ccccc ccc    movih   d, c

Arithmetic/Logic
01 000 ddd 00000 sss    add     d, s
01 000 ddd 1cccc-ccc    addi    d, c
01 001 ddd 00000 sss    adc     d, s
01 001 ddd 1cccc-ccc    adci    d, c
01 010 ddd 00000 sss    sub     d, s
01 010 ddd 1cccc-ccc    subi    d, c
01 011 ddd 00000 sss    and     d, s
01 011 ddd 1cccc-ccc    andi    d, c
01 100 ddd 00000 sss    or      d, s
01 100 ddd 1cccc-ccc    ori     d, c
01 101 ddd 00000 sss    xor     d, s
01 101 ddd 1cccc-ccc    xori    d, c
01 110 ddd 00000 sss    cmp     d, s
01 110 ddd 1cccc-ccc    cmpi    d, c
01 111 ddd 00000 sss    mov     d, s
01 111 ddd 1cccc-ccc    movi    d, c

Branch (o = signed offset relative pc)
10 0 0001 oooooo-ooo    bgt     label
10 0 0011 oooooo-ooo    bne     label
10 0 0101 oooooo-ooo    bcc,bge label
10 0 1010 oooooo-ooo    bcs,blt label
10 0 1100 oooooo-ooo    beq     label
10 0 1110 oooooo-ooo    ble     label
10 0 1111 oooooo-ooo    bal     label

Subroutine call
10 1 0001 oooooo-ooo    cgt     label
10 1 0011 oooooo-ooo    cne     label
10 1 0101 oooooo-ooo    ccc,cge label
10 1 1010 oooooo-ooo    ccs,clt label
10 1 1100 oooooo-ooo    ceq     label
10 1 1110 oooooo-ooo    cle     label
10 1 1111 oooooo-ooo    cal     label

Memory
11 000 ddd ooooo sss    ld      d, s+o
11 001 ddd ooooo sss    st      s+o, d

I/O
11 010 ddd 00ppp 000    in      d, p
11 011 ddd 00ppp 000    out     p, d

Vector jump/call (e = entry in global vector table)
11 100 000 0 eeeeeee    jv      e
11 101 000 0 eeeeeee    cv      e

Load effective address (o = unsigned offset relative pc)
11 101 ddd 1 ooooooo    lea     d, label

Miscellaneous
11 111 ddd 00000 000    push    d
11 111 ddd 00001 000    pop     d
11 111 000 00010 000    nop
11 111 ddd 00011 sss    mul     d, s    Store result in special product register
11 111 ddd 00100 000    stsp    d       Store d into stack pointer
11 111 ddd 00101 000    prod    d, s    Store s:d in product register
11 111 ddd 00110 000    jr      d       Jump to address in register
11 111 ddd 00111 000    cr      d       Call address in register
11 111 000 01000 000    ret
11 111 ddd 01001 000    wait    d       Wait for status bit (blitter done, vblank...)
11 111 ddd 01010 000    send    d       Transmit on debug UART
11 111 ddd 01011 000    ldsf    d       Load d from status flags
11 111 ddd 01100 000    stsf    d       Store d into status flags
11 111 ddd 01101 000    initv   d       Set global vector table address
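
As an illustration of how to read the table, here is a small Python sketch that encodes the register-to-register arithmetic/logic instructions. It assumes that the leftmost bit in each row is bit 15 of the instruction word; the function and table names are invented for this example and are not taken from the actual assembler.

```python
ALU_OPS = {"add": 0, "adc": 1, "sub": 2, "and": 3,
           "or": 4, "xor": 5, "cmp": 6, "mov": 7}

def encode_alu(op, d, s):
    """Encode 'op d, s' using the pattern 01 ooo ddd 00000 sss."""
    if not (0 <= d < 8 and 0 <= s < 8):
        raise ValueError("register numbers must be in the range 0..7")
    return (0b01 << 14) | (ALU_OPS[op] << 11) | (d << 8) | s

word = encode_alu("mov", 0, 2)   # mov r0, r2
```

Under that assumption, `mov r0, r2` becomes the word $7802.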

Input ports

000 product, low half
001 product, high half
010 status flags (blitter done, vblank...)
011 uart receive buffer
100 frame counter (global time)
101 benchmark timer

Output ports

000 blitter row
001 blitter column
010 blitter width
011 blitter height + start
100 blitter program
101 active video page [1..3]
110 synth register select
111 synth register data

And here is some example code, which implements signed multiplication — the CPU only provides unsigned multiplication.

muls
                ; r2 * r3 -> r1:r0
                ; clobbers product register

                mul     r2, r3
                in      r1, 1

                mov     r0, r2
                add     r0, r0
                bcc     .muls_1
                sub     r1, r3
.muls_1
                mov     r0, r3
                add     r0, r0
                bcc     .muls_2
                sub     r1, r2
.muls_2
                in      r0, 0
                ret
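
The correction works because reinterpreting a 16-bit operand as negative subtracts 2^16 from it, which subtracts the other operand from the high word of the product. The following Python model of the routine is a sketch for checking that reasoning, not part of the demo; the helper `signed` is introduced here for the comparison.

```python
MASK16 = 0xffff

def muls(r2, r3):
    """Model of the muls routine: returns (r1, r0), the high and low words."""
    product = r2 * r3                  # mul r2, r3: unsigned 16x16 -> 32 bits
    r1 = product >> 16                 # in r1, 1: high half of the product
    if r2 & 0x8000:                    # add r0, r0 carries when bit 15 of r2 is set
        r1 = (r1 - r3) & MASK16        # sub r1, r3
    if r3 & 0x8000:                    # same test for r3
        r1 = (r1 - r2) & MASK16        # sub r1, r2
    r0 = product & MASK16              # in r0, 0: low half needs no correction
    return r1, r0

def signed(value, bits):
    """Interpret an unsigned bit pattern as a two's complement number."""
    return value - (1 << bits) if value >> (bits - 1) else value

r1, r0 = muls(0xfffe, 0x0003)          # -2 * 3
result = signed((r1 << 16) | r0, 32)
```

Here `result` is -6, with r1 = $ffff and r0 = $fffa.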

The demo is written in assembly language, so I obviously had to write my own assembler. It's quite limited — for instance, values must be either numeric constants or labels — but it was sufficient for my purposes. Shader code, which will be described presently, is inlined with the rest of the code and handled by the same assembler.

First shader running.

Blitter

The blitter is a coprocessor that executes a small shader program for each pixel in a sub-rectangle of memory. The work is distributed across ten identical shader cores, thus exploiting the parallel nature of the FPGA.

First, the CPU writes the address of some shader code into output register 4. This instructs the blitter to start copying the shader from main memory into local RAM blocks within each of the ten shader cores. The first word contains the size of the shader, and is followed by that many longwords (in little endian order) of shader instructions and data. Then, for any number of rectangles, the CPU loads the row, column, width and height into output registers 0 through 3, where the final write to register 3 starts the blitter operation. Before each operation, the CPU must ensure that the blitter has completed the previous job, by waiting on a status bit.
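
The sequence above can be sketched as follows, in Python for readability. The port numbers come from the output port table, and the memory layout follows the description of the shader image; the function names and the tuple representation of CPU operations are hypothetical.

```python
PORT_ROW, PORT_COL, PORT_WIDTH, PORT_HEIGHT_START, PORT_PROGRAM = range(5)

def pack_shader(longwords):
    """Lay out a shader in 16-bit memory: size first, then each
    longword as two words in little endian order (low word first)."""
    words = [len(longwords)]
    for lw in longwords:
        words += [lw & 0xffff, (lw >> 16) & 0xffff]
    return words

def blit_commands(shader_addr, rects):
    """List the CPU operations needed to run one shader over some rectangles."""
    cmds = [("out", PORT_PROGRAM, shader_addr)]      # start loading the shader
    for row, col, width, height in rects:
        cmds.append(("wait", "blitter done"))        # previous job must be finished
        cmds.append(("out", PORT_ROW, row))
        cmds.append(("out", PORT_COL, col))
        cmds.append(("out", PORT_WIDTH, width))
        cmds.append(("out", PORT_HEIGHT_START, height))  # this write starts the blit
    return cmds

image = pack_shader([0x00a00000, 0x12345678])
cmds = blit_commands(0x4000, [(0x100, 0, 320, 240)])
```

The address $4000 and the rectangle are arbitrary example values.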

The shader cores deal with 32-bit words (longwords). Each core has a 256-word memory, where execution starts at address 0. The instruction set has a DSP-like flavour, because each instruction consists of several sub-instructions that are executed simultaneously. There are eight 32-bit registers, which are treated as 16.16 fixed-point numbers. Unlike the CPU registers, these are not general purpose. Registers r0 through r3 receive the results of simple ALU operations (add, xor etc.), r4 and r5 can be used to hold values (and are primed with the current x and y coordinates within the blitting rectangle), r6 contains the result of the latest multiplication and r7 contains the result of the latest shader RAM access. Of these, registers r0 through r5 keep their values unless explicitly modified by an instruction, whereas r6 and r7 are volatile and get clobbered unless you use them immediately after assigning them. In other words, registers r6 and r7 are written on every clock cycle, regardless of whether the shader assembly code contains an instruction describing what to put into them.
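
For readers unfamiliar with the 16.16 format, here is a short Python sketch of how such numbers convert to and from ordinary values, and what an adjusted product looks like. The rounding behaviour of the hardware multiplier is not specified here, so the truncating shift is an assumption, as are the function names.

```python
ONE = 1 << 16
MASK32 = 0xffffffff

def to_fix(x):
    """Convert a number to a 32-bit 16.16 bit pattern."""
    return int(round(x * ONE)) & MASK32

def from_fix(v):
    """Convert a 32-bit 16.16 bit pattern to a float."""
    if v & 0x80000000:
        v -= 1 << 32
    return v / ONE

def fix_mul(a, b):
    """Signed 16.16 product: multiply, then shift right by 16 bits."""
    sa = a - (1 << 32) if a & 0x80000000 else a
    sb = b - (1 << 32) if b & 0x80000000 else b
    return ((sa * sb) >> 16) & MASK32

half_negated = from_fix(0xffff8000)              # -0.5
product = fix_mul(to_fix(1.5), to_fix(-2.0))     # bit pattern for -3.0
```

In this format, $00a00000 is 160.0 and $ffff8000 is -0.5.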

Here's the shader instruction set:

Instructions come in two varieties:

: aop rd, ra, rb : mv rd, rs : mul ra, rb : ld ..., ...

1aaaaaaa aaaapppp ppccccrr rrrrrrrr

a = alu op,
        000 dd aaa bbb          register d becomes a & b
        001 dd aaa bbb          register d becomes a + b
        010 dd aaa bbb          register d becomes a - b
        011 dd aaa bbb          register d becomes a | b
        100 dd aaa bbb          register d becomes a ^ b
        101 dd aaa bbb          register d becomes a min b
        110 dd aaa bbb          register d becomes a max b
        111 dd aaa bbb          register d is read from global ram at
                                  col, row according to registers a, b

p = product op,
        aaa bbb                 register 6 becomes signed fixed-point
                                  adjusted product of registers a, b

c = copy op,
        0 sss                   register 4 is read from register s
        1 sss                   register 5 is read from register s

r = ram op,
        0 aaaaaasss             register 7 is read from shader ram at
                                  aaaaaa00 + floor(register s)
        10 aaaaaaaa             register 7 is read from shader ram at a
        11 dddaaaaa             register 7 is trashed; register d is
                                  written to shader ram at 110aaaaa

: aop rd, ra, rb : endp rr : jsr xyz

0aaaaaaa aaaa---- ---sssss ssssssss

a = alu op, same as before

s = special op,
        00000 --------          no operation
        00001 --------          terminate with no pixel
        00010 -----rrr          terminate with pixel according to register r
        00100 --------          store sign bits of all registers into rSign
        00101 --sssttt          r7 <- (rx[t] & 0xffff) ^ (rSign[sss]? 0 : 0xffff)
        00110 iiiijjjj          add signed integer i to r4 and j to r5
        10aaa aaaaarrr          jump to a if r >= 0
        11aaa aaaaarrr          jump to a if r < 0

Execution uses alternating fetch/execute cycles, where the
execute part may be stalled when global ram is accessed.

00000000 00000000 00000000 00000000 is a nop instruction.

Here's an example shader for visualising the Julia set:

sh_julia
                shader  .end

                :ld     r7, .xmid
                :sub    r0, r4, r7      :ld     r7, .ymid
                :sub    r1, r5, r7      :ld     r7, .scale
                :mul    r6, r0, r7      :ld     r7, .scale
                :mov    r0, r6          :mul    r6, r1, r7      :st     $d8, r4
                :mov    r1, r6          :ld     r7, .initcount
                :mov    r3, r7          :mul    r6, r0, r0      :st     $d9, r5
                :mov    r4, r6          :mul    r6, r1, r1
                :mov    r5, r6
.loop
                ; square z

                :mul    r6, r0, r1
                :add    r1, r6, r6
                :sub    r0, r4, r5      :ld     r7, .c_re

                ; add c

                :add    r0, r0, r7      :ld     r7, .c_im
                :add    r1, r1, r7      :mul    r6, r0, r0

                ; determine length

                :mov    r4, r6          :mul    r6, r1, r1
                :mov    r5, r6          :add    r2, r4, r6      :ld     r7, .limit
                :sub    r2, r2, r7      :ld     r7, .step
                :sub    r3, r3, r7      :jpos   r2, .break

                :jpos   r3, .loop
.break
                :ld     r7, .topcount
                :sub    r1, r3, r7      :ldd    r7, .palette, r3
                :mov    r1, r7          :jpos   r1, .bg
                :emit   r1
.bg
                :skip

.xmid           long    $00a00000
.ymid           long    $00780000
.c_re           long    $fffff000
.c_im           long    $ffff8000
.scale          long    $00000300
.initcount      long    $00100000
                shalign                 ; aligns to 4-longword address, for ldd instruction
.topcount       long    $000f0000
.step           long    $00010000
.limit          long    $00040000
                long    #000            ; the '#' encodes a colour into a longword
.palette
                long    #000
                long    #100
                long    #211
                long    #322
                long    #433
                long    #544
                long    #655
                long    #766
                long    #877
                long    #988
                long    #a99
                long    #baa
                long    #988
                long    #766
                long    #544
.end
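
For comparison, this is roughly what the shader computes, written as a floating-point Python sketch. It reflects one reading of the listing above and is not bit-exact with the 16.16 arithmetic; the constants are the decoded values of the longwords at the end of the shader, and the function names are invented.

```python
XMID, YMID = 160.0, 120.0            # $00a00000, $00780000
C_RE, C_IM = -0.0625, -0.5           # $fffff000, $ffff8000
SCALE = 0.01171875                   # $00000300
INITCOUNT, TOPCOUNT, STEP, LIMIT = 16, 15, 1, 4.0

PALETTE = ["000", "100", "211", "322", "433", "544", "655", "766",
           "877", "988", "a99", "baa", "988", "766", "544"]
BEFORE_PALETTE = "000"               # the longword just before .palette

def julia_count(x, y):
    """Iterate z <- z*z + c and return the remaining counter value."""
    zr, zi = (x - XMID) * SCALE, (y - YMID) * SCALE
    count = INITCOUNT
    while True:
        zr, zi = zr * zr - zi * zi + C_RE, 2 * zr * zi + C_IM
        count -= STEP
        if zr * zr + zi * zi - LIMIT >= 0:   # jpos r2, .break
            break
        if count < 0:                        # jpos r3, .loop not taken
            break
    return count

def pixel(x, y):
    """Colour for one pixel, or None when the shader skips the pixel."""
    count = julia_count(x, y)
    if count - TOPCOUNT >= 0:                # jpos r1, .bg
        return None
    return BEFORE_PALETTE if count < 0 else PALETTE[count]
```

Points that escape during the first iteration are skipped, leaving the background untouched. Points that never escape end up with a counter of -1, which indexes the extra black entry placed just before the palette.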

A shader produces a single word of output, which gets stored at the predetermined memory position for which the shader was executed. Alternatively, the shader may choose to terminate itself without writing to memory. Writing is done to the external SDRAM directly, bypassing the cache, because in most situations the blitter will be constructing a frame buffer that will be consumed by the VGA generator (which also accesses the SDRAM directly), so there's no need to pollute the cache. However, when reading main memory, the blitter uses the cache, because many pixel computations typically depend on the same data, such as textures and the sine table. Sometimes (as in the shadebob effect), a shader depends on data written by earlier blits. In these situations, the CPU must invalidate the cache in between the blitter operations, in order to make the output from earlier blits visible.

Synthesiser

The final part of the logic design is a 16-channel, 4-op FM synthesiser with resonant low-pass filters on each channel, and a global echo facility. Each channel is independently controlled using 32 hardware registers, arranged as follows:

00      osc 0 frequency, low word
01      osc 0 frequency, high word
02      osc 0 gain
03      filter cutoff
04      osc 1 frequency, low word
05      osc 1 frequency, high word
06      osc 1 gain
07      filter resonance
08      osc 2 frequency, low word
09      osc 2 frequency, high word
0a      osc 2 gain
0b      left fader
0c      osc 3 frequency, low word
0d      osc 3 frequency, high word
0e      osc 3 gain
0f      right fader
10      osc 0 amount of modulation from osc 0
11      osc 0 amount of modulation from osc 1
12      osc 0 amount of modulation from osc 2
13      osc 0 amount of modulation from osc 3
14      osc 1 amount of modulation from osc 0
15      osc 1 amount of modulation from osc 1
16      osc 1 amount of modulation from osc 2
17      osc 1 amount of modulation from osc 3
18      osc 2 amount of modulation from osc 0
19      osc 2 amount of modulation from osc 1
1a      osc 2 amount of modulation from osc 2
1b      osc 2 amount of modulation from osc 3
1c      osc 3 amount of modulation from osc 0
1d      osc 3 amount of modulation from osc 1
1e      osc 3 amount of modulation from osc 2
1f      osc 3 amount of modulation from osc 3

Each operator is based on a sine oscillator which is phase modulated by a weighted sum of the (previous) output of each of the four operators. When an operator modulates itself, the result is noise. The filter then receives a weighted sum of the operators as input, and produces a mono output signal, which is panned and attenuated by two faders (left and right) to produce a stereo mix.
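
The operator structure can be sketched in floating-point Python as follows. The units (phase and modulation measured in cycles), the function name and the parameter layout are assumptions; the hardware works with fixed-point register values, and the filter and faders are left out here.

```python
import math

def fm_step(phase, prev_out, freq, mod, gains, sample_rate=44100.0):
    """Advance the four operators of one channel by one sample.

    mod[i][j] is the amount by which operator j modulates operator i.
    Returns (new_phase, new_out, mix), where mix is the weighted sum
    of operator outputs that would be fed to the filter.
    """
    new_phase, new_out = [], []
    for i in range(4):
        p = (phase[i] + freq[i] / sample_rate) % 1.0
        # Phase modulation uses the outputs from the previous sample.
        pm = sum(mod[i][j] * prev_out[j] for j in range(4))
        new_phase.append(p)
        new_out.append(math.sin(2 * math.pi * (p + pm)))
    mix = sum(g * o for g, o in zip(gains, new_out))
    return new_phase, new_out, mix

no_mod = [[0.0] * 4 for _ in range(4)]
phase, out, mix = fm_step([0.0] * 4, [0.0] * 4,
                          [11025.0, 0.0, 0.0, 0.0], no_mod,
                          [1.0, 0.0, 0.0, 0.0])
```

Because each operator reads the previous outputs of all four operators, including its own, the feedback paths need no special handling.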

Channels 5 through 15 are connected to the echo buffer. This, as well as the interrupt rate and hence the tempo of the song, is hardcoded in the logic design, because there was no need to make it CPU-controllable for the Parallelogram soundtrack. The echo facility has a small input FIFO and a small output FIFO, but the bulk of the echo buffer is stored in main memory, which is accessed by stalling the CPU just before it's about to fetch an instruction. The left and right parts of the echo output are flipped and mixed into the final sound signal, as well as fed back into the echo buffer.

The synthesiser, as described above, is only concerned with what goes on at sample rate (44.1 kHz). The CPU then modifies these parameters at control rate (100 Hz), in order to implement e.g. envelopes for the operator modulation parameters. This playroutine also updates some global variables reflecting the song position, the current bass drum level and so on, which are then accessed by the visual effects.

C-One hooked up to a UART via an opto isolator.

Toolchain

Apart from the assembler mentioned above, I wrote a tracker which could emulate the FM synthesiser. This allowed me to compose the music interactively on my regular computer. Another tool converts the music data into binary data that can be accessed by the demo, specifically by the playroutine executing in the timer interrupt.

The assembled demo is compressed by a custom packer, and prepended with decompression code. This becomes the demo binary, and is used as initial RAM contents when compiling the FPGA core. However, during development, I didn't want to recompile the logic design for every little change in the demo software. After all, recompiling all the Verilog code and mapping it to the FPGA takes approximately 40 minutes (with ten shader cores and the highest optimisation settings). Hence, I placed a little bootloader in the UART interrupt, and wrote a communication tool to send a demo binary over a serial cable into the chip. The C-One (somewhat surprisingly) does not have a serial port, so I just attached some wires to the mdb bus which is accessible from the extender board.

Finally, to get a nice video capture, I designed a communication protocol for transmitting compressed video frames from within the FPGA over the UART to the computer, where they get uncompressed and stored as pnm files. First I ran the demo in realtime, transmitting the current system time whenever a frame was generated. This gave me a log of which frames were actually present: it wouldn't be honest to present a video capture with a higher frame rate than the actual hardware, and besides some of the effects are stateful and depend on the timing of earlier frames. The demo was then restarted in a non-realtime mode, where the host requests frames (using the log) and the demo computes all effects according to the communicated timestamps rather than the system clock.

Demo code

The demo itself is organised in a pretty straightforward manner. As mentioned, the first thing that happens is that the code is decompressed. Then, the synthesiser is initialised and the screen displays a solid blue framebuffer for a couple of seconds, to allow the monitor to synchronise. Then, the timer interrupt is enabled, starting music playback. A main loop reads out the current song position and advances along a script, where the different parts of the demo are described using code pointers (each script entry consists of a song position, a setup routine, and a per-frame routine).
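
A script-driven main loop of this kind might look roughly like the Python sketch below. The function name `run_script` and the representation of the script as a list of tuples are assumptions for illustration; the real demo uses code pointers in its own assembly language.

```python
def run_script(script, song_positions):
    """Drive the demo parts from the song position.

    script: list of (start_position, setup, per_frame), sorted by start_position.
    song_positions: the song position read at each frame (one entry per frame).
    """
    index = -1  # no part is active yet
    for pos in song_positions:
        # Advance along the script while the song has reached the next entry.
        while index + 1 < len(script) and pos >= script[index + 1][0]:
            index += 1
            script[index][1]()    # run the setup routine once
        if index >= 0:
            script[index][2](pos)  # run the per-frame routine
```

For example, a two-part script can be described with a few closures:

```python
trace = []
script = [
    (0, lambda: trace.append("setup A"), lambda p: trace.append(f"A{p}")),
    (4, lambda: trace.append("setup B"), lambda p: trace.append(f"B{p}")),
]
run_script(script, [0, 1, 4, 5])
```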

Most effects calculate some per-frame parameters in the CPU, store the resulting values right into a shader, load the shader into the blitter, then blit. There are utility routines for common functionality, such as invalidating the cache or computing A*sin(B*t+C) where t is the global time.
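
As an example of such a utility routine, here is one way A*sin(B*t+C) could be computed with integer arithmetic. The 256-entry table, the phase unit of 1/256 of a turn and the 7-bit scaling are assumptions chosen for the sketch; the text does not say how the demo implements it.

```python
import math

TABLE_SIZE = 256  # hypothetical: one full turn is 256 phase steps
SINE_TABLE = [round(127 * math.sin(2 * math.pi * i / TABLE_SIZE))
              for i in range(TABLE_SIZE)]


def osc(a, b, c, t):
    """Approximate a*sin(b*t + c) using integers only.

    b*t + c is a phase in units of 1/256 turn; the table holds sine values
    scaled by 127, so the product is shifted right by 7 bits to scale it back.
    """
    phase = (b * t + c) & (TABLE_SIZE - 1)  # wrap around the table
    return (a * SINE_TABLE[phase]) >> 7
```

Because the table peaks at 127 rather than 128, the result slightly undershoots the requested amplitude, which is usually acceptable for animating effect parameters.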

Standalone extender board

Since the demo runs entirely on the extender board, the C-One mainboard isn't necessary. To make the demo platform a bit more portable, I made my own mainboard replacement. It contains a microcontroller for reading the core image off an SD card and transmitting it to the FPGA at power-on, and it has a bunch of discrete components doing digital-to-analogue conversion of the audio and video signals.

However, the demo is fully C-One compatible, meaning that if you own a C-One you can simply drop the core file into your machine and run it.

Final words

This project was quite a ride, as it basically involved learning Verilog, FPGAs and hardware design. I did have some contact with FPGAs during my engineering education, but in those courses we would just modify existing VHDL code, and all the tricky parts had already been taken care of. Hardware bugs are quite different from software bugs, and it was very frustrating and rewarding to learn about all the gotchas the hard way. Looking back it has been very enjoyable. Hopefully this will also inspire other people to learn new skills and to build cool things!

Posted Wednesday 11-Apr-2012 22:03

Discuss this page

Disclaimer: I am not responsible for what people (other than myself) write in the forums. Please report any abuse, such as insults, slander, spam and illegal material, and I will take appropriate actions. Don't feed the trolls.

Anonymous
ons 11-apr-2012 23:31

niiccee.
do you plan to release some more information regarding the bitbuf?
I want to build one myself.

Anonymous
tor 12-apr-2012 03:27

This is a little off topic, but could you write about the Symbolics keyboard you have?

Anonymous
fre 13-apr-2012 02:15

Awesome stuff!

Anonymous
fre 13-apr-2012 16:41

Super stuff!

/trc_wm

Anonymous
fre 13-apr-2012 16:48

Is your cache write-through or write-back?

Anonymous
fre 13-apr-2012 16:58

Loved it man, congratz!1

Anonymous
lör 14-apr-2012 01:20

Very nice. Also good to see some new c-one content ;-) Now have to dig up my C-one board.

Anonymous
lör 14-apr-2012 13:57

Hello Linus. Can you please make PCB board all in one (compatible with that board using C-One daughter board). And make simple FM music computer. And release FM Tracker as freeware or shareware.
Whole Adlibtracker2 (OPL3) community will appreciate it.
Tinctu@Gmail.Com

Anonymous
lör 14-apr-2012 14:14

very nice stuff saw it live on Rev but came here to get the soundtrack as always

//bittin

utzig
Fabio Utzig
sön 15-apr-2012 03:08

Everytime I watch this it just blows my mind! I left a lot of questions on the comments section of the youtube video which you pretty much answered all here.

One thing which is not very clear is why you chose Verilog and not VHDL. You said you used VHDL (somewhat) at university so it seems to be a reasonable choice for me. I personally quite like VHDL and never used Verilog.

Also a very general question: how have you approached learning Verilog? Did you use books, sites, irc, whatever? Which ones?

I'm from Brazil and differently from Europe, especially Germany/Sweden, there's no demoscene (groups) here. I'll say it's even hard to find people who ever heard about it at all. I personally had an Amiga in the early 90s and that's the reason that at least I know about it. One thing which always impressed me is how cool this effects are and I have no idea how you learn to program them. I remember there used to be a site hornet.org or something like that has lots of tutorials. Can you please give some points to specific learning resources?

Best regards,
Fabio Utzig

Anonymous
sön 15-apr-2012 10:31

Chipmusic.Org Parallelogram topic here:
http://chipmusic.org/forums/post/100503/#p100503

Anonymous
sön 15-apr-2012 19:48

Could you have used --update_mif in Quartus to update your RAM contents instead of recompiling the whole project?

Anonymous
ons 18-apr-2012 17:02

Awesome stuff bud. /Alfatech

Anonymous
fre 11-maj-2012 14:54

QUOTE: "However, during development, I didn't want to recompile the logic design for every little change in the demo software. After all, recompiling all the Verilog code and mapping it to the FPGA takes approximately 40 minutes (with ten shader cores and the highest optimisation settings). Hence, I placed a little bootloader in the UART interrupt, and wrote a communication tool to send a demo binary over a serial cable into the chip."

Xilinx have "data2mem" for exactly this reason, but Altera is (was?) lacking in this regard.

Have you tried the following:
quartus_cdb --update_mif

More at:
http://dbaspot.com/arch/385565-modify-pof-new-esb-rom-content-print.html

Anonymous
fre 11-maj-2012 15:00

Oh, and the Symbolics keyboard - be sure to join http://deskthority.net if you haven't already.

Anonymous
mån 21-maj-2012 00:29

This is awesome! One question: For how long were you working on this project?

Anonymous
tis 19-jun-2012 06:17

Very impressed. Thanks for posting this.

Anonymous
sön 4-nov-2012 10:47

There's only one word to describe you. Genius.

Anonymous
fre 9-nov-2012 17:34

We want the tracker and the OST in original format! Please... *w*

MP3 render is awful and full of random clicks and noises.

gbraad
Gerard Braad
mån 24-dec-2012 05:29

would love to have the schematics for the mini-board and source files. I think they can be helpful for the C-one and Turbo Chameleon community.

Anonymous
mån 24-dec-2012 08:47

utzig wrote: "One thing which is not very clear is why you chose Verilog and not VHDL."

Verilog is easier and quicker to get results...

Anonymous
ons 31-jul-2013 01:14

sparking awesome.
I am "curious" from poland, and i've hit your page when searching for simple
trackers (for making music) for atmega/uC class chips. what i've found here is well. sparking awesome :)

I had some contact with demoscene in my life, though real life dragged me far away from it since (coded few flaky zx spectrum demos) . now i mainly use computers as tools and working full time maintaining large injection molding machines which make frames for LCD tv's in korean factory. It's nice people like you keep the scene alive !
Parallel computing was always my interest, there are many aspects in theories... while you just stunningly made use of all recent tools to lightly create damn good piece of software and hardware - not limiting your creativity to libraries and cpu sets.
That's the spirit of FPGA design and you are fully into it!

what comes to my mind after seeing all that is another project i barely have time to follow - navit. It's cross-platform gps software based on OSM.
While it's nice and becomes more and more user-friendly, don't expect it to be anything than piece of frustrating annoyer along cityscape when used on hw. like google glass.
perhaps you could use few spare cycles of your brain to push it outside box of large C+ hog which barely reaches 1fps?
personally i visualise someone brutally ripping it off along with pieces of linux code and port it raw into some two-board hardware using fpga as blitter/3d accel and some arm cpu as a base, utilising bunch of nasty tricks like hardware acceleration of common functions (like decompression and parsing of OSM data) via FPGA 'coprocessor' and ofcourse video functions.

then i imagine it pushing the experience to higher level, with compass and accelerometers being able to drive heads-up display and update in decent FPS allowing augmenting of reality with navigation data like on most freaky sci-fi movies ;)

greetings and keep the spirit !

pik33
tor 15-aug-2013 07:56

Making a CPU or all SOC from scratch is the best way to learn.. so I went the same patch.
I have to learn all FPGA stuff to do my scientific word - make a neural network and/or signal filter with it. I have (university property) Altera DE2-115 board, and I started from making a VGA text mode driver (160x50 chars, 8x16 font ripped from Atari ST, attributes a'la CGA). Now it works and I started programming my own 16-bit CPU. It has 8 registers, short instruction set (only 32 instructions coded on 5 bits) and if_c and if_z style conditional execution inspired by the Propeller.

VHDL vs Verilog? I started from VHDL, but now I am a Verilog enthusiast. It is simpler. Much simpler than VHDL. And, like C, the programmer is always right. I have control, not the language.

------------

Can you release a source code of your demo or port it to DE2-115 board?

Anonymous
tis 15-nov-2016 18:44

Would love to see the tune in original format with the tracker program.

Anonymous
tis 16-okt-2018 12:28

It saddens me you still haven't released the music in its original format...

Anonymous
sön 23-maj-2021 18:38