Reliable synchronisation of cog PLLs

Background

Each cog in the Propeller microcontroller has a video generator, which is clocked from a user-configurable PLL.

Advanced video modes are typically generated by having two or more cogs work together in an interleaved fashion, taking turns preparing and transmitting video data. For this to work, it is important that the whole system is in sync.

Unfortunately, there is no proper documentation on how to achieve this, or on the video generator timing in general. The only guidance comes from code examples like the following one, from Chip Gracey's VGA 1280x1024 tile driver:

' Synchronize all cogs' video circuits so that waitvid's will be pixel-locked

        movi    frqa,#(pr / 5) << 1     'set pixel rate (VCO runs at 1x)
        mov     vscl,#1                 'set video shifter to reload on every pixel
        waitcnt cnt,d8_d4               'wait for sync count, add ~3ms - cogs locked!
        movi    ctra,#%00001_111        'enable PLLs now - NCOs locked!
        waitcnt cnt,#0                  'wait ~3ms for PLLs to stabilize - PLLs locked!
        mov     vscl,#100               'subsequent WAITVIDs will now be pixel-locked!

But the mechanism is still unclear. For instance, what is the significance of setting VSCL to one hundred at the final line, and how is that related to the comment? (VSCL is modified to a different value later on, before any pixels are emitted.)

Furthermore, as we shall see, this code doesn't even guarantee synchronisation. Occasionally, the two cogs will be off. A common workaround, used for instance in my turbulence demo, is to add some code that will detect if the synchronisation didn't work, and in that case restart the entire system.

In order to remove the need for such an ugly hack, and gain a better understanding of the video generator, I decided to approach this uncharted area systematically, with the aim of writing a 100% reliable cog synchronisation routine.

Limitations

The video generators are independent entities with their own clock sources. For my experiments, I'm using a single propeller chip (on a regular demo board) with a serial connection for downloading code and reporting measurements back to the host computer. All my measurements are taken by software running on the chip itself, by means of instructions executing within the system clock domain. It follows that I can never measure any time interval with better accuracy than ±0.5 system clock cycles. Since the PLL clocks have an undefined phase relationship to the system clock, this level of synchronisation corresponds to a jitter of a single cycle.

To illustrate, if the PLLs of cog 1 and cog 2 are running at the same frequency as the system clock, but cog 1 is slightly ahead and cog 2 is slightly late, they will appear to be one clock cycle apart, whereas the actual error is much smaller. If we try to compensate by delaying the video signal of cog 1 by one cycle, the measurements will line up, but we will have made the actual error larger.

In other words, if reason tells us that the video initialisation routine should be synchronised, but it appears to take 1000 clock cycles sometimes, and 1001 clock cycles sometimes, then the actual error is less than one clock cycle and we've reached our goal. For most video signals, this error is an order of magnitude smaller than the width of a pixel. It will only be a problem if you are attempting to synchronise sub-pixel information, such as the raw chroma part of a composite signal.
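
The effect can be illustrated with a small host-side sketch in Python. It relies on a simplified model that is my own assumption, not something measured on the chip: software notices an event from the PLL domain on the first system clock edge at or after it. The times and the helper name observed_cycle are made up for the illustration.

```python
import math

def observed_cycle(event_time):
    """System clock cycle at which software notices an event.

    Simplified model (an assumption): an event becomes visible on the
    first system clock edge at or after it.
    """
    return math.ceil(event_time)

# Two video generators whose true timing differs by only 0.1 cycles,
# but which fall on opposite sides of a system clock edge.
cog1 = 999.95    # slightly ahead
cog2 = 1000.05   # slightly late

true_error = cog2 - cog1
measured_error = observed_cycle(cog2) - observed_cycle(cog1)

# "Compensating" by delaying cog 1 by one full cycle lines up the
# measurements, but makes the true error larger.
cog1_delayed = cog1 + 1
error_after_fix = abs(cog2 - cog1_delayed)
measured_after_fix = observed_cycle(cog2) - observed_cycle(cog1_delayed)

print(f"true error {true_error:.2f}, measured {measured_error}")
print(f"after 'fix': true error {error_after_fix:.2f}, measured {measured_after_fix}")
```

In this model the measured difference is a whole cycle although the true difference is a tenth of a cycle, and the "fix" brings the measured difference to zero while the true error grows to 0.9 cycles.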

Method and framework

The following framework is written in plasma syntax, which is like regular propeller assembly language, but with constant pools. There is also an explicit minimal spin header, which bootstraps the main program into cog 0. The code should be readable even if you're not used to working in plasma.

Please take a minute to study the code and comments before we move on.

You can also download the code.

                        include "propeller.i"           ' Defines some register names
zero                    =       OUTB                    ' This register is always 0
SERIALTIME              =       70937904 / 38400        ' My system clock is 16 x 4.43 MHz

' ****** Cog 0 - Control **************************************************

                        org     0

                        jmp     #entry
                        byte    XTAL1 + PLL16X
                        byte    0                       ' checksum
                        word    @progbase               ' program base
                        word    @progend                ' variable base
                        word    @progend + 8            ' stack base
                        word    @spincode               ' spin entry point
                        word    @progend + 12           ' stack end
progbase
                        word    @progend - @progbase    ' object size
                        byte    2, 0
                        word    @spincode - @progbase   ' pointer to first method
                        word    0                       ' number of variables

spincode                byte    0x35, 0x35, 0x35, 0x2c

entry
                        ' Set up the serial port

                        mov     DIRA, =1 << 30
                        mov     OUTA, =1 << 30

                        ' Main (or Measurement) loop.
mloop
                        ' First, each cog runs a disturbance program, which holds
                        ' off for a random interval of time, and then configures
                        ' PLLA to run at a random frequency.

                        ' The idea is, that if the PLLs need a couple of
                        ' milliseconds to stabilize, as suggested by Chip
                        ' Gracey's comments, then their previous configuration
                        ' might linger across a coginit, and affect the
                        ' synchronisation code.

                        coginit ci_disturb1
                        coginit ci_disturb2
                        coginit ci_disturb3
                        coginit ci_disturb4
                        coginit ci_disturb5
                        coginit ci_disturb6
                        coginit ci_disturb7

                        ' Wait until all the disturbance programs have completed.

                        call    #awaitresult
                        call    #clearresult

                        ' All cogs should start at the same system clock value, to
                        ' eliminate hub synchronisation issues, which are outside
                        ' the scope of this investigation (and well documented).

                        ' Here, we prepare such a clock value and write it into
                        ' the cog program.

                        mov     temp, CNT
                        add     temp, =512 * 16 + 200
                        wrlong  temp, =@firstcnt

                        ' Launch the sync program in seven instances.

                        coginit ci_sync1
                        coginit ci_sync2
                        coginit ci_sync3
                        coginit ci_sync4
                        coginit ci_sync5
                        coginit ci_sync6
                        coginit ci_sync7

                        ' Wait for them to complete.

                        call    #awaitresult

                        ' Loop through the results, which are stored in an array
                        ' at "result", and report them as hexadecimal values
                        ' over the serial port. These numbers indicate how much
                        ' the system clock has advanced during the synchronisation,
                        ' and should be no further apart than 1.

                        mov     count, #7
                        mov     addr, #@result
:report
                        rdlong  temp, addr
                        add     addr, #4

                        mov     nibbles, #8
:nibble
                        rol     temp, #4
                        mov     char, temp
                        and     char, #15
                        cmp     char, #10               wc
                if_c    add     char, #$30
                if_nc   add     char, #$61 - 10
                        call    #putser
                        djnz    nibbles, #:nibble

                        mov     char, #$20
                        call    #putser
                        djnz    count, #:report

                        mov     char, #$0a
                        call    #putser

                        call    #clearresult

                        ' All done. Repeat the process indefinitely.

                        jmp     #mloop

                        ' A subroutine for holding off execution until
                        ' all values in the "result" array are non-zero.
awaitresult
                        mov     count, #7
                        mov     addr, #@result
:loop
                        rdlong  temp, addr
                        tjz     temp, #awaitresult
                        add     addr, #4
                        djnz    count, #:loop

awaitresult_ret         ret

                        ' A subroutine for clearing the "result" array.
clearresult
                        mov     count, #7
                        mov     addr, #@result
:loop
                        wrlong  zero, addr
                        add     addr, #4
                        djnz    count, #:loop

clearresult_ret         ret

                        ' This code transmits a byte over the serial interface.
putser
                        or      char, #$100
                        shl     char, #1
                        mov     bits, #10
                        mov     CNT, CNT
                        add     CNT, =SERIALTIME
:bits
                        shr     char, #1                wc
                        muxc    OUTA, =1 << 30
                        waitcnt CNT, =SERIALTIME
                        djnz    bits, #:bits

putser_ret              ret

                        ' Specification words for coginit.

ci_sync1                long    @sync           >> 2 << 4 |     (@result + 0)           >> 2 << 18 |    1
ci_sync2                long    @sync           >> 2 << 4 |     (@result + 4)           >> 2 << 18 |    2
ci_sync3                long    @sync           >> 2 << 4 |     (@result + 8)           >> 2 << 18 |    3
ci_sync4                long    @sync           >> 2 << 4 |     (@result + 12)          >> 2 << 18 |    4
ci_sync5                long    @sync           >> 2 << 4 |     (@result + 16)          >> 2 << 18 |    5
ci_sync6                long    @sync           >> 2 << 4 |     (@result + 20)          >> 2 << 18 |    6
ci_sync7                long    @sync           >> 2 << 4 |     (@result + 24)          >> 2 << 18 |    7

ci_disturb1             long    @disturb        >> 2 << 4 |     (@result + 0)           >> 2 << 18 |    1
ci_disturb2             long    @disturb        >> 2 << 4 |     (@result + 4)           >> 2 << 18 |    2
ci_disturb3             long    @disturb        >> 2 << 4 |     (@result + 8)           >> 2 << 18 |    3
ci_disturb4             long    @disturb        >> 2 << 4 |     (@result + 12)          >> 2 << 18 |    4
ci_disturb5             long    @disturb        >> 2 << 4 |     (@result + 16)          >> 2 << 18 |    5
ci_disturb6             long    @disturb        >> 2 << 4 |     (@result + 20)          >> 2 << 18 |    6
ci_disturb7             long    @disturb        >> 2 << 4 |     (@result + 24)          >> 2 << 18 |    7

                        pool

addr                    res     1
temp                    res     1
char                    res     1
count                   res     1
nibbles                 res     1
bits                    res     1

                        fit     496

' ****** Hub memory variables *********************************************

shiftreg                long    1
result                  long    0[7]

' ****** Cog 1..7 - Disturbance program ***********************************

disturb
                        org     0

                        ' Hold off for a random number of system clocks, then
                        ' configure the PLL for a random frequency. Note that
                        ' there are no hub operations between the delay and the
                        ' configuration, as that would introduce an unwanted
                        ' regularity.

                        call    #random
                        mov     freq, rand_out

                        call    #random
                        and     rand_out, =$7fffff              ' 0..120 ms
                        add     rand_out, #8                    ' avoid race condition
                        add     rand_out, CNT
                        waitcnt rand_out, #0

                        mov     FRQA, freq
                        mov     CTRA, =%0_00001_111_00000000_000000_000_000000

                        ' Signal to cog 0 that we are done.

                        wrlong  one, PAR

:done                   jmp     #:done                          ' Wait forever

                        ' Lock zero protects a global LFSR which is used to
                        ' generate a pseudo-random bit sequence.
random
:getlock                lockset zero                    wc
                if_c    jmp     #:getlock
                        rdlong  rand_bits, #@shiftreg

                        mov     rand_count, #32
:loop
                        test    rand_bits, =$a0000003   wc
                        rcr     rand_bits, #1           wc

                        rcl     rand_out, #1
                        djnz    rand_count, #:loop

                        wrlong  rand_bits, #@shiftreg
                        lockclr zero

random_ret              ret

one                     long    1

                        pool

rand_bits               res     1
rand_count              res     1
rand_out                res     1

freq                    res     1

                        fit     496

' ****** Cog 1..7 - Synchronisation code **********************************

sync
                        org     0

                        ' Make sure that all cogs start at exactly the same moment.

                        waitcnt firstcnt, #0

        ' ********************************************************
        ' This is where we add the synchronisation code under test
        ' ********************************************************

                        ' At this point, the video generators should be synchronised,
                        ' and we should be able to execute waitvid instructions
                        ' with predictable timing...

                        waitvid zero, #0
                        waitvid zero, #0
                        waitvid zero, #0
                        waitvid zero, #0
                        waitvid zero, #0

                        ' ...and always end up with a consistent value for the
                        ' system clock here, apart from single-cycle jitter.

                        mov     timediff, CNT
                        sub     timediff, firstcnt
                        wrlong  timediff, PAR

:done                   jmp     #:done                  ' Wait forever

firstcnt                long    0

                        pool

timediff                res     1

                        fit     496

progend

The purpose of testing this on seven cogs in parallel is to catch any glitches that might occur due to process variations when the PLLs were manufactured.

Now, let's plug in some synchronisation code and see what happens!

Approaching stability

Naïve solution

Let's begin with a naïve solution, where we expect things to just work:

                        ' Configure the video generator. I've chosen a regular VGA
                        ' mode, but it shouldn't make any difference.
                        mov     VCFG, =%0_01_0_00_000_00000000000_010_0_11111111

                        ' Set the PLL frequency. We will use a typical setup where
                        ' counter A runs at 1/16 of the system clock frequency, and
                        ' the PLL multiplies this by 16 to get back to the system
                        ' clock frequency again.
                        mov     FRQA, =$10000000

                        ' Start PLLA.
                        mov     CTRA, =%0_00001_111_00000000_000000_000_000000

                        ' Select a frame size. Pixel size doesn't affect timing.
                        mov     VSCL, =1234

Here is the output from a typical run:

00001471 000016a0 00001683 0000152d 0000168f 000016d8 0000145e
00001474 00001692 0000166f 0000150d 00001657 00001690 000018da
000017cc 0000150a 000014cf 00001837 00001499 000014ca 000016fc
000016b4 000018b1 00001863 000016e7 00001810 00001838 00001578
000014b6 0000169d 00001641 000014bf 000015ca 000015f2 000017f0
000018a0 000015ab 0000154d 00001887 000014b2 000014c8 000016a9
000016aa 0000186e 00001802 00001657 00001744 00001750 0000144c
000014ec 000016a5 0000162b 00001467 00001550 0000153e 000016fe
000014ef 000016a5 000015ff 00001437 00001516 000014f4 0000169f
000014f2 0000169d 000015e6 0000140a 000014d5 00001496 00001638
000014f4 000016ad 000015c2 000018ae 0000149b 0000144d 000015d4
000017c2 000014a7 0000188a 00001682 0000172d 000016d9 00001842
000015ec 0000178d 0000168e 0000146e 00001519 000014af 00001605
...

Clearly, the cogs are not synchronised.
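
To put a number on it, the report lines can be analysed on the host. The following is a minimal Python sketch, not part of the original tooling; the function names parse_report and spread are hypothetical. It computes the difference between the slowest and the fastest cog for the first two lines of the run above.

```python
def parse_report(line):
    """Convert one line of space-separated hex values into integers."""
    return [int(word, 16) for word in line.split()]

def spread(values):
    """Difference between the slowest and the fastest cog, in cycles."""
    return max(values) - min(values)

# First two lines of the run shown above.
run = [
    "00001471 000016a0 00001683 0000152d 0000168f 000016d8 0000145e",
    "00001474 00001692 0000166f 0000150d 00001657 00001690 000018da",
]

for line in run:
    print(spread(parse_report(line)))
```

A synchronised system would give a spread of at most 1 on every line. Here the spreads are 634 and 1126 system clock cycles.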

Waiting for the PLL to settle

Glancing at the example code given at the top of this page, we learn that we have to wait for about 3 ms until PLLA has reached a stable state.
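
As a sanity check on the delay constant: the test code waits for $30000 system clock cycles, which at the clock frequency given in the framework comes to a little under 3 ms. A quick host-side calculation in Python, where the constant names are mine:

```python
SYSTEM_CLOCK_HZ = 70_937_904   # from the framework: 16 x 4.43 MHz crystal
DELAY_CYCLES = 0x30000         # delay constant used in the test code

delay_ms = DELAY_CYCLES / SYSTEM_CLOCK_HZ * 1000
print(f"{DELAY_CYCLES} cycles = {delay_ms:.2f} ms")
```

On a board with a different system clock, the constant would have to be scaled to keep the same delay.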

We modify our synchronisation code to the following:

                        mov     VCFG, =%0_01_0_00_000_00000000000_010_0_11111111
                        mov     FRQA, =$10000000
                        mov     CTRA, =%0_00001_111_00000000_000000_000_000000

                        ' Wait 3 ms for PLLA to settle.
                        mov     CNT, CNT
                        add     CNT, =$30000
                        waitcnt CNT, #0

                        mov     VSCL, =1234

The average execution time is now 3 ms longer, but the amount of jitter remains high:

00031609 00031416 00031519 000315d6 0003166b 00031540 000318d1
0003140d 000316d4 000317db 00031887 00031434 000317c6 0003166d
00031705 000314e2 000315eb 00031680 000316e8 000315a0 0003142f
000316ad 00031478 0003156e 000315ed 0003164e 000314fb 0003183d
000315d2 00031856 0003146f 000314db 00031528 00031896 00031703
00031434 000316a4 0003178a 000317e3 00031815 000316a8 000314ff
000317c2 00031556 0003162d 00031670 00031695 0003151c 00031834
000316a4 0003142d 000314f6 0003151a 0003153a 00031883 000316a7
00031520 0003177f 0003182e 00031835 00031852 000316b8 000314c8
00031875 000315f7 00031696 00031689 00031691 000314eb 000317b6
000316b7 00031447 000314b2 0003149b 00031497 000317ab 00031592
00031600 00031841 000318b9 00031880 00031869 000316a5 0003146e
000318d8 00031634 0003169c 0003164c 0003162b 00031459 000316e2
...

Forcing the video generator to reload on every cycle

The reason is clear: Initially, VSCL is zero, which is interpreted as a frame size of 4096 PLL clocks according to the specification. If the PLL is completely unreliable when just started, this means that the video generator will be anywhere in this 4096-cycle window when we hit the first waitvid instruction, and so there will be a random delay.

To prevent this, we select a frame size of 1 before the 3 ms delay:

                        mov     VCFG, =%0_01_0_00_000_00000000000_010_0_11111111
                        mov     FRQA, =$10000000
                        mov     CTRA, =%0_00001_111_00000000_000000_000_000000

                        ' Force video generator to reload on every clock cycle
                        mov     VSCL, #1

                        ' Wait 3 ms for PLLA to settle.
                        mov     CNT, CNT
                        add     CNT, =$30000
                        waitcnt CNT, #0

                        mov     VSCL, =1234

The output is:

0003183c 0003136a 0003183c 0003136a 0003183c 0003183c 0003183c
0003136a 0003183c 0003136a 0003136a 0003183c 0003183c 0003136a
0003183c 0003183c 0003183c 0003183c 0003136a 0003136a 0003183c
0003183c 0003183c 0003136a 0003183c 0003183c 0003183c 0003136a
0003183c 0003136a 0003136a 0003136a 0003183c 0003183c 0003183c
0003136a 0003183c 0003183c 0003136a 0003183c 0003136a 0003183c
0003136a 0003183c 0003136a 0003183c 0003136a 0003183c 0003136a
0003136a 0003183c 0003183c 0003136a 0003136a 0003136a 0003183c
0003183c 0003136a 0003183c 0003183c 0003183c 0003183c 0003183c
0003136a 0003183c 0003183c 0003136a 0003183c 0003136a 0003183c
0003136a 0003136a 0003183c 0003183c 0003183c 0003183c 0003183c
0003183c 0003183c 0003136a 0003183c 0003136a 0003136a 0003136a
0003136a 0003183c 0003136a 0003136a 0003183c 0003183c 0003136a
...

If we let the code run for a long time, we can express the result as a histogram like this:

Occurrences  Execution time
          1  00030ea7
          1  00030eaa
          1  00030f29
          1  00031379
          1  000313bc
          1  000313fa
          1  000313fc
          2  0003137a
          4  0003136c
          4  000317fb
         23  00031369
         50  0003183b
      10800  0003136a
      16032  0003183c

We're beginning to see some regularity here, but something strange is going on. In nearly all cases, the execution time seems to be one of two values, exactly 1234 cycles apart. The longer execution time seems to occur with a 60% probability, and the shorter execution time with a 40% probability.
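
Both figures can be checked from the two dominant rows of the histogram, taken here as 10800 occurrences of $3136a and 16032 occurrences of $3183c. A small Python sketch with variable names of my own choosing:

```python
# The two dominant rows of the histogram: execution time -> occurrences.
dominant = {0x0003136a: 10800, 0x0003183c: 16032}

short_time, long_time = sorted(dominant)
gap = long_time - short_time

total = sum(dominant.values())
long_share = dominant[long_time] / total
short_share = dominant[short_time] / total

print(f"gap: {gap} cycles")
print(f"long: {long_share:.1%}, short: {short_share:.1%}")
```

The gap equals the frame size written to VSCL in the last instruction of the test code. The rare outliers are left out of the shares, so the percentages are approximate.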

Chip Gracey's code doesn't seem to have this problem. Let's try a frame size of 100, and see if that helps at all:

                        mov     VCFG, =%0_01_0_00_000_00000000000_010_0_11111111
                        mov     FRQA, =$10000000
                        mov     CTRA, =%0_00001_111_00000000_000000_000_000000

                        ' Force video generator to reload on every clock cycle
                        mov     VSCL, #1

                        ' Wait 3 ms for PLLA to settle.
                        mov     CNT, CNT
                        add     CNT, =$30000
                        waitcnt CNT, #0

                        mov     VSCL, =100

Output, as a histogram:

Occurrences  Execution time
          1  00030152
          1  000301d5
          1  000301d6
          1  0003114d
          2  00030171
          2  000301d1
          2  00030211
          3  00030191
          3  000301d2
         20  000301b1
         41  00030215
      10529  000301b2
      15784  00030216

Well, duh. The variance is now 100 instead of 1234. Why is this code completely unreliable, when the example code in the VGA tile driver is able to get it right nearly every time?

Offsetting the first frame

The last instruction in the test code is a mov VSCL, =1234 (or 100). This instruction latches a large frame size into VSCL during stage 5 of the instruction pipeline. Immediately afterwards, in the framework code, there is a waitvid instruction, which will either block (because the video generator has already fetched the new frame size, along with whatever pixel data that happened to be on the bus, and started to emit the large frame) or execute immediately (because the video generator was just about to fetch a new pixel and frame size, but hadn't gotten around to it yet).

For the observed phenomenon to occur, the waitvid instruction must take its block-or-no-block decision very close to stage 5 of the previous instruction. This would be during the clock cycle following the instruction fetch, i.e. before the pixels and colours are loaded from cog ram, and thus before the actual blocking takes place. This is somewhat strange, but it would actually explain the known, weird behaviour wherein a perfectly timed (i.e. non-blocking) waitvid seems to deliver the wrong pixels to the video generator.

The non-determinism comes from the fact that the signal from the video generator is asynchronous; PLLA will try to stay at a constant phase offset from the system clock, but will vary ever so slightly due to external factors such as radio interference. If the initial phase of PLLA had been completely uncorrelated to the phase of the system clock, the probability of blocking would have been 50%. The actual probabilities (40% vs 60%) indicate that the phase difference tends to be near a certain value, probably corresponding to a propagation delay inside the PLL.

So if the large variance happens because the VSCL assignment and the waitvid instruction are executing simultaneously due to pipelining, then we should be able to fix the problem by introducing a nop in between them.

                        mov     VCFG, =%0_01_0_00_000_00000000000_010_0_11111111
                        mov     FRQA, =$10000000
                        mov     CTRA, =%0_00001_111_00000000_000000_000_000000

                        ' Force video generator to reload on every clock cycle
                        mov     VSCL, #1

                        ' Wait 3 ms for PLLA to settle.
                        mov     CNT, CNT
                        add     CNT, =$30000
                        waitcnt CNT, #0

                        mov     VSCL, =100
                        nop                     ' Flush the pipeline

Output, as a histogram:

Occurrences  Execution time
          1  000301f2
          2  000301b5
          2  000301b6
          2  000301d2
          2  000311b1
          3  000301d6
          3  000301f6
          4  000301d5
          4  00030211
         20  000301f5
        161  00030215
      67080  00030216

Yes! We seem to have stabilised the system in 99.94% of the cases (disregarding the single-cycle jitter that is the only difference between the two final entries in the table). The crucial difference between Chip Gracey's code and our own is clear: We must avoid a non-blocking waitvid at all costs, and we do that by stepping into a large frame size, and then executing any instruction other than waitvid. I'm pretty confident that this hasn't been mentioned in any of the official documentation from Parallax.
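
The percentage follows directly from the histogram, if the two most common execution times are counted as one good case. A short Python check, with the rows transcribed from the table above:

```python
# Histogram from the run with the nop: execution time -> occurrences.
histogram = {
    0x000301f2: 1,
    0x000301b5: 2,
    0x000301b6: 2,
    0x000301d2: 2,
    0x000311b1: 2,
    0x000301d6: 3,
    0x000301f6: 3,
    0x000301d5: 4,
    0x00030211: 4,
    0x000301f5: 20,
    0x00030215: 161,
    0x00030216: 67080,
}

# The two most common values differ only by single-cycle jitter.
good = histogram[0x00030215] + histogram[0x00030216]
total = sum(histogram.values())
stable_percent = 100 * good / total

print(f"{good} of {total} runs stable ({stable_percent:.2f}%)")
```

That leaves 43 bad runs out of 67284, or roughly 0.06%.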

Ok, our code is now as good as that of Chip Gracey. But the behaviour is not perfect. Very rarely, we seem to be losing a whole bunch of clock cycles, and on two occasions 3995 extra cycles appeared out of nowhere. What's going on?

The VSCL race condition

Let's add a column to the histogram, expressing the error, i.e. the difference in clock cycles from the most common case.

Occurrences  Execution time  Error
          1  000301f2           36
          2  000301b5           97
          2  000301b6           96
          2  000301d2           68
          2  000311b1        -3995
          3  000301d6           64
          3  000301f6           32
          4  000301d5           65
          4  00030211            5
         20  000301f5           33
        161  00030215            1
      67080  00030216            0

A critical observation is that, although the errors are very rare, several of the bad values occur more than once. This suggests that there are a few distinct ways in which the error manifests itself.

As mentioned in the introduction, a single cycle error is to be expected from the measuring process. This means that we can polish our figures by adding or removing one from each row in the table, in order to get a clearer picture of the real error. For instance, the two most common measurements ($30215 and $30216) should probably be treated as a single case, with some single-cycle jitter on top.

The error is infrequent (0.06%), so it must be happening during a very brief moment. When you think about it, the only critical moment is during the switch-over from VSCL = 1 (before cog/video synchronisation) to VSCL = 100 (after cog/video synchronisation). Let's assume that the error occurs because the video generator is somehow getting the wrong value when reading VSCL.

We can extend the table with a column showing what value the video generator must have been seeing instead of 100, in order for the observed error to occur. The value is also listed in binary form. I've adjusted the error by manually adding or removing a single cycle where it seemed appropriate.

Occurrences  Execution time  Error  Adjusted error  Frame size  Frame size (binary)
          1  000301f2           36              36          64  1000000
          2  000301b5           97              96           4  0000100
          2  000301b6           96              96           4  0000100
          2  000301d2           68              68          32  0100000
          2  000311b1        -3995           -3996    0 (4096)  0000000
          3  000301d6           64              64          36  0100100
          3  000301f6           32              32          68  1000100
          4  000301d5           65              64          36  0100100
          4  00030211            5               4          96  1100000
         20  000301f5           33              32          68  1000100
        161  00030215            1               0         100  1100100
      67080  00030216            0               0         100  1100100

"When the light went on it nearly blinded me."

Can you see the pattern? When the VSCL register is written, it's being changed from 1 to 100. This means that bit 0 has to go from high to low, while bits 2, 5 and 6 have to go from low to high. The bits are changing at slightly different rates. If VSCL is sampled at precisely the critical moment, some of the bits have changed, and some of them have not.
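
The pattern can be checked mechanically. The Python sketch below takes the frame sizes inferred in the table (100 minus the adjusted error, with 4096 written as its register value 0) and tests that each one contains only bits that are on their way from low to high. The helper name only_rising_bits is hypothetical.

```python
OLD, NEW = 1, 100

rising = NEW & ~OLD     # bits that go from low to high
falling = OLD & ~NEW    # bits that go from high to low

# Frame sizes inferred from the adjusted errors in the table.
inferred = [64, 4, 32, 0, 36, 68, 96, 100]

def only_rising_bits(value):
    """True if every set bit in value is one of the rising bits."""
    return value & ~rising == 0

for size in inferred:
    print(f"{size:3d} = {size:07b}  {only_rising_bits(size)}")
```

Every inferred value passes the test. In this data set bit 0 has already gone low in all of them, so what varies is how many of bits 2, 5 and 6 have gone high at the moment of sampling.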

Suppose we pick a value that is more similar to 1 at the bit level?

Suppose we pick 65 (1000001)?

                        mov     VCFG, =%0_01_0_00_000_00000000000_010_0_11111111
                        mov     FRQA, =$10000000
                        mov     CTRA, =%0_00001_111_00000000_000000_000_000000

                        ' Force video generator to reload on every clock cycle
                        mov     VSCL, #1

                        ' Wait 3 ms for PLLA to settle.
                        mov     CNT, CNT
                        add     CNT, =$30000
                        waitcnt CNT, #0

                        mov     VSCL, =65       ' Any power of two, plus one
                        nop                     ' Flush the pipeline

By going from 1 to 65, we've reduced the number of ambiguous bits to one (bit 6). Therefore, the number of possible values that can be read out of VSCL is reduced to two: 1 and 65. These two cases are equivalent to the two scenarios we've already got, i.e. when the video generator triggers just before the critical moment and just after the critical moment. In other words, we've reduced the large error to a special case of the single-cycle error, which is below the threshold of what can be measured.
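
The argument can be made concrete by enumerating every value that could be sampled during a transition. This Python sketch assumes a simple model based on the explanation above: each differing bit flips independently, so any subset of them may already have changed. The function name possible_reads is hypothetical.

```python
from itertools import combinations

def possible_reads(old, new):
    """All values that could be sampled while a register changes from old to new.

    Model (an assumption): each differing bit flips independently, so any
    subset of the differing bits may already have changed.
    """
    changed = old ^ new
    bits = [1 << i for i in range(changed.bit_length()) if (changed >> i) & 1]
    values = set()
    for n in range(len(bits) + 1):
        for subset in combinations(bits, n):
            # The bits in subset are distinct, so their sum is a bit mask.
            values.add(old ^ sum(subset))
    return sorted(values)

print(len(possible_reads(1, 100)), "possible values when going from 1 to 100")
print(possible_reads(1, 65))
```

Going from 1 to 100 changes four bits and so allows 16 different values in this model, whereas going from 1 to 65 allows only the old value and the new one.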

The result:

Occurrences  Execution time
         90  00030166
      69728  00030167

Mission accomplished!

Other frequencies

I have verified that the synchronisation code works with the following different values of FRQA and PLLDIV.

The single cycle jitter behaves differently for the various frequencies, and sometimes disappears completely, but keep in mind that this does not translate into useful information about the phase, and should be regarded as measurement noise.

FRQA      PLLDIV
0f000000  111
0fff0000  111
10000000  111
12340000  111
19000000  111
1c000000  111
0fff0000  110
10000000  110
12340000  101
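
For reference, these settings can be translated into PLL output frequencies. The Python sketch below rests on my reading of the Propeller counter documentation and should be treated as an assumption: the NCO runs at FRQA / 2^32 times the system clock, the PLL multiplies that by 16, and PLLDIV then divides the result by 2^(7 - PLLDIV). The system clock is the one from the framework, and the function name pll_output_hz is hypothetical.

```python
SYSTEM_CLOCK_HZ = 70_937_904   # from the framework: 16 x 4.43 MHz crystal

def pll_output_hz(frqa, plldiv, sysclk_hz=SYSTEM_CLOCK_HZ):
    """PLL output frequency for a counter in PLL mode (assumed formula)."""
    nco_hz = sysclk_hz * frqa / 2**32   # rate at which the NCO overflows
    vco_hz = nco_hz * 16                # the PLL multiplies by 16
    return vco_hz / 2 ** (7 - plldiv)   # %111 divides by 1, %110 by 2, ...

# The FRQA / PLLDIV combinations listed in the table above.
tested = [
    (0x0F000000, 0b111),
    (0x0FFF0000, 0b111),
    (0x10000000, 0b111),
    (0x12340000, 0b111),
    (0x19000000, 0b111),
    (0x1C000000, 0b111),
    (0x0FFF0000, 0b110),
    (0x10000000, 0b110),
    (0x12340000, 0b101),
]

for frqa, plldiv in tested:
    mhz = pll_output_hz(frqa, plldiv) / 1e6
    print(f"FRQA=${frqa:08x} PLLDIV=%{plldiv:03b} -> {mhz:.3f} MHz")
```

With FRQA = $10000000 and PLLDIV = %111 the formula gives back the system clock frequency, which matches the description of the typical setup in the naïve solution.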

Optimising the code

Now that we understand the various sources of error, it's easy to rewrite the synchronisation code into a more compact form.

Lowering the delay to 1 ms seems to work fine, although your mileage may vary on this one, so it could be a good idea to keep it at 3 ms unless you need a really quick initialisation.

Here is the final code, as you might use it in a video driver. sync_cnt should be set to CNT + $2100 just before coginit is invoked.
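
The offset of $2100 can be compared with the margin used in the test framework, which adds 512 * 16 + 200 cycles to CNT. My interpretation, which is an assumption, is that 512 * 16 covers the time it takes to load a cog from hub memory, and the remainder is slack for the start-up itself. The constant names below are mine.

```python
COG_LONGS = 512        # longs copied from hub memory when a cog starts
CYCLES_PER_LONG = 16   # assumed: one hub access window per long

load_cycles = COG_LONGS * CYCLES_PER_LONG
framework_margin = load_cycles + 200   # value used in the test framework
final_margin = 0x2100                  # value recommended for sync_cnt

print(f"cog load time:    {load_cycles} cycles")
print(f"framework margin: {framework_margin} cycles ({framework_margin - load_cycles} spare)")
print(f"final margin:     {final_margin} cycles ({final_margin - load_cycles} spare)")
```

If several cogs are launched one after another, the last one starts loading a little later than the first, so the spare cycles have to cover that as well.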

I'm not claiming ownership of this routine (i.e. it's public domain), but I recommend leaving the URL in place anyway, as a form of documentation.

                        ' Synchronise cog PLLs. Here there be dragons.
                        ' http://www.linusakesson.net/programming/propeller/pllsync.php

                        waitcnt sync_cnt, =$30000       ' 3 ms
                        mov     VCFG, =%0_01_0_00_000_00000000000_010_0_11111111
                        movi    FRQA, #$10000000 >> 23
                        mov     VSCL, #1                ' Reload on every pixel
                        movi    CTRA, #%0_00001_111_00000000_000000_000_000000 >> 23
                        waitcnt sync_cnt, #0            ' Wait for PLL to settle
                        mov     VSCL, #65               ' Any power of two, plus one

                        ' The next instruction must not be a waitvid, and should
                        ' probably not be an assignment to VSCL either.

Posted Tuesday 2-Nov-2010 17:38

Discuss this page

Disclaimer: I am not responsible for what people (other than myself) write in the forums. Please report any abuse, such as insults, slander, spam and illegal material, and I will take appropriate actions. Don't feed the trolls.

Anonymous
Thu 4-Nov-2010 20:05
Excellent work!

Two "improvements"
1. Ensure the sync_cnt delay is large enough for the initial 4096 PLL clocks to expire so the initial VSCL=1 gets picked up before the VSCL=65 gets set. (I don't know if there's a way to do a waitcnt after the VSCL=1, then somehow get all of the cogs to agree on a new sync_cnt value).
2. When setting VSCL also set the pixel counter to 1 (i.e. VSCL = 1<<12 + 1 and VSCL = 1<<12 + 65) so the pixel and frame counters stay in sync.

Eric Ball
(I need to go back and tweak my multi-cog driver.)
lft
Linus Åkesson
Tue 9-Nov-2010 06:31
Thank you!

1. Yes, this can be done when the video signal clock is an integer multiple of the system clock (such as when FRQA is $10000000). Otherwise, it's important that CTRA is assigned simultaneously in all cogs, and then you need both of the waitcnts.

2. The pixel counter is always reset when a new frame begins, so this is not necessary. It may or may not make the code easier to understand.
Anonymous
Tue 11-Jan-2011 03:13
1. Ensure the sync_cnt delay is large enough for the initial 4096 PLL clocks to expire ...

FWIW, there are *no* initial 4K cycles to expire. The initial delay (after power up) is specific to the propeller chip and the chosen cog, simply because the frame counter isn't reset. Subsequent delays are based on what was in the frame counter when the cog was stopped or the video h/w was switched off manually. Note that the frame counter keeps its state across cogstop and even short periods of power loss.

http://forums.parallax.com/showthread.php?128148-Ideas-or-theory-on-odd-problem-related-to-spin-program-startup&p=963147&viewfull=1#post963147

kuroneko
lft
Linus Åkesson
Mon 17-Jan-2011 16:18
Very interesting! For the purpose of synchronisation it doesn't matter, as we only have to cover the worst case, which is still 0 (4096). But it's a neat little side effect that could be exploited for e.g. copy protection if one were so inclined.
Anonymous
Wed 6-Apr-2011 11:14
Linus,

The reason I did a "mov vscl,#100" while the video generator was reloading on each pixel clock was to hand it a vscl value, which would be grabbed immediately, which would occupy the video shifter until the user-chosen vscl values and waitvid instructions were ready to go in each cog. It just puts the initial important reload 100 pixels out, so there's time for each cog to set its own initial vscl value and get into its own waitvid before it's show-time. I agree there could be some synchronization problem on issuance of the vscl=100, but if that is synchronous across all cogs, all subsequent waitvids should be forever sync'd. It might have been my dumb luck that the FRQ values I was using never got me near a metastability point that could have caused the "mov vscl,#100" to not take across all cogs simultaneously.

Have you witnessed any synchronization problems with any of the multi-cog video drivers that I wrote? I haven't, myself, and I've never heard anyone else say so. I'm just curious.

I really liked that Turbulence demo you wrote. I hope you stay interested in the Propeller, as the next version has perspective-corrected texture mapping with RGB lighting and alpha-blending, all in 8:8:8 RGB. There's also a lot of math hardware for fast transcendentals. You would be able to make it do lots, I know. It has 75-ohm 9-bit DACs on every pin, too, so you get nice video. You can attach an SDRAM for big memory (APA graphics), as well.

Thanks.

Chip Gracey
Parallax, Inc.
lft
Linus Åkesson
Fri 1-Jul-2011 12:10
Hi Chip!

Well, I haven't been able to try your video drivers, since the propeller development software is Windows-only, which means I'm stuck with homemade tools. Writing an assembler was straightforward, but I'm not motivated to write a spin compiler.

However, I do get metastability problems when using your synchronisation technique verbatim, as documented on this page. It seems far-fetched that the surrounding code would somehow make it stable, but this is of course speculation either way.

Thank you! Yes, the next version of the Propeller is definitely interesting!
Anonymous
Wed 16-Aug-2017 20:34
Hello Linus!

Outstanding post. Really exposes a lot of esoteric under-the-hood behavior to us, especially with the issue of bits not all changing before being grabbed.

I'm in the process of developing a VGA driver a la NES-style graphics backend on the Propeller, and this is an outstanding start to a multi-cog video system.

Hopefully I can pick your brain down the road on some mundane aspects of the driver, if I can't find answers elsewhere.

Cheers!