In my plipbox project, a fairly fast 8-bit AVR MCU running at 16 MHz was connected to the Amiga’s parallel port to transfer incoming and outgoing IP packets from/to the attached Ethernet controller. A protocol on the parallel port was devised to quickly transmit the bytes in both directions. In version 0.6 a data rate of up to 240 KB/s was achieved… The question now is: is this the top speed we can get, or is the parallel port capable of more?
This blog post shows the results of the experiments I performed with the parallel port on my Amiga. It tries to show the different classes of transfers possible on this port and gives the achievable maximum speed of each class.
Since the available documents and data sheets all lack an exact description of the I/O part on the peripheral side of the device, this blog post is also an effort to document this undocumented side of the parallel port (or: “What you always wanted to know about your CIA 8520 and never dared to ask”).
1.1 The CIA 8520 and the Parallel Port
The Amiga has two custom chips of type CIA 8520 (Complex Interface Adapter) that are called CIA A and CIA B. A CIA chip has two I/O ports (Port A, Port B) with 8 bits each that can be individually configured for peripheral input or output.
The parallel port consists of three kinds of pins:
- 8 Data Pins (In or Out), Pin 2-9
- 1 Strobe Line (Pin 1), 1 Ack Line (Pin 10): Hardware Handshake
- 3 Control Lines (BUSY, POUT, SELECT) (Pin 11, 12, 13)
Those pins are connected to the two CIAs as follows:
- CIA A, Port B: 8 Data Pins
- CIA A, PC and FLAG pins: Strobe and Ack for Hardware Handshake
- CIA B, Port A, Bits 0,1,2: BUSY, POUT, SELECT
While CIA A Port B handles the data pins, CIA B Port A handles the 3 control lines. Note that the other bits of this port are connected to serial port lines.
In the Amiga memory map both CIAs are mapped to different memory ranges. Here is an excerpt with the registers useful for parallel port programming. (See the Amiga Hardware Reference Manual, Appendix F for a complete list)
Address  Name  Default  Description
BFE101   prb   0xff     Parallel port
BFE301   ddrb  0x00     Direction for port B (BFE101); 1 = output (can be in or out)
BFD000   pra   0xff     /DTR /RTS /CD /CTS /DSR SEL POUT BUSY
BFD200   ddra  0xc0     Direction for Port A (BFD000); 1 = output (set to 0xFF)
The data direction register (DDR) of each port sets every bit of the port either to input or to output. The logic of the data pins is not inverted, i.e. a 1 in a register is a high (5V) value on the line.
The default values show the setup after the Amiga has booted: all parallel port pins are set to input.
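To make the DDR values in the table easier to read, here is a small Python sketch that decodes a DDR byte into line names. It assumes the `pra` description lists the bits from bit 7 (/DTR) down to bit 0 (BUSY), which matches the bit assignment of BUSY, POUT and SELECT given above. The function names are made up for this illustration.

```python
# Bit names of CIA B Port A (pra, $BFD000), bit 7 first, as listed in the table.
# ASSUMPTION: the table lists the bits MSB first, so BUSY is bit 0.
PRA_BITS = ["/DTR", "/RTS", "/CD", "/CTS", "/DSR", "SEL", "POUT", "BUSY"]

def output_lines(ddr_value, names=PRA_BITS):
    """Names of all lines a DDR value configures as outputs (bit = 1)."""
    return [name for i, name in enumerate(names) if ddr_value & (0x80 >> i)]

def input_lines(ddr_value, names=PRA_BITS):
    """Names of all lines a DDR value configures as inputs (bit = 0)."""
    return [name for i, name in enumerate(names) if not ddr_value & (0x80 >> i)]

# Default ddra after boot is 0xc0 according to the table.
print("outputs:", output_lines(0xC0))
print("inputs: ", input_lines(0xC0))
```

With the default of 0xc0 only the two topmost serial lines are outputs, while BUSY, POUT and SEL are inputs.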
1.2 The CIAs in the Amiga system
Compared to the MC680xx CPU of an Amiga, the CIA chip is a fairly slow device. It can handle a clock rate of up to 1 or 2 MHz while the CPU runs at 7 or more MHz. The MC68000 CPU architecture offers a special access mode for such devices that is based on a slower clock called the E clock. It runs at 1/10th of the CPU clock speed.
Let’s see some numbers:
- CPU Clock F_CPU = 7.16 MHz (NTSC), 7.09 MHz (PAL)
- E Clock F_ECLK = F_CPU / 10 = 716 kHz (NTSC), 709 kHz (PAL)
- E cycle length t_ECLK = 1.40 us (NTSC), 1.41 us (PAL)
This means the CPU accesses the CIA with at most the speed of F_ECLK. An access is a read or write to a register. So when we transfer data we either read or write the data register of CIA A Port B. If we only access this register, the top speed we can ever achieve on this port is one byte per E cycle, or 716/709 KB/s max!
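The numbers above can be re-derived with a few lines of Python. This is only a sketch: it starts from the rounded CPU clocks quoted in the text and counts 1 KB as 1000 bytes. The function names are invented for this example.

```python
# CPU clocks as quoted in the text (rounded); everything else is derived.
F_CPU_HZ = {"NTSC": 7.16e6, "PAL": 7.09e6}

def e_clock_hz(f_cpu_hz):
    """The E clock runs at 1/10th of the CPU clock."""
    return f_cpu_hz / 10

def e_cycle_us(f_cpu_hz):
    """Length of one E cycle in microseconds."""
    return 1e6 / e_clock_hz(f_cpu_hz)

def max_rate_kb_s(f_cpu_hz):
    """One CIA register access (one byte) per E cycle; 1 KB = 1000 bytes here."""
    return e_clock_hz(f_cpu_hz) / 1000

for system, f_cpu in F_CPU_HZ.items():
    print(f"{system}: F_ECLK = {e_clock_hz(f_cpu) / 1000:.0f} kHz, "
          f"t_ECLK = {e_cycle_us(f_cpu):.2f} us, "
          f"max {max_rate_kb_s(f_cpu):.0f} KB/s")
```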
If you look in the data sheet of the CIA’s ancestor device, the MOS 6526, you will see that the E clock interval is divided into two sections: a HI and a LO range. In the HI range (4/10 of t_ECLK) the CPU accesses the device; in the LO range (6/10 of t_ECLK) the device starts to realize the change set by the CPU, i.e. if a port is set to output it will drive the pins low or high accordingly. On a read the data has to be stable on the port before the CPU accesses it in the HI phase.
Here are the numbers (naming according to the 6526 data sheet):
- t_CHW (Clock High Width) = 4 / 10 * t_ECLK = 560 ns (NTSC), 564 ns (PAL)
- t_CLW (Clock Low Width) = 6 / 10 * t_ECLK = 840 ns (NTSC), 846 ns (PAL)
Some interesting limits of the 6526 chip:
- t_PD (Output Delay on Write): max 1 us
- t_PS (Port Setup Time on Read): min 300 ns
With a t_PD of up to 1 us, setting up the port pins may take almost the whole E cycle of 1.4 us, so it can overlap the next HI range for CPU access.
        <- t_ECLK ->
        t_CHW
         ____        ____
      --|    |______|    |______|
              t_CLW
    CPU |------->
        Write t_PD
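The overlap can be put into numbers with a short Python sketch. It uses the rounded E cycle lengths from above and the two 6526 limits quoted in the list. One assumption is made loudly here: the output delay t_PD is counted from the end of the HI range (the moment the CPU access finishes). The data sheet is the authority on the exact reference edge.

```python
T_ECLK_NS = {"NTSC": 1400, "PAL": 1410}  # rounded E cycle lengths from the text
T_PD_MAX_NS = 1000                       # 6526: output delay on write, max

def phase_widths_ns(t_eclk_ns):
    """Return (t_CHW, t_CLW): 4/10 and 6/10 of the E cycle."""
    return t_eclk_ns * 4 // 10, t_eclk_ns * 6 // 10

def spill_into_next_hi_ns(t_eclk_ns):
    """How far a worst-case t_PD reaches into the next HI range.

    ASSUMPTION: t_PD starts at the end of the HI range of the write.
    """
    _, t_clw = phase_widths_ns(t_eclk_ns)
    return max(0, T_PD_MAX_NS - t_clw)

for system, t_eclk in T_ECLK_NS.items():
    t_chw, t_clw = phase_widths_ns(t_eclk)
    print(f"{system}: t_CHW = {t_chw} ns, t_CLW = {t_clw} ns, "
          f"worst-case spill into next HI = {spill_into_next_hi_ns(t_eclk)} ns")
```

Under that assumption a slow port change needs the LO range plus roughly 160 ns of the following HI range, which is why the time sheets below mark two sub cycles as "realizing" the value.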
1.3 Hardware Handshake with Strobe
The parallel port offers two pins for hardware handshaking called Strobe and Ack. The hardware handshake signals the external peripheral whenever new data has been set (or read!) on the external port of the CIA. After data is valid the CIA sends a short low-active pulse on Strobe to signal the receiver. The receiver then reads the data byte and acknowledges the transfer by pulling Ack low. The Amiga detects the Ack pulse either by polling or by interrupt and then transmits the next byte.
While strobing (i.e. generating the Strobe pulse) after a Port B read/write happens automatically, you have to manually trigger Ack to confirm it.
Let’s look at a time sheet with some E clock cycles (.H and .L being the high and low range of the cycle):
ECycle  CPU               CIA Port B  Strobe
0.H     Write Port B=42   -           H
0.L     -                 42!         H
1.H     -                 42!         H
1.L     -                 42          L
2.H     -                 42          L
2.L     -                 42          H
This example writes a byte with value 42 to Port B. The CIA realizes this value in the next two sub cycles (denoted with !), and beginning with 1.L a stable value of 42 is available on the output of port B. Then strobe goes low for a full E cycle length.
We see that strobe has to be delayed otherwise a peer reading on falling edge of strobe won’t have a stable data signal.
The interesting questions that now arise are:
- What is the strobe delay in cycles of the 8520 CIA on the Amiga?
- What is the strobe width in cycles?
- How fast can we transfer data and still get valid strobes?
The answer to the first one can be found in the Amiga Hardware Reference Manual (Appendix F, Section Handshaking):
PC will go low on the third cycle after a port B access.
But the other ones are unanswered in the docs. So it’s time for some experiments…
2. My Experiments
My setup is an Amiga 500 with an ACA500 and an ACA1230/33 accelerator attached. A plipbox device running version 0.6 firmware was attached unless otherwise stated.
2.1 Setup Port
Using ASMone I quickly hacked some code to set the parallel port to data output and all lines to low/zero:
    lea     $bfe101,a0      ; parallel port data
    lea     $bfe301,a1      ; parallel port ddr
    move.b  #$ff,(a1)       ; all bits to output
    move.b  #$00,(a0)       ; set all lines to low/zero
2.2 Writing a byte
With the port set up, let’s conduct the first experiment: write a $ff byte to the parallel port and capture the lines with a logic analyzer. The code:
    lea     $bfe101,a0      ; parallel port data
    move.b  #$ff,d0
    move.b  d0,(a0)         ; write $ff to the port
The scope triggered on falling edge of strobe:
The first interesting fact we see here is the strobe width: it’s 2.813 us or 2 * t_ECLK!
So the Strobe width of the CIA 8520 is (in contrast to 6526’s 1E) 2 E long! t_SW = 2 E
Let’s repeat the write. Now write a $00 to a port that has been initialized with $ff:
Notable difference here is the point in time when the port signal changes:
- LO->HI: late at end of cycle
- HI->LO: early at the beginning of the cycle
Note: the markers are aligned to the beginning of strobe (falling edge) in 1 E steps (i.e. 1.4 us)
If we assume that strobe starts with the LO range of the E cycle then the markers and the begin of strobe denote the HI->LO transition inside an E cycle.
If we compare these lines with the typical E cycle diagram of a data sheet then they denote the center of the cycles and not the borders!
                  ^ visible marker              ^ strobe falling edge
                  |                             |
    |<-- t_CHW -->|<--- t_CLW --->|<-- t_CHW -->|<--- t_CLW --->|...
    |------- E cycle 0 -----------|------- E cycle 1 -----------|
                  | H>L       L>H |
                  | signal changes
With this shift in mind we can conclude that the actual CPU write of this byte happened just to the left of the first marker in the lower image (i.e. in t_CHW).
Let’s write down the strobe sequence in a time sheet, first for the $00 write:
EClock  CPU  PortB  Strobe  Annotation
0.H     w00  ff     H
0.L     -    00!    H       realizing 00 on port
1.H     -    00*    H       already 00 stable on port
1.L     -    00     H       \ safety range
2.H     -    00     H       /
2.L     -    00     L       strobe begin
3.H     -    00     L       \ strobe width: 2 E cycles
3.L     -    00     L       /
4.H     -    00     L       strobe end
4.L     -    00     H
and the $ff write:
EClock  CPU  PortB  Strobe  Annotation
0.H     wff  00     H
0.L     -    ff!    H       realizing ff on port
1.H     -    ff!    H       needs this range, too
1.L     -    ff     H
2.H     -    ff     H
2.L     -    ff     L
3.H     -    ff     L
3.L     -    ff     L
4.H     -    ff     L
4.L     -    ff     H
We can see the strobe starting in the third cycle as stated in the docs. It keeps a safety range of one E cycle after setting up the values before beginning the strobe.
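To put the safety range into microseconds, the sub cycle labels of the time sheets can be converted into times. The following Python sketch does this in tenths of an E cycle (HI range = 4/10, LO range = 6/10). It is an idealized reading of the two time sheets above, not a measurement. The helper names are made up for this example.

```python
T_ECLK_NS = 1400  # rounded NTSC E cycle length from the text

def start_tenths(sub_cycle):
    """Start of a sub cycle such as '2.L' or '1.H', in tenths of an E cycle."""
    cycle, phase = sub_cycle.split(".")
    return int(cycle) * 10 + (4 if phase == "L" else 0)

def margin_ns(stable_from, strobe_begin="2.L", t_eclk_ns=T_ECLK_NS):
    """Time between 'data stable on the port' and the falling edge of strobe."""
    tenths = start_tenths(strobe_begin) - start_tenths(stable_from)
    return tenths * t_eclk_ns // 10

# $ff write (slow LO->HI change): stable from 1.L according to the time sheet
print("slow edge margin:", margin_ns("1.L"), "ns")
# $00 write (fast HI->LO change): already stable in 1.H
print("fast edge margin:", margin_ns("1.H"), "ns")
```

In this model the slow edge leaves exactly one E cycle (about 1.4 us) between stable data and the strobe, which is the safety range mentioned above.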
2.3 Writing multiple bytes in a row
What happens to the strobe signal if we write two or more bytes in a row (i.e. one byte in each E cycle)?
Let’s see and write two bytes (port again setup with $00):
    lea     $bfe101,a0      ; parallel port data
    move.b  #$ff,d0
    moveq   #$00,d1
    move.b  d0,(a0)         ; write in 0.H
    move.b  d1,(a0)         ; write in 1.H
The time sheet:
EClock  CPU  PortB  Strobe  Annotation
0.H     wff  00     H
--- Marker
0.L     -    ff!    H       realizing ff on port (slow)
1.H     w00  ff!    H       needs this range, too
--- Marker
1.L     -    00!    H       realizing 00 on port (fast)
2.H     -    00*    H
2.L     -    ff     L       regular strobe begin (1st E)
3.H     -    ff     L
3.L     -    ff     L       (2nd E)
4.H     -    ff     L       regular strobe end
4.L     -    ff     L       extended strobe begin (3rd E)
5.H     -    ff     L       extended strobe end
5.L     -    ff     H
What do we see?
- A strobe of length 3 * E! So the first write’s strobe and the second one are somewhat merged now.
- The $ff write happens right before the left marker and is established on the port inside the two markers’ range. (LO-HI transition = slow)
- The $00 write happens right before the right marker and is established right after the marker. (HI-LO transition = fast)
- The $ff value is only valid at the end of the two marker’s interval!
Let’s write 4 bytes in a row:
- Still a Strobe of 3E cycle length! The strobe width is not enlarged, no matter how many bytes you send. Seems that the strobe logic gets stuck.
- Data $ff, $00, and $ff is valid at the end of the E ranges around the falling edge of strobe
Now 4 bytes starting with $00 (port was $ff):
Same result here:
- A 3E Strobe and nothing more!
- Data again valid at the end of the E cycles around falling edge of strobe
To sum up this experiment: While we can write to the CIA from the Amiga with E cycle speed, the resulting strobe signals are not useable anymore! However, all data values appear on the port lines (in fragments of the E cycle).
Let’s call the non-stop writes to the CIA 1E Transfers, and let’s look at transfers that take more E clock cycles in the next experiments.
Data transfer speed of 1E Transfers is E clock speed, i.e. 716/709 KB/s
What is the lowest xE transfer that generates useable strobes?
2.4 2E Transfers
OK, we need to pause between the data writes from the Amiga. To be precise, we want to wait for multiples of the E clock. The best way to “wait” for (or better: waste) an E clock cycle is to actually perform a register access to one of the CIAs. Make sure the access has no side effects: reading a Port A register (which does not strobe) already does the trick.
The code for a 2E transfer does a write (first E cycle) followed by one pause (second E cycle) and looks like this:
    lea     $bfe101,a0      ; parallel port data
    lea     $bfd000,a1      ; let's use CIAB Port A to "waste" E cycles
    move.b  #$ff,d0
    moveq   #$00,d1
    move.b  d0,(a0)         ; 1E write in 0.H
    tst.b   (a1)            ; 1E waste cycle by reading register (1.H)
                            ; =2E transfer per byte
    move.b  d1,(a0)         ; write in 2.H
    tst.b   (a1)            ; waste E cycle (3.H)
- Strobe is back again at 2E. But only the first one is visible! All others are gone 🙁
- Complete range of 1E port data valid (1E range for port setup)
- Note: instead of reading a “waste” value in the second E access to the CIA, you can also perform a single control signal write. In the picture above I toggled the SEL signal. This gives you an exact location of the 1.H, 3.H, … locations and can be used on the receiver side as a sync signal! (Very useful since strobe is broken here)
- Note2: If you toggle SEL (or POUT, BUSY) this way, you can only write Port A (but not read it beforehand). Therefore, a signal update of only the parallel line bits won’t work. In fact you have to ignore the serial line bits in the same port and always write them to a constant value -> Serial lines don’t work with 2E transfers !! or in other words: There is no system friendly way to implement it…
- Data transfer speed of 2E is half of 1E: 354.5 – 358 KB/s
EClock  CPU      PortB  Strobe  Annotation
0.H     w55      ff     H
0.L     -        55!    H       realizing 55 on port
1.H     <waste>  55!    H       needs this range, too
1.L     -        55     H
2.H     waa      55     H
2.L     -        aa!    L       regular strobe begin
3.H     <waste>  aa!    L
--- Marker
3.L     -        aa     L
4.H     w55      aa     L       regular strobe end
4.L     -        55!    H
5.H     -        55!    H
--- Marker
5.L     -        55     H
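Note2 above means that the sync write in the second E cycle has to carry a complete Port A byte. A small Python sketch shows how the two alternating byte values could be precomputed. The constant for the serial bits is purely hypothetical (all serial lines held high, like the 0xff default in the register table). A real program would have to decide what the serial lines should be, which is exactly why this is not system friendly.

```python
# Parallel port control lines in CIA B Port A (bits 0, 1, 2)
BUSY, POUT, SEL = 0x01, 0x02, 0x04
PAR_MASK = BUSY | POUT | SEL

# HYPOTHETICAL constant for the serial line bits 3..7 (all high)
SERIAL_CONST = 0xF8

def pra_value(par_bits, serial_bits=SERIAL_CONST):
    """Full byte for a plain write to Port A: constant serial bits + control bits."""
    return (serial_bits & ~PAR_MASK & 0xFF) | (par_bits & PAR_MASK)

# The two values a 2E loop would write alternately to toggle SEL
sel_high = pra_value(SEL)
sel_low = pra_value(0)
print(f"SEL high: ${sel_high:02x}, SEL low: ${sel_low:02x}")
```

Because the values are fixed in advance, the loop only needs a single write per sync edge and stays within one E cycle.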
2.5 3E Transfers
Since 2E transfers still have broken strobe output, let’s add another “wasted” cycle and set up a 3E transfer. With two spare E cycle accesses in our transfer loop we can also use the two cycles to perform a read/modify/write operation on a register. E.g. a bclr (bit clear) or bset (bit set) operation can be used to modify a control line of the parallel port, which then serves as a “clock” line for our data transfer.
    lea     $bfe101,a0      ; parallel port data
    lea     $bfd000,a1      ; let's use CIAB Port A to "waste" E cycles
    move.b  #$ff,d0
    moveq   #$00,d1
    move.b  d0,(a0)         ; 1E write data
    tst.b   (a1)            ; 2E waste cycles
    tst.b   (a1)
    move.b  d1,(a0)         ; 1E write data
    bclr    d1,(a1)         ; 2E cycles to clear "clock" line (bit 0)
    move.b  d1,(a0)         ; 1E write data
    bset    d1,(a1)         ; 2E cycles to set "clock" line
A scope plot of a 3E transfer:
- Ah! Now we have valid strobes! Makes sense: timing per byte is now 3E with 2E for (fixed) strobe size and 1E for the spacing between strobes.
- Data transfer speed for a 3E transfer is a third of the 1E speed: 236 – 239 KB/s
- The current 0.6 plipbox implementation uses a 3E transfer method and achieves the calculated limit of about 240 KB/s.
Time sheet 3E Transfer:
EClock  CPU       PortB  Strobe  Annotation
0.H     w55       ff     H
0.L     -         55!    H       realizing $55 on port
1.H     <waste1>  55!    H       needs this range, too
1.L     -         55     H
2.H     <waste2>  55     H
2.L     -         55     L       regular strobe begin
3.H     waa       55     L
3.L     -         aa!    L
4.H     <w1>      aa!    L       regular strobe end
4.L     -         aa     H
5.H     <w2>      aa     H
5.L     -         aa     L       next strobe begin
6.H     w55       aa     L
6.L     -         55!    L
7.H     <w1>      55!    L       next strobe end
7.L     -         55     H
Note: you can see that the first value (here $55) is valid during the H->L falling edge of the first strobe. That’s the point in time when the external device reads the value.
You can now continue to add waste cycles and introduce 4E, 5E, … transfers. But they do not really make sense as they only move the strobe further apart. You cannot really use the extra E cycles…
Here is an example of a 4E transfer:
Note the 4E strobe cycle: 2E strobe and 2E spacing between strobes.
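The strobe pattern of the 3E and 4E transfers can be reproduced with a tiny model in Python. It encodes only what the time sheets show: the strobe falls at the start of the .L range two cycles after the access and stays low for 2E. The model is deliberately restricted to N >= 3, because the 1E and 2E experiments showed that the real strobe logic does not behave this way when writes come faster. All names are made up for this sketch.

```python
STROBE_DELAY_TENTHS = 24  # access in n.H -> strobe falls at (n+2).L = 2.4 E later
STROBE_WIDTH_TENTHS = 20  # observed strobe width of the 8520: 2 E

def strobe_windows(n_e, count):
    """(fall, rise) times of the strobes of an N-E transfer, in tenths of E."""
    if n_e < 3:
        raise ValueError("model only matches the observations for 3E and slower")
    windows = []
    for i in range(count):
        fall = i * n_e * 10 + STROBE_DELAY_TENTHS
        windows.append((fall, fall + STROBE_WIDTH_TENTHS))
    return windows

def spacing_e(n_e):
    """High time between two consecutive strobes, in E cycles."""
    (_, rise0), (fall1, _) = strobe_windows(n_e, 2)
    return (fall1 - rise0) / 10

for n in (3, 4):
    print(f"{n}E: windows {strobe_windows(n, 2)}, spacing {spacing_e(n):.0f} E")
```

For 3E the second strobe falls at 5.4 E, i.e. at 5.L as in the time sheet, and the spacing grows by one E cycle for every extra waste cycle.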
2.6 What about read transfers?
In the above experiments I always talked about writing bytes to the port. But what changes if we want to read data with 1E, 2E, or 3E transfers?
- Strobing is essentially the same. After a read operation the strobe will be generated.
- The device feeding the port needs to set up the data to be read before the .H sub cycle that performs the CPU read operation
Here is a time sheet of a 3E read:
EClock  CPU       PortB  Strobe  Annotation
-1.H    -         11!    H       (safe setup time)
-1.L    -         11!    H       device sets up data on PortB
0.H     r11       11     H       CIA reads PortB
0.L     -         11     H
1.H     <waste1>  11     H
1.L     -         11     H
2.H     <waste2>  22!    H       (safe setup time)
2.L     -         22!    L       device sets new data on PortB
3.H     r22       22     L       CIA reads PortB
3.L     -         22     L
4.H     <w1>      22     L       regular strobe end
4.L     -         22     H
5.H     <w2>      22     H
- The external device needs to set up data right before the CPU access. While the .L sub cycle before the read might suffice for a stable read, it is safer to set up the data already in the .H sub cycle before it.
- If you use a parallel port control line to “clock” the data you can set the line before the first CPU read and start reading with the first byte.
- If you want to use the strobes to sync your reads then you have a problem: The strobe signal arrives _after_ the read! To get in sync with this signal you must use a trick: first perform a dummy CPU read just to generate a strobe and then use this strobe to sync your device’s writes:
- In the above time sheet we dummy read at 0.H
- The device already sets up data 0x22
- The CPU performs the next read at 3.H and gets 0x22
- The device waits for the rising edge of strobe (4.H – 4.L) and sets the next data
- Reading in 2E and even 1E gets more difficult, as in the worst case no “clock” signal is available and you have to use a sampling pattern with a fixed E size to set up the data in time from the device. It is still open whether it is possible to write a stable 1E transport this way.
- In most reader code the interrupts have to be disabled on the Amiga side; otherwise the clocked setting up of data before a read might arrive too late and the CPU reads a wrong value.
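For the strobe-synced 3E read it is useful to know how much time the device has between the rising edge of strobe and the moment the next byte must be stable. The following Python sketch estimates this from the time sheet above: the strobe rises at the start of 4.L, the next CPU read happens in 6.H, and the data has to be stable t_PS = 300 ns earlier. This is an idealized budget under these assumptions. It ignores the time the device needs to detect the edge, and the 16 MHz figure is the AVR clock mentioned at the top of this post.

```python
T_PS_NS = 300          # 6526: port setup time on read, min
AVR_HZ = 16_000_000    # plipbox AVR clock

def setup_budget_ns(t_eclk_ns, strobe_rise_tenths=44, next_read_tenths=60):
    """Time from strobe rising edge (4.L) until data must be stable for 6.H."""
    window_ns = (next_read_tenths - strobe_rise_tenths) * t_eclk_ns // 10
    return window_ns - T_PS_NS

def avr_cycles(budget_ns, avr_hz=AVR_HZ):
    """Whole AVR clock cycles that fit into the budget."""
    return budget_ns * avr_hz // 1_000_000_000

for system, t_eclk in (("NTSC", 1400), ("PAL", 1410)):
    budget = setup_budget_ns(t_eclk)
    print(f"{system}: {budget} ns, about {avr_cycles(budget)} AVR cycles")
```

Roughly 31 AVR cycles is not much for detecting the edge, fetching the next byte and writing it to the port, which illustrates why the device side has to be tightly coded.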
This (rather long) blog article shows you all the details when transferring data over the parallel port at the maximum possible speed. We discovered some interesting anomalies with strobe generation at these high transfer rates.
I introduced a new speed classification for the parallel transfer types called 1E, 2E, or 3E transfers.
The top speeds achievable with the xE transfers are:
1E: 709..716 KB/s
2E: 355..358 KB/s
3E: 236..239 KB/s
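These limits follow directly from the E clock: an xE transfer moves one byte every x E cycles. A short Python sketch, using the rounded E clock values from section 1.2 and 1 KB = 1000 bytes:

```python
F_ECLK_HZ = {"PAL": 709_000, "NTSC": 716_000}  # rounded E clocks from section 1.2

def rate_kb_s(n_e, f_eclk_hz):
    """Top speed of an N-E transfer: one byte every N E cycles."""
    return f_eclk_hz / n_e / 1000

for n in (1, 2, 3):
    pal = rate_kb_s(n, F_ECLK_HZ["PAL"])
    ntsc = rate_kb_s(n, F_ECLK_HZ["NTSC"])
    print(f"{n}E: {pal:.1f}..{ntsc:.1f} KB/s")
```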
Current plipbox version 0.6 implements a 3E transfer using external control lines for clocking. I am currently experimenting with a 3E transfer using only strobes as signalling (it frees control lines for other functions). Another interesting coding exercise will be a 2E or even a 1E transfer… Now the technical background is available!
When doing keyboard handshake, you set KEYB_DAT (connected to CIA-A pin SP) low by setting the SPMODE bit in CRA to output.
On CIA-B, SP is connected to the BUSY signal on the parallel port, so perhaps toggling SPMODE can be used to create a 2E sync signal.