m00dawg

Getting data to/from the X16


10 hours ago, Ed Minchau said:

In order for the data to get from the SD card into the X16, VERA has to put that data on one of two channels or the CPU couldn't see it at all. 

You are saying something that contradicts the Vera Programmer's Reference: the SPI data register is at $9F3E and the SPI control register (three control bits) is at $9F3F, while Port A and Port B are accessed at $9F23 and $9F24. There is no need for Vera to "put that data on one of two channels". There is, in fact, no direct way to get that data through PortA or PortB ... the registers that are read directly are not in the Vera internal memory map. (They used to be, but not anymore.)

So ASSERTING the SPI data register ONTO the same internal data bus used by Port A and Port B would NOT BE "just a transistor". You've reached that conclusion based on a false premise.

So, setting that "just a transistor" hyperbole aside, suppose the logic is available to assert the data port register onto the internal bus used by Port A and Port B. Then, yes, having the CPU set the port address to the desired Vera target minimizes the additional logic resources required. It's not done yet, though, since triggering the "phony CPU write" on the data port addressing the Vera RAM when the SPI countdown reaches 0 is additional logic.

But that only saves 4 clocks per byte, since the CPU still has to check the finished bit and trigger the next byte transfer. It's hardly worth the extra work.

To make it worth the trouble, you need to define a standard chunk size and work in chunks, say 16 bytes at a time, so you need a 4-bit countdown circuit. If it is read-only, you wouldn't need a data direction bit, but you still need an "SD to Vera block write" selection bit. The byte written on MOSI could always be the same filler value for each cycle ($FF is the usual SD filler, since the card expects MOSI held high while it is sending data).

You also need additional logic for the "ready" bit, which means "byte finished" in regular mode and "chunk finished" in chunk-write mode. Even if the block-write bit itself works as the trigger, and its being reset signals that the transfer has completed, you still need circuitry to reset it when the countdown underflows.

So if we are minimizing additional logic, the target port is dedicated, say PortB, and there is one control bit added that when set puts the SPI system into "Vera Write" mode, and when that goes 0 again, the chunk move is finished.
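As a rough illustration of the chunk mode described above, here is a small Python behavioural model. It is a sketch under stated assumptions, not VERA's design: the class name, `start_chunk`, `spi_byte_done` and the attribute names are all hypothetical. It models only what the post proposes: one added block-write bit, a 4-bit down-counter, writes through a dedicated PortB with auto-increment, and the bit clearing itself on underflow to signal "chunk finished".

```python
class ChunkWriteModel:
    """Hypothetical model of the proposed "SD to Vera block write" mode."""

    CHUNK = 16  # bytes per chunk, so a 4-bit down-counter (15..0) is enough

    def __init__(self, vram_size: int = 0x20000):
        self.vram = bytearray(vram_size)  # 128 KB of VRAM by default
        self.portb_addr = 0       # PortB address, set up by the CPU beforehand
        self.portb_incr = 1       # PortB auto-increment step
        self.block_write = False  # the one added control bit; doubles as "busy"
        self.count = 0            # 4-bit countdown

    def start_chunk(self) -> None:
        """CPU sets the block-write bit; the counter is loaded with 15."""
        self.block_write = True
        self.count = self.CHUNK - 1

    def spi_byte_done(self, byte: int) -> None:
        """Called each time 8 bits have been shifted in from MISO."""
        if not self.block_write:
            return  # regular mode: the CPU reads the SPI data register itself
        # first Vera clock: the "phony CPU write" through PortB
        self.vram[self.portb_addr] = byte & 0xFF
        # second Vera clock: auto-increment PortB and count down
        self.portb_addr = (self.portb_addr + self.portb_incr) % len(self.vram)
        if self.count == 0:
            self.block_write = False  # underflow resets the bit: chunk finished
        else:
            self.count -= 1


m = ChunkWriteModel()
m.portb_addr = 0x1000  # CPU points PortB at the target first
m.start_chunk()
for b in range(16):
    m.spi_byte_done(b)
print(m.block_write, hex(m.portb_addr))  # False 0x1010
```

The CPU's only job between chunks is to poll `block_write` and set it again, which is the busy-wait the later replies point out.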

With the serial clock halved for SPI (because 25MHz is faster than the SPI-mode serial clock maximum of 20MHz), the write of the byte could happen in one 25MHz Vera clock, and the auto-increment of the port and the decrement of the chunk count in another. Taken together that is 1 serial clock, so it would be 9 serial clocks per byte. That is 144 serial clock pulses on the 12.5MHz SCLK, or 93 CX16 clock cycles to move a 16-byte chunk, with the 65C02 still doing the SPI traffic for getting the file, getting the first sector, setting up Vera, triggering the block moves, getting the next sector, and so on.
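The arithmetic in the previous paragraph can be checked with exact fractions. This assumes the X16 CPU runs at 8 MHz (not stated in the post, but it is the clock that turns 144 SCLK pulses into the quoted 93 cycles) and counts only raw transfer time, not the CPU's command traffic.

```python
from fractions import Fraction
import math

VERA_CLK_HZ = 25_000_000
SCLK_HZ = Fraction(VERA_CLK_HZ, 2)  # 12.5 MHz serial clock
CPU_HZ = 8_000_000                  # assumed X16 CPU clock
CHUNK_BYTES = 16

# 8 serial clocks to shift the byte + 1 serial clock (two 25 MHz Vera clocks)
# for the PortB write and the auto-increment/countdown step
SCLK_PER_BYTE = 8 + 1

sclk_per_chunk = SCLK_PER_BYTE * CHUNK_BYTES
chunk_us = Fraction(sclk_per_chunk) / SCLK_HZ * 1_000_000
cpu_cycles = Fraction(sclk_per_chunk) / SCLK_HZ * CPU_HZ

print(sclk_per_chunk)                            # 144
print(float(chunk_us))                           # 11.52
print(float(cpu_cycles), math.ceil(cpu_cycles))  # 92.16 93
```

The 93 figure is the 92.16 cycles rounded up to whole CPU clocks.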

Note that while this approach maximizes reuse of existing logic resources, by the same token it locks the CPU out of both PortA and PortB, because it uses the same internal data bus that is connected to the motherboard data bus. So ONLY Vera control that can be managed through the directly memory-mapped registers can be done while a chunk is transferring ... that's one reason for 16-byte chunks. For instance, you can load the PCM while an "SD to Vera Write" is in process, but you cannot access the PSG registers: they would have to be the target of the block move (which is one reason the chunk is more flexible if it is not bigger than 16).

Edited by BruceMcF


SPI is just a glorified shift register. The FPGA only needs a few latches to hold a byte, to make them readable/writable on the data bus pins as a byte, and to shift them in/out of the MISO/MOSI pins while holding an SS pin active and pulsing the clock pin.

It does this as a dumb, unthinking reflex. It doesn't know what those bits mean any more than the shift register in an SNES joypad does.

You could implement SPI in the VIA chips just as well, but the VERA shifts bits at 25 MHz, so you have a byte ready by the time the next CPU instruction is ready, whereas the VIA would be 3 times slower.

3 hours ago, ZeroByte said:

SPI is just a glorified shift register. The FPGA only needs a few latches to hold a byte, to make them readable/writable on the data bus pins as a byte, and to shift them in/out of the MISO/MOSI pins while holding an SS pin active and pulsing the clock pin.

It does this as a dumb, unthinking reflex. It doesn't know what those bits mean any more than the shift register in an SNES joypad does.

You could implement SPI in the VIA chips just as well, but the VERA shifts bits at 25 MHz, so you have a byte ready by the time the next CPU instruction is ready, whereas the VIA would be 3 times slower.

Exactly, SPI for any given mode is a parallel read/write shift register with serial input and output, designed for a specific clock and latch polarity. All the SPI in Vera adds to that is 3 register bits and a countdown to shift eight bits then stop sending the clock signal into SCLK.

In Mode 0, the serial shift register samples on the leading, rising edge of the clock and shifts out on the falling, trailing edge of the clock. Make the shift register work that way and you need no "mode" control.
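To make the "glorified shift register" point concrete, here is a minimal Python simulation of one SPI Mode 0 byte exchange, MSB first. It is an illustrative sketch, not VERA's HDL, and the function name is hypothetical. Each end presents its MSB while the clock is low, samples the other end's line on the rising edge, and shifts on the falling edge, so after eight clocks the two shift registers have simply swapped contents.

```python
def spi_mode0_exchange(master_out: int, slave_out: int) -> tuple[int, int]:
    """Swap one byte between master and slave shift registers (SPI Mode 0, MSB first)."""
    master_sr = master_out & 0xFF
    slave_sr = slave_out & 0xFF
    for _ in range(8):
        # SCLK low: each side drives its current MSB onto its output line
        mosi = (master_sr >> 7) & 1
        miso = (slave_sr >> 7) & 1
        # rising (leading) edge: each side samples the other side's line
        master_in, slave_in = miso, mosi
        # falling (trailing) edge: each side shifts left, sampled bit enters at the LSB
        master_sr = ((master_sr << 1) | master_in) & 0xFF
        slave_sr = ((slave_sr << 1) | slave_in) & 0xFF
    return master_sr, slave_sr


received, card_got = spi_mode0_exchange(0xA5, 0x3C)
print(hex(received), hex(card_got))  # 0x3c 0xa5
```

Nothing in the loop knows what the bits mean; the "SPI port" is the eight-iteration countdown plus the edge convention.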

For the details, SPI in the VIA would be worse than 1/3 the speed. (1) The Vera shifts bits at 12.5MHz. It could easily shift bits faster, but the SPI-mode spec only requires cards to cope with serial clocks up to 20MHz, and 12.5MHz is the simplest rational fraction of the Vera clock that is less than or equal to 20MHz. (2) The VIA serial shift register shifts at up to half the PHI2 clock (the process is the countdown plus one clock to flip the serial clock on underflow, so 2 hardware clock cycles per serial clock cycle is the fastest possible), so ideally it would be 1/3 as fast. But (3) the VIA hardware serial shift register doesn't work well with the clock phase of SD SPI mode.

The problem is that the VIA shift register is designed to shift first and sample after, so it works best with SPI Modes 1 and 3. Using it with Modes 0 and 2 in hardware requires additional work and some glue logic, and the additional work adds overhead to the transfer. Alternatively, you can bit-bang with no extra hardware required, but with a BIG slowdown.
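The ideal ratio in point (2) works out as follows, assuming an 8 MHz PHI2 (an assumption, not stated in the post). This is raw shift time only: it ignores the Mode 0 phase problem in point (3) and any polling or loop overhead, which is why the real VIA figure would be worse.

```python
from fractions import Fraction

CPU_HZ = 8_000_000                      # assumed PHI2 of the X16
VERA_SCLK_HZ = Fraction(25_000_000, 2)  # 12.5 MHz, per point (1)
VIA_SCLK_HZ = Fraction(CPU_HZ, 2)       # best case: one serial clock per 2 PHI2 clocks


def cpu_cycles_per_byte(sclk_hz):
    """CPU cycles spent just shifting 8 bits at the given serial clock."""
    return 8 / sclk_hz * CPU_HZ


ratio = VIA_SCLK_HZ / VERA_SCLK_HZ
print(float(ratio))                              # 0.32
print(float(cpu_cycles_per_byte(VERA_SCLK_HZ)))  # 5.12
print(float(cpu_cycles_per_byte(VIA_SCLK_HZ)))   # 16.0
```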

Even adding the circuit to load 16 bytes in a row straight to Vera is substantially MORE complex than the current SPI register itself.

Edited by BruceMcF


The FPGA used by VERA has an SPI hard IP block (see TN 2010 iCE40 I2C and SPI Hardened IP Usage Guide), so the number of logic cells that need to be dedicated to SPI functionality should be pretty minimal. The bus timing of SD cards in SPI mode is identical to the timing in SD mode (see Section 7.8, Physical Layer Simplified Spec v8.00, SD Association). There is a TRAN_SPEED field (in the card's CSD register) you can read to determine the maximum supported speed, but only 8'h32 and 8'h5A are permitted (Section 5.3.2 of the spec mentioned), specifying either 25MHz in standard mode or 50MHz in high speed mode.
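For reference, here is a small Python sketch of how the TRAN_SPEED byte encodes those two values, based on the CSD field layout in the SD Physical Layer spec: bits 2:0 select a rate unit and bits 6:3 a time-value multiplier. The function and table names are illustrative; check the tables against Section 5.3.2 before relying on them.

```python
from fractions import Fraction

# Transfer rate unit, bits 2:0 (codes 4-7 are reserved)
RATE_UNIT_BPS = {0: 100_000, 1: 1_000_000, 2: 10_000_000, 3: 100_000_000}
# Time value multiplier, bits 6:3 (code 0 is reserved)
TIME_VALUE = {
    0x1: Fraction(10, 10), 0x2: Fraction(12, 10), 0x3: Fraction(13, 10),
    0x4: Fraction(15, 10), 0x5: Fraction(20, 10), 0x6: Fraction(25, 10),
    0x7: Fraction(30, 10), 0x8: Fraction(35, 10), 0x9: Fraction(40, 10),
    0xA: Fraction(45, 10), 0xB: Fraction(50, 10), 0xC: Fraction(55, 10),
    0xD: Fraction(60, 10), 0xE: Fraction(70, 10), 0xF: Fraction(80, 10),
}


def decode_tran_speed(value: int) -> int:
    """Return the maximum transfer rate (bits/s per data line) encoded in TRAN_SPEED."""
    unit = value & 0x07
    mult = (value >> 3) & 0x0F
    if unit not in RATE_UNIT_BPS or mult not in TIME_VALUE:
        raise ValueError(f"reserved TRAN_SPEED encoding: {value:#04x}")
    return int(TIME_VALUE[mult] * RATE_UNIT_BPS[unit])


print(decode_tran_speed(0x32))  # 25000000
print(decode_tran_speed(0x5A))  # 50000000
```

So 0x32 is 2.5 x 10 Mbit/s and 0x5A is 5.0 x 10 Mbit/s.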

Share this post


Link to post
Share on other sites
9 minutes ago, Wavicle said:

The FPGA used by VERA has an SPI hard IP block (see TN 2010 iCE40 I2C and SPI Hardened IP Usage Guide), so the number of logic cells that need to be dedicated to SPI functionality should be pretty minimal. The bus timing of SD cards in SPI mode is identical to the timing in SD mode (see Section 7.8, Physical Layer Simplified Spec v8.00, SD Association). There is a TRAN_SPEED field (in the card's CSD register) you can read to determine the maximum supported speed, but only 8'h32 and 8'h5A are permitted (Section 5.3.2 of the spec mentioned), specifying either 25MHz in standard mode or 50MHz in high speed mode.

Just to be clear: It's not necessarily that the FPGA needs a lot of logic cells to allow an external source (such as the X16) to interact with SPI. It's that in order to add the flexibility to VERA to allow it to directly store bytes from SPI to video RAM without passing through the CPU first would require more logic cells than are currently allocated to it. Even if there are enough logic cells left to support a "fire and forget" strategy for the next X bytes, it's not as though we're dealing with a full blown multitasking friendly CPU or OS. Typically (or so it seems to me) if you have an ability to tell the hardware "transfer the next X bytes without the use of the CPU" you would generally want to signal the main system when that process is complete so that it can set up the next transfer. Given the typical implementation of the kernal, it would wind up sitting in a busy loop waiting for the signal that the transfer is done.

I see several possibilities:

1. There aren't enough logic cells available to add the functionality to support both CPU and VRAM delivery options.

2. There are enough logic cells available but it increases the complexity meaning there is another thing that could go wrong, and it doesn't really improve CPU performance because it still has to wait for the delivery notification.

3. There are enough logic cells available and the kernal becomes more complex due to dealing with an interrupt driven SPI interface so that the CPU can go on about other business while waiting for the background VRAM transfer to complete.

In a perfect world, sure, it would be nice to support this mode. I think the general purpose approach is more than adequate for most tasks, even if it isn't optimal for loading into VRAM. It's not like sales of the C=64 were too negatively impacted by its slow IEC bus protocol.

1 hour ago, Scott Robison said:

Just to be clear: It's not necessarily that the FPGA needs a lot of logic cells to allow an external source (such as the X16) to interact with SPI. It's that in order to add the flexibility to VERA to allow it to directly store bytes from SPI to video RAM without passing through the CPU first would require more logic cells than are currently allocated to it. Even if there are enough logic cells left to support a "fire and forget" strategy for the next X bytes, it's not as though we're dealing with a full blown multitasking friendly CPU or OS. Typically (or so it seems to me) if you have an ability to tell the hardware "transfer the next X bytes without the use of the CPU" you would generally want to signal the main system when that process is complete so that it can set up the next transfer. Given the typical implementation of the kernal, it would wind up sitting in a busy loop waiting for the signal that the transfer is done.

I see several possibilities:

1. There aren't enough logic cells available to add the functionality to support both CPU and VRAM delivery options.

2. There are enough logic cells available but it increases the complexity meaning there is another thing that could go wrong, and it doesn't really improve CPU performance because it still has to wait for the delivery notification.

3. There are enough logic cells available and the kernal becomes more complex due to dealing with an interrupt driven SPI interface so that the CPU can go on about other business while waiting for the background VRAM transfer to complete.

In a perfect world, sure, it would be nice to support this mode. I think the general purpose approach is more than adequate for most tasks, even if it isn't optimal for loading into VRAM. It's not like sales of the C=64 were too negatively impacted by its slow IEC bus protocol.

The bigger challenge with such functionality may be VRAM contention. Based on the way the sprite composer works, I suspect that the VRAM is 32 bits wide and clocked at 25MHz. It isn't clear what timing guarantees VERA provides. A random read requires at least two bus operations, which the CPU cannot do in fewer than 6 or 8 cycles (I forget exactly); however, a bus-mastering expansion card could do those two operations in exactly two bus cycles, giving us roughly 4 cycles @ 25MHz from writing ADDRx until the data needs to be on the data bus without violating tDSR (maybe only 3, depending on the latency of the bus transceivers), unless VERA can bus master and assert RDY#. The bus mastering story with the X16 isn't quite clear, so I suspect that VERA doesn't do that.

This is primarily a problem for handling random reads from the system bus; sequential reads can probably be anticipated. Writes could be posted to alleviate VRAM pressure, but that's only meaningful if the posted-write FIFO can drain during HBLANK.
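A posted-write FIFO of the kind mentioned above can be sketched in a few lines of Python. This is purely hypothetical (nothing here is VERA's design; the class name, method names and the depth of 2 are made up for illustration). It shows the two properties that matter: writes are accepted without touching VRAM, but only retire in slots the renderer isn't using (e.g. HBLANK), and reads must see pending writes to the same address.

```python
from collections import deque


class PostedWriteFifo:
    """Hypothetical posted-write buffer in front of VRAM (illustration only)."""

    def __init__(self, vram: bytearray, depth: int = 4):
        self.vram = vram
        self.depth = depth
        self.pending = deque()  # (address, value) in arrival order

    def cpu_write(self, addr: int, value: int) -> bool:
        """Accept a write without touching VRAM; False means full (hardware would stall)."""
        if len(self.pending) >= self.depth:
            return False
        self.pending.append((addr, value & 0xFF))
        return True

    def cpu_read(self, addr: int) -> int:
        """Reads must see the newest pending write to the same address."""
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return self.vram[addr]

    def vram_slot(self, display_busy: bool) -> None:
        """Retire one posted write, but only in a slot the renderer isn't using."""
        if not display_busy and self.pending:
            addr, value = self.pending.popleft()
            self.vram[addr] = value


vram = bytearray(16)
fifo = PostedWriteFifo(vram, depth=2)
fifo.cpu_write(0, 0x11)
fifo.cpu_write(1, 0x22)
print(fifo.cpu_write(2, 0x33))             # False: full until it drains
fifo.vram_slot(display_busy=True)          # active display: nothing drains
fifo.vram_slot(display_busy=False)         # blanking: first write retires
print(vram[0], vram[1], fifo.cpu_read(1))  # 17 0 34
```

If the display keeps VRAM busy for most slots, the FIFO fills and the CPU stalls anyway, which is the caveat in the paragraph above.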

I could see a SPI-to-VRAM mechanism improving performance of some specific/niche activities; e.g. you could stream frames from the SD card directly to VRAM for video playback. Not sure that would be enough to justify the complexity though.

46 minutes ago, Wavicle said:

I could see a SPI-to-VRAM mechanism improving performance of some specific/niche activities; e.g. you could stream frames from the SD card directly to VRAM for video playback. Not sure that would be enough to justify the complexity though.

Agreed.

4 hours ago, Scott Robison said:

Agreed.

 

5 hours ago, Wavicle said:

I could see a SPI-to-VRAM mechanism improving performance of some specific/niche activities; e.g. you could stream frames from the SD card directly to VRAM for video playback. Not sure that would be enough to justify the complexity though.

See, that was my incorrect assumption when I made video demos for the X16. It seemed logical to me that transfer from SD card to VRAM would be faster than 140kb/s rather than slower. So that's exactly what I was doing, loading tilemaps and palettes directly from SD card and playing them at 20 fps. Well, it won't work on the actual hardware.

It's just something that wasn't explicitly stated in the documentation that loading to VRAM is slower. No mention of that speed at all, actually. 
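For scale, here is the per-frame budget that the numbers in this post imply. The reading of "140kb/s" as 140 kilobytes per second (1 KB = 1000 bytes) is an assumption; the palette size is VERA's 256 entries of 2 bytes each. How large the tilemaps in the demo were isn't stated, so this only shows the budget, not whether they fit.

```python
# Assumption: "140kb/s" is read as 140 kilobytes per second, 1 KB = 1000 bytes
THROUGHPUT_BPS = 140_000
FPS = 20
PALETTE_BYTES = 256 * 2  # VERA palette: 256 entries, 2 bytes each

per_frame = THROUGHPUT_BPS // FPS
print(per_frame)                  # 7000
print(per_frame - PALETTE_BYTES)  # 6488 bytes left for tilemap data
```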

57 minutes ago, Ed Minchau said:

 

See, that was my incorrect assumption when I made video demos for the X16. It seemed logical to me that transfer from SD card to VRAM would be faster than 140kb/s rather than slower. So that's exactly what I was doing, loading tilemaps and palettes directly from SD card and playing them at 20 fps. Well, it won't work on the actual hardware.

It's just something that wasn't explicitly stated in the documentation that loading to VRAM is slower. No mention of that speed at all, actually. 

The only reason I could see for it being SLOWER is to reduce the size of the binary, by reusing a load to RAM routine and then reusing a copy from RAM to "VRAM" routine ...

... IOW, a bespoke routine ought to take roughly the same time for "LDA SPIDATA : STA PORTA" as for "LDA SPIDATA : STA (TARG),Y". If the VRAM loop is too FAST, so that its first check of the "Ready" flag fails where the one-clock-slower RAM loop's check succeeds, a NOP would tune that up.

But that wouldn't entail a Vera design change; it would entail a KERNAL code change. Someone who can make out what is happening in ca65 assembly code could look at how the VRAM load is done, see if they can come up with a revision that speeds it up, and submit a pull request with that version of the code.
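A quick cycle count supports the point that the store to VRAM needn't be the slow part. The counts below are the standard 65C02 base cycles. The INY in the RAM path is an assumption (some index update is needed there, while the Vera data port auto-increments), and the code common to both paths (triggering the next SPI byte, polling the ready flag, the loop branch) is deliberately left out.

```python
# Standard 65C02 base cycle counts for the instructions in the two store paths
CYCLES = {"LDA abs": 4, "STA abs": 4, "STA (zp),Y": 6, "INY": 2}


def body_cycles(instrs):
    """Sum the base cycle counts of a straight-line instruction sequence."""
    return sum(CYCLES[i] for i in instrs)


# LDA SPIDATA : STA PORTA  (the data port auto-increments, so no index update)
to_vram = ["LDA abs", "STA abs"]
# LDA SPIDATA : STA (TARG),Y : INY  (assumed index update; page wrap not counted)
to_ram = ["LDA abs", "STA (zp),Y", "INY"]

print(body_cycles(to_vram), body_cycles(to_ram))  # 8 12
```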

Edited by BruceMcF
