REU: the poor man's blitter.


zBeeble

So ... I've been watching with interest, but not contributing because I didn't see anything I needed to say.  I do now.

I think the function of the 64/128 REU's memory transfer function is under-appreciated.  It's like a budget blitter chip.  Your memory design has 2M of memory in an 8k bottleneck.  Something along the lines of the REU that could:

- memory <--> bank
- bank <--> vera
- memory <--> vera

might make things massively more functional and enable a slate of games that wouldn't be possible otherwise.  One flaw in the single 8k bank is that you can't effectively have segmented code and segmented data in a program ---- you have to choose one.

I looked at the REU implementation for the MiSTer FPGA project --- it's not a lot of FPGA code (or space) to do that.

Anyways... just a comment.

Link to comment
Share on other sites

The problem is that the FPGA doesn't have direct access to the 64K primary address space or the 2M banked memory. Everything that goes on in the FPGA happens in a 32 byte window in the IO area. Thus it wouldn't be able to do stash / fetch / verify.

Link to comment
Share on other sites

I didn't explicitly say you had to use that FPGA and I realize that the current design precludes it.  But vera -> bank would allow direct loading of the bank from media... so it makes sense.

The REU + GEOS (about the only thing other than TMP that used it) ... the thing that made it rock was that the REU could blit an 8k page in 8192 cycles (about 1/1000th of a second at your 8 MHz speed ... a lowly 1/100th-ish at the C64's speed).
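To put rough numbers on that, here is a back-of-the-envelope sketch (Python used purely as a calculator). It assumes exactly one byte per clock, a nominal 8 MHz for the X16 and a round 1 MHz for the C64 (the real C64 clock is a little off 1 MHz depending on PAL/NTSC); the function name is only for illustration.

```python
def transfer_ms(num_bytes, clock_hz, cycles_per_byte=1):
    """Milliseconds needed to move num_bytes at clock_hz."""
    return num_bytes * cycles_per_byte / clock_hz * 1000.0

PAGE = 8 * 1024  # one 8K bank

x16_ms = transfer_ms(PAGE, 8_000_000)   # nominal X16 clock
c64_ms = transfer_ms(PAGE, 1_000_000)   # rounded C64 clock (assumption)

print(f"X16 @ 8 MHz:  {x16_ms:.3f} ms per 8K")
print(f"C64 @ ~1 MHz: {c64_ms:.3f} ms per 8K")
```

So "1/1000th of a second" is about right for 8 MHz, and the C64 figure comes out a bit better than 1/100th.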

I'm just saying ... that I get leaving things out.  I get minimalism.  But the REU blit function is simple, concise and easy to understand ... and extraordinarily powerful.  And it's period authentic, to boot.

And... yes... you could make a card (likely, I haven't looked at your slot diagram) that can at least, I assume, do the base 64k access --- but (and I'm assuming here) it would be missing the access to the 2M of memory _and_ the 128k of video memory.  So having it in the base system makes the most sense.

Anyways... I have no say here --- just saying it's a missed opportunity.

Link to comment
Share on other sites

3 hours ago, zBeeble said:

Anyways... I have no say here --- just saying it's a missed opportunity.

One, just to be clear ... I am not one of the designers. I have no say, just expressing thoughts too.

Two, I'd love to see something like that myself, but it couldn't be done with the existing FPGA because it doesn't have the space or connections at present that would be required.

The one thing the Commander X16 has going in its favor over the C64 is a higher clock rate, but of course it will still take multiple clock cycles per byte to copy from a RAM bank.

There are lots of things that would be cool to see, it's just a matter of money at this point.

Link to comment
Share on other sites

1 hour ago, Scott Robison said:

The one thing the Commander X16 has going in its favor over the C64 is a higher clock rate, but of course it will still take multiple clock cycles per byte to copy from a RAM bank.

A fair bit of this is lost because of the preponderance of 16 bit data in the video system, which the 6502 does not handle very efficiently.

Link to comment
Share on other sites

8 hours ago, zBeeble said:

And... yes... you could make a card (likely, I haven't looked at your slot diagram) that can at least, I assume, do the base 64k access --- but (and I'm assuming here) it would be missing the access to the 2M of memory _and_ the 128k of video memory.  So having it in the base system makes the most sense.

Except access to the base 64K IS access to the 2M of High RAM and the 128k of video memory, since they are all accessed within the 64K address space.

Link to comment
Share on other sites

58 minutes ago, BruceMcF said:

Except access to the base 64K IS access to the 2M of High RAM and the 128k of video memory, since they are all accessed within the 64K address space.

Well... no... that is: no in many ways.  I suppose you could make your card flipity-flip through it... but this would deny the action of one monolithic act.  Not only that --- it would probably break if the CPU was using the existing mapping in any way.  So many ways... no.

Even the choice to have one mapping region (rather than, say, 2 4k mapping regions) has a huge out-size effect.

Let's say you decide to divide your big app into code segments.  In the 40k, then, you need a smart dispatch and _all_ the data.  If you use the pages for code, you can't effectively use them for (other than local) data.

OK... restart. You decide to have data segments. Cool. Then _all_ your code (and some data) needs to be in the 40k. This works for small programs, but falls down as code size approaches 40k.

The REU gives the user a wide menu of uses with a dead-simple interface. I forget if it's start+end or start+count ... + direction ... but same difference. And it copies one byte per cycle.

Your best loop is going to be

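loop:
     LDA SRC        ; absolute load; operand bytes live at loop+1/loop+2
     STA DST        ; absolute store; operand bytes live at loop+4/loop+5
     INC loop+1     ; bump source address, low byte
     BNE g1
     INC loop+2     ; carry into source high byte
g1:
     INC loop+4     ; bump destination address, low byte
     BNE g2
     INC loop+5     ; carry into destination high byte
g2:
     DEX            ; count, low byte
     BNE loop
     DEY            ; count, high byte
     BNE loop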

What did we do here? The 16-bit number with X as the low byte and Y as the high byte is the count. We're using self-modifying code (silly on modern computers, but fast on these old chips) for SRC and DST. For talking to the Vera, you save 2 of the 4 INCs.
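One detail worth checking is how X and Y have to be preloaded for the DEX/DEY tail of that loop to run the intended number of times. The sketch below is a Python model of just the counter logic (not X16 code); `iterations` and `preload` are names made up for this illustration.

```python
def iterations(x, y):
    """Count passes through a loop ending in DEX / branch / DEY / branch."""
    count = 0
    while True:
        count += 1                  # one byte copied per pass
        x = (x - 1) & 0xFF          # DEX, wrapping 0 -> 255
        if x != 0:
            continue                # branch back while X is non-zero
        y = (y - 1) & 0xFF          # DEY
        if y != 0:
            continue                # branch back while Y is non-zero
        return count

def preload(n):
    """X, Y values that make the loop run exactly n times (1 <= n <= 65536)."""
    low, high = n & 0xFF, n >> 8
    # A partial first pass of `low` bytes needs one extra outer pass in Y.
    return low, (high + (1 if low else 0)) & 0xFF

for n in (0x1234, 8192, 65536):
    x, y = preload(n)
    print(f"n={n}: X=${x:02X} Y=${y:02X} -> {iterations(x, y)} passes")
```

In this model, loading Y with the plain high byte only works when the low byte is zero; otherwise Y needs the high byte plus one, or the copy comes up 256 bytes short.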

So... let's count the cycles... we'll only count the cycles where none of the 16 bit rollovers are happening.

LDA - 4, STA - 4, INC - 6 * 2, BNE - 3 * 3, DEX - 2.  So that's 22 cycles for Vera and 31 for regular memory.

... so something that is around a few dozen lines of verilog gives you a move operation that is between 22 and 31 _times_ faster than your other options.
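For anyone who wants to check the tally, here is the same count as a small Python sketch. The cycle figures are the standard 6502/65C02 ones for absolute addressing and a taken branch that does not cross a page; like the count above, it only covers the common path where no 16-bit rollover happens. The helper name is made up.

```python
# Cycle counts for the instructions in the loop (absolute addressing,
# branch taken, no page crossing).
CYCLES = {"LDA abs": 4, "STA abs": 4, "INC abs": 6, "BNE taken": 3, "DEX": 2}

def loop_cycles(pointers_bumped):
    """Common-path cycles per byte copied.

    Each self-modified pointer costs one INC plus one taken BNE;
    the loop tail adds DEX plus one more taken BNE.
    """
    per_pointer = CYCLES["INC abs"] + CYCLES["BNE taken"]
    return (CYCLES["LDA abs"] + CYCLES["STA abs"]
            + pointers_bumped * per_pointer
            + CYCLES["DEX"] + CYCLES["BNE taken"])

ram_to_ram = loop_cycles(2)    # both SRC and DST advance
ram_to_vera = loop_cycles(1)   # VERA port auto-increments, only SRC advances

for name, cyc in (("RAM->RAM", ram_to_ram), ("RAM->VERA", ram_to_vera)):
    ms_8k = 8192 * cyc / 8_000_000 * 1000
    print(f"{name}: {cyc} cycles/byte, 8K in about {ms_8k:.1f} ms at 8 MHz")
```

Against roughly 1 ms for a one-byte-per-cycle transfer of the same 8K, that is where the "22 to 31 times" figure comes from.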

Some of you will chime in about ,X addressing.  I always wrote my loops this way... just did it again by habit, I suppose.  The argument is still sound --- one byte per cycle is unimaginably fast w.r.t. the 65C02.

 

Link to comment
Share on other sites

So you have a bus mastering card in a slot. Like the original REU was a cartridge.

Even if you have a dead simple REU, access to the 64K is all you need. What you need is two MODES: one of which works with a range of RAM, the other works on a single RAM address.

You toggle Bank 18 of High RAM in, suck 8K into the REU, toggle Bank 30 in, blast 8K from the REU. The overhead of the CPU setting the Bank is trivial for moving 8K that fast. If THAT is what you mean by "I suppose you could make your card flipity-flip through it... but this would deny the action of one monolithic act" ... treating the overhead of a CPU action to change banks ONCE PER 8,192 bytes as "denying the action of one monolithic act" is silly ... the program will be USING the segments in units of banks, the SD card routine will be LOADING them into one or more consecutive banks, loading data into a launch pad section of banks to get it into the REU is the sensible way to operate anyway.

And there's not even a "once per 8K" flippity-flip for the REU, because of the way the port access works. You set up the autoincrement in the Vera for the data port you are going to use and the type of data load you are going to do, and point the REU in "Port Channel Mode" to either DATA0 or DATA1. Now you can blast in a whole bitmap, or text screen, or bitmap row, or bitmap column, or palette, or PCM buffer, etc.

Indeed, one thing the card would want to have would be the ability to work in SMALLER units in order to avoid stalling interrupts for such a long period. 8K is probably too MUCH to do at once ... it's probably fine to have a byte register for size of move and move 1-256 bytes at a time. Then if 256 cycles is not going to mess up your music player or whatever (because you are setting up), you move things a page at a time, in 256 clocks plus loop overhead, and if you need more granularity, it's available ... if you know that 32 clock cycles is the longest you can hold off interrupts without messing with the music player in the middle of a game, you move things 32 bytes at a time, in 32 clocks plus loop overhead.

But just don't clear the target and internal address registers at the end of a cycle and restore the count register to what it was at the beginning, and it chains perfectly well.
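As a rough sketch of that trade-off (Python as a calculator, not X16 code): assume one cycle per byte moved plus a fixed CPU loop cost per chunk. The default of 8 clocks per chunk is the figure used further down this thread and should be treated as an assumption, as should the function name.

```python
def chunked_copy(total_bytes, chunk_bytes, overhead_cycles=8):
    """Model a DMA copy split into chunks, at 1 cycle per byte moved.

    Returns (total_cycles, overhead_fraction, longest_stall_cycles).
    overhead_cycles is the assumed CPU loop cost per chunk.
    """
    chunks = -(-total_bytes // chunk_bytes)            # ceiling division
    total_cycles = total_bytes + chunks * overhead_cycles
    overhead_fraction = overhead_cycles / chunk_bytes
    return total_cycles, overhead_fraction, chunk_bytes

for size in (32, 128, 256):
    cycles, frac, stall = chunked_copy(8192, size)
    print(f"{size:3d}-byte chunks: {cycles} cycles for 8K, "
          f"{frac:.1%} overhead, CPU held off {stall} cycles at a time")
```

Smaller chunks cost more total cycles but shorten the longest stretch during which interrupts cannot be serviced.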

As you say, it is not very much system resources in a programmable logic chip. It is conceivable that an REU with a 512K RAM chip and a high speed two channel USART interface could fit into a CPLD.

Link to comment
Share on other sites

But the solution you propose still doesn't deal with the CPU using the 8K bank at the time.  I'm just saying there's a real reason for a design change.

I don't think we've discussed whether the Vera can take a store (or fetch) per cycle, either... for that matter.

Link to comment
Share on other sites

Well, during blit operations, the CPU would be suspended, so it doesn't really matter what it was doing at the time. It's like hypersleep for long space voyages. You get in the freezer, and when you get out what feels like mere moments later, it's been 18 years and you're halfway across the galaxy. Sure, your kids grew up while you were gone, but you knew you were signing up for that when you applied for the mission. Same for mass blits. So long as the REU leaves the RAM/ROM/VERA pointers in whatever states they were in when it started, then no harm will come to the software using a DMA controller card installed in the expansion slots. If the controller doesn't do that, then you could SEI, do the DMA, fix any bank / VERA pointers needed, then CLI.

And as @BruceMcF points out - having to switch RAM banks during a block copy operation is of negligible impact. You'd just need to build the DMA controller to know about the banking structure and issue the appropriate bank swap writes and update its src/dst pointers accordingly.

 

Link to comment
Share on other sites

20 hours ago, ZeroByte said:

And as @BruceMcF points out - having to switch RAM banks during a block copy operation is of negligible impact. You'd just need to build the DMA controller to know about the banking structure and issue the appropriate bank swap writes and update its src/dst pointers accordingly.

Since I am "imagining doing" the REU, I certainly was NOT having the DMA controller know ANYTHING about banking structure ... I was having the CPU handle that.

So, one register has the chunk size control, maybe the control register, maybe another. The control register has whatever it has, but the readable bit 0 is "start" when set to 1, remains 1 for the operation (which doesn't matter because the CPU is asleep), and goes to 0 when completed. Point the REU address to the source, the target address to $A000. The bank is in A and the number of chunks is in X. Y is transient. "0" in X implies a count of 256.

PASTEBANK:  LDY $0 : PHY : STA $0  : LDA REUCONTROL : ORA #1 : - STA REUCONTROL : DEX : BNE - : PLA : STA $0 : RTS

My Interrupt code has to not change the bank without storing it, but the longest an interrupt has to wait is 35 cycles, because in effect STA REUCONTROL is a 35 clock cycle instruction that copies 32 bytes.

The overhead is 8 clocks on top of 32, so 25% overhead. If you want lower overhead, make the chunk bigger. At 128 byte chunks, it's 6.3% overhead. At 256 byte chunks, it's 3.1% overhead. So I don't see any particular reason why it would ever need to be bigger than 128 byte chunks ... making that a loop adds 3 bytes to multiply the block moved by up to 256, so a maximum 128 byte chunk covers 32K, where the maximum Bank move is 8K before it's time to increment the bank register. In the above, if interrupts can touch transient zero page API space and this is rommable code so I can't just store after this routine, I can have 128 byte chunks if I am not worried about interrupt lagginess; X will have 64 in it, and Y can say how many blocks:

PASTEBANKS: SEI : STA $20 : LDA $0 : PHA : LDA $20 : CLI : STA $0 : LDA REUCONTROL : ORA #1 : PHX : -- PLX : PHX : - STA REUCONTROL : DEX : BNE - : INC $0 : DEY : BNE -- : PLX : PLA : STA $0 : RTS

(Unless I have the meaning of SEI and CLI reversed ... it's been over 40 years) ... A power of two chunk size is 0 for 1 byte through to 7 for 128 bytes, so three bits in the REUCONTROL register for chunk size. 1 bit for increment target versus stable target, 1 bit for direction (copy into REU, paste into CX16) ... we still have two bits free in the REU control register. Two bytes for CX16 address, three bytes for REU address if we have a 512K SRAM in the REU, and plenty of room for expansion if people decide they want bigger ones.
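To make the bit budget concrete, here is one possible packing, sketched in Python. Only "bit 0 is start" comes from the description above; every other bit position, and all of the names, are hypothetical choices made for this illustration, not a spec.

```python
START      = 0x01             # bit 0: set to start, reads back 0 when done (from the post)
SIZE_SHIFT = 1                # bits 1-3: log2 of the chunk size (hypothetical position)
SIZE_MASK  = 0x07 << SIZE_SHIFT
INC_TARGET = 0x10             # bit 4: increment the CX16 address (hypothetical position)
PASTE      = 0x20             # bit 5: 1 = paste into CX16, 0 = copy into REU (hypothetical)
SPARE_MASK = 0xC0             # bits 6-7: the two bits still unassigned

def make_control(chunk_bytes, increment_target=True, paste=True, start=False):
    """Pack the control byte; chunk_bytes must be a power of two from 1 to 128."""
    if chunk_bytes < 1 or chunk_bytes > 128 or chunk_bytes & (chunk_bytes - 1):
        raise ValueError("chunk size must be a power of two between 1 and 128")
    value = (chunk_bytes.bit_length() - 1) << SIZE_SHIFT
    if increment_target:
        value |= INC_TARGET
    if paste:
        value |= PASTE
    if start:
        value |= START
    return value

def chunk_size(control):
    """Recover the chunk size in bytes from a control byte."""
    return 1 << ((control & SIZE_MASK) >> SIZE_SHIFT)

ctrl = make_control(32, start=True)
print(f"control=${ctrl:02X}, chunk={chunk_size(ctrl)} bytes")
```

Six bits are used (start, three for size, increment, direction), which matches the count of two left over.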

 

 

Link to comment
Share on other sites

I think that’s fancier than it needs to be. If I were making such a device I’d just have a few registers: src/dst_addr, src/dst_stride, src/dst_bank, and a 16-bit num_bytes register. Finally, a go register that has bit flags for latching behaviors of the DMA parameter registers upon completion, where 1=reset to beginning value, 0=leave them at the final value.

If you use zeros for the start_DMA value, when DMA is complete, the addresses in the src/dst regs will be one stride past the last byte, and num_bytes will be zero, and if the banks switched during transfer, the src/dst will be the ones where the next src/dst bytes live.

 So if you wanted to do writes a page at a time, you could just write 255 into num_bytes and set a 1 flag for the num_bytes field in your “begin DMA” write. It would pick up right where it left off.

DMA into VERA this way is just setting dst_addr to $9f23 and dst_stride to 0. No special mode is needed.

As for banking, I think it might make sense for the DMA to halt if a src or dst pointer underflows from zero, or strides forward from base RAM into $A000+, leaving the bytes remaining in the num_bytes register.

If either src or dst starts in a bank window, then it uses the values in src/dst bank, and if it strides out of the window, it wraps the pointer around and incs/decs the bank #.
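Here is a toy Python model of those rules, just to show they hang together; it is a sketch under assumptions, not a hardware design. The memory map is simplified (flat 64K, the $A000-$BFFF window, and a list that records writes to $9F23), and the `Machine`, `Pointer` and `dma` names are invented for this example.

```python
WINDOW_LO, WINDOW_HI, BANK_SIZE = 0xA000, 0xBFFF, 0x2000  # CX16 banked-RAM window
VERA_DATA0 = 0x9F23                                        # VERA data port 0

class Machine:
    """Toy memory map: flat 64K, banked window, and a log of VERA port writes."""
    def __init__(self, banks=4):
        self.low = bytearray(0x10000)
        self.high = [bytearray(BANK_SIZE) for _ in range(banks)]
        self.vera = []

    def read(self, addr, bank):
        if WINDOW_LO <= addr <= WINDOW_HI:
            return self.high[bank][addr - WINDOW_LO]
        return self.low[addr]

    def write(self, addr, bank, value):
        if addr == VERA_DATA0:
            self.vera.append(value)
        elif WINDOW_LO <= addr <= WINDOW_HI:
            self.high[bank][addr - WINDOW_LO] = value
        else:
            self.low[addr] = value

class Pointer:
    """One side of the transfer: address, stride and bank registers."""
    def __init__(self, addr, stride=1, bank=0):
        self.addr, self.stride, self.bank = addr, stride, bank
        self.banked = WINDOW_LO <= addr <= WINDOW_HI   # decided once, at start

    def step(self):
        """Advance one stride; return False if the transfer should halt."""
        nxt = self.addr + self.stride
        if self.banked:
            if nxt > WINDOW_HI:                        # off the top: wrap, next bank
                nxt, self.bank = nxt - BANK_SIZE, (self.bank + 1) & 0xFF
            elif nxt < WINDOW_LO:                      # off the bottom: wrap, previous bank
                nxt, self.bank = nxt + BANK_SIZE, (self.bank - 1) & 0xFF
        elif nxt < 0 or self.addr < WINDOW_LO <= nxt:  # underflow, or base RAM into $A000+
            return False
        self.addr = nxt
        return True

def dma(machine, src, dst, num_bytes):
    """Copy bytes until done or a pointer halts; return what is left in num_bytes."""
    while num_bytes > 0:
        machine.write(dst.addr, dst.bank, machine.read(src.addr, src.bank))
        num_bytes -= 1
        src_ok, dst_ok = src.step(), dst.step()
        if not (src_ok and dst_ok):
            break
    return num_bytes

m = Machine()
m.low[0x1000:0x1030] = bytes(range(0x30))
dst = Pointer(0xBFF0, bank=1)
left = dma(m, Pointer(0x1000), dst, 0x30)              # crosses from bank 1 into bank 2
print(left, hex(dst.addr), dst.bank)
```

A stride of 0 on the destination gives the "DMA into VERA" case with no special mode: the pointer just stays on $9F23.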

This device wouldn’t need any RAM of its own to do these DMA transfers. To me, if it were to have its own RAM, it should have gobs of it, like 32Mb, so it adds value, like a RAMdisk or something, and that RAM should be referenced as a flat blob so no dealing with banks internal to the REU.

Let the programmer decide whether it’s a good or bad idea to DMA 128k in one shot or in smaller chunks to allow other operations. If I’m on a loading screen anyway, I don’t need to keep stopping and saying “are we there yet?”

Maybe I want to hop into the hyper sleep pod and don’t thaw me out until we’re in the Andromeda galaxy. Yes, I know my grandkids will be old by the time we’re there, but that’s the mission I signed up for.

 

Link to comment
Share on other sites

VERA would still need a second DMA (otherwise you get bus contention issues with its internal VRAM) and much more than 32 bits of interface with System RAM to make this work.

 

Hindsight on matters like these is always 20/20.

Link to comment
Share on other sites

9 hours ago, ZeroByte said:

I think that’s fancier than it needs to be. If I were making such a device I’d just have a few registers: src/dst_addr, src/dst_stride, src/dst_bank, and a 16-bit num_bytes register. Finally, a go register that has bit flags for latching behaviors of the DMA parameter registers upon completion, where 1=reset to beginning value, 0=leave them at the final value.

Yeah, I was avoiding a stride, to make it simpler. With a stride in the Vera, it's not necessary to have a stride in the data created to go into the Vera.

And letting the count be smaller, to make it simpler, even if the VHDL covers over the extra complexity.

So it seems to me you are saying what I was sketching is more complicated and then replacing it with a more complicated one, to avoid having the built in REU RAM, when I never even mentioned the more complicated part! (yet)

But a design without its own RAM is half the speed when filling the PCM with pre-designed chunks of data, or when filling Vera bitmaps with pre-designed chunks of data. I would not want to give up the chance at that 1 byte per clock, rather than 1 byte per two clocks, if I have the ability for it to be 1 byte per clock right there.

And having those residing IN the REU frees up storing them in the CX16 RAM, so it frees up anywhere up to 512K RAM in the CX16, which is anywhere from 25% to 100% of High RAM.
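A quick sketch of both claims, with Python as a calculator. It assumes the bus can do one transfer per clock, so a card with its own RAM needs only the write cycle per byte, while a card without it needs a read from CX16 RAM and then a write; the 512K and 2M High RAM sizes are the two ends of the range mentioned above. Function names are made up.

```python
def fill_cycles(num_bytes, has_own_ram):
    """Bus cycles to push num_bytes into a VERA port.

    With on-board RAM the card only uses the bus for the write (1 cycle/byte);
    without it, each byte takes a read cycle and then a write cycle.
    """
    return num_bytes * (1 if has_own_ram else 2)

def high_ram_fraction(freed_kb, high_ram_kb):
    """Share of CX16 High RAM that freed_kb of REU-resident data represents."""
    return freed_kb / high_ram_kb

for own in (True, False):
    cycles = fill_cycles(8192, own)
    print(f"own RAM={own}: {cycles} cycles, {cycles / 8_000_000 * 1000:.2f} ms per 8K at 8 MHz")

for high_kb in (512, 2048):
    print(f"512K freed out of {high_kb}K High RAM: {high_ram_fraction(512, high_kb):.0%}")
```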

 

Edited by BruceMcF
Link to comment
Share on other sites
