Proposal for a hardware-agnostic math accel API

m00dawg
Posts: 322
Joined: Wed Jul 08, 2020 12:41 am
Contact:

Proposal for a hardware-agnostic math accel API

Post by m00dawg »

I've been chatting about this a bit on the Discord in #homebrew-hardware and thought I'd collect my initial thoughts after batting some ideas around. This is mostly a thought experiment: what might a hardware-agnostic math accelerator API look like, one which could be implemented any number of ways?

Couple points first though:
  • My work on DreamTracker is far more important as I march towards a beta release, so this is mostly a fun thought exercise at the moment
  • Part of the exercise is to learn about FPGAs and RISCV. Being able to tinker with both on the X16 just seemed interesting even if not practical, since I can break things into small logical chunks (I'll explain in a moment). I can also play with these for some non-X16 projects, so I have options (once I have time anyway ;) ).
  • One of the things I love about the X16 is that it's a simple base to build on, and that base will likely be what most folks end up using. As a result, I'm not trying to reinvent the base architecture here; in fact I think it's incredibly important at this stage to keep the X16 as-is (beyond perhaps the 65816 upgrade, given it may not require any hardware changes, though even that I see as a bonus and not a requirement). This post isn't really about changing the X16 itself.
  • That said, since the X16 has an expansion bus, it makes for a fun platform to tinker on just for fun, and perhaps this weirdo idea might turn into a nice solution for cartridges, or become a popular enough expansion card to see adoption
With that said, I've been wanting to dabble in RISCV assembly now that I feel like I know 6502 pretty well. I admire RISCV for what it can do everywhere from small spaces like a retro computer all the way up to servers. It's practical, MIPS-like (which is what I studied a bit back in college), and open. Seems awesome! And it's the direction I'd like to head for my non-X16 projects. My "WaveBoy" Eurorack module could be well realized with a RISCV softcore plus some dedicated logic which constantly reads through the 4-bit/32-byte wave data, which seems super interesting to me and a great way to reacquaint myself with FPGAs (I had to use them a bit in college) and RISCV. I should say the current fast ARM solution (which uses an Adafruit ItsyBitsy M4) works just fine. But I wanted to get more low level. It's doing it because I can, not necessarily because I need to.

Anyways, while looking into things like FemtoRV and the UpDuino (and other Lattice-based FPGAs) for the above, I soon realized some of these have enough pins to connect to the entire X16 bus (including some important additional pins, like IRQB, RDY, etc.). So my first/grand idea was to sort out what a RISCV co-processor on a card might look like. It would run inside an FPGA as a soft-SOC, have its own RAM, and map the X16's RAM into its larger 32-bit address space. The idea being that the X16 CPU would still be used for KERNAL calls where possible (some things might have to be implemented in RISCV if needed, like a keyboard routine, else control would be passing back and forth), but issuing a call to the MMIO address of the card could instruct some front-end logic to pause the 6502 (via RDY), perhaps set up the return value, and then pass control to the RISCV core.

It seems like a crazy idea, but a neat one, though it's beyond my capability and is fairly esoteric in many ways. Given conversations on the Discord have come up about accelerator cards, I started thinking about ways to simplify this while still providing benefit. Just so happens the Lattice FPGA used by VERA is the same one used by the Upduino I'll be using initially. It has some on-board SRAM among other nice things. Looking at how the VERA is structured, it kinda points to an interesting solution.

Instead of hooking a RISCV up to the whole bus, one can just hook up the minimum part of the bus for the IO space (the IO pin plus the lower 5 address bus lines) and pass data back and forth through MMIO. In this way, one has a rudimentary set of registers with an instruction and status - so more or less a special purpose accelerator processing unit.
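To make the decode concrete, here's a tiny sketch in Python of what that minimal hookup implies (the function name is made up; the only assumption is one 32-byte I/O window selected by the IO pin plus A0-A4):

```python
# Sketch (assumption): with only the IO select pin and address lines A0-A4
# wired to the device, it sees a 32-byte register window; any bus address
# in the slot decodes to a register index by keeping the low 5 bits.
def decode_register(addr, io_selected):
    """Return the register index (0-31) a read/write targets, or None."""
    if not io_selected:
        return None
    return addr & 0x1F  # only A0-A4 reach the device

# e.g. an access to $9F23 with the slot's IO line asserted hits register 3
```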

All told I came up with this:

Code:

$9Fx0: Command / "Opcode"
$9Fx1: Control
$9Fx2: Status
$9Fx3: Operation Width
$9Fx4-$9Fx7: I0-I3 (Operand 1)
$9Fx8-$9FxB: J0-J3 (Operand 2)
$9FxC-$9FxF: R0-R3 (Result)
As a crude example, one might do something like the following:

Code:

; We want to multiply 2 numbers
lda #RV_MULTIPLY
sta RV_COMMAND

; Set precision/width (2 bytes)
lda #$02
sta RV_WIDTH

; Load I
lda #$12
sta RV_I
lda #$34
sta RV_I + 1

; Load J
lda #$56
sta RV_J
lda #$67
sta RV_J + 1

; With everything setup, tell the APU to go
lda #RV_EXECUTE
sta RV_CONTROL

; Wait for the results. This polls in a loop, but we could go do something else until an interrupt fires
@wait_for_result:
    lda RV_STATUS
    beq @wait_for_result

; Do something with the results
lda RV_RES
...
lda RV_RES + 1
...
I've been saying FPGA here but it was pointed out there are potential non-FPGA solutions (like the RP2040) which could respond the same way. Enter the "agnostic" part.

This could be used for both simple or complex things, like multiply, divide, floating point, matrix/vectors even, etc. The implementation details could be up to the designer implementing the API. In my case, I could either implement a RISCV soft-core and write the math in software or could make some bespoke logic structures directly on the FPGA making a custom simple processor. From the standpoint of the X16, it just sees some memory addresses.
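As a sketch of the "just some memory addresses" view (which is also what an emulator would implement), here's a rough Python model of the multiply example from earlier; the opcode value and field widths are assumptions, not a spec:

```python
# Hypothetical software model of the proposed register file, mirroring the
# $9Fx0-$9FxF layout above. The opcode value is a made-up placeholder.
RV_MULTIPLY = 0x03  # assumed opcode

class MathAccel:
    def __init__(self):
        self.i = bytearray(4)   # $9Fx4-$9Fx7 operand 1
        self.j = bytearray(4)   # $9Fx8-$9FxB operand 2
        self.r = bytearray(4)   # $9FxC-$9FxF result
        self.width = 1          # $9Fx3 operation width

    def execute(self, command):
        w = self.width
        iv = int.from_bytes(self.i[:w], "little")
        jv = int.from_bytes(self.j[:w], "little")
        if command == RV_MULTIPLY:
            # the result registers are 4 bytes, so keep the low 32 bits
            self.r[:] = ((iv * jv) & 0xFFFFFFFF).to_bytes(4, "little")

apu = MathAccel()
apu.width = 2
apu.i[0:2] = bytes([0x12, 0x34])  # $3412, little-endian like the 6502 code
apu.j[0:2] = bytes([0x56, 0x67])  # $6756
apu.execute(RV_MULTIPLY)
```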

A more extreme case, blending the original idea with this one, is having some operations optionally write data directly to VERA (by pausing the 6502). That would require some additional address lines (though still not the entire bus, since VERA is up in the I/O space as well), but if one needs to do complicated matrix calculations or some such, that might be an option. The nice thing about an API over an entire RISCV coprocessor sitting on the bus is that the API can be more easily implemented in the emulators as well, without an entire RISCV soft-core and all the complex handling of passing control back and forth.

It also could be used in a cartridge as an accelerator. This is where the RP2040 was mentioned, given its cost, and at X16 bus speeds its performance would probably be close to that of the FPGA.

And while I think the RISCV coprocessor idea is neat, and maybe I'll explore it one day, I think the above is a lot more attainable at my current skill level while still teaching me the FPGA concepts. It's something I'll be pursuing at some point once I get used to the Upduino and the open source tooling, and have time to put it on an expansion card (probably Kevin's protoboard first).

Anyway it might be a silly idea but I thought I'd collect my thoughts to share with folks. If nothing else, it shows one of the powerful features of the X16 (the exposed bus and expansion cards). And if you got all the way to the bottom of this diatribe, you're awesome!
Author of Dreamtracker (https://www.dreamtracker.com/)
Check Out My Band: https://music.victimcache.com/
m00dawg

Re: Proposal for a hardware-agnostic math accel API

Post by m00dawg »

Spent a bit of time after posting the above mapping out how instructions might work. In an FPGA (or perhaps with some RISCV extensions) I can do multiple operations at the same time if I'm simply building logical elements. That made me think of MMX-style operations, where the same operation is carried out on several lanes at once. That means one could add 4 sets of numbers all at once.

I came up with something like this:

Code:

# Instruction Format:

INST    : Instruction which honors precision (WIDTH)
*S      : Instruction which operates on each byte (8-bit) pair simultaneously (I[0] and J[0], e.g.)
*D      : Instruction which operates on each 2-byte (16-bit) pair simultaneously (I[0-1] and J[0-1], e.g.)
*Q      : Instruction which operates on a single 32-bit pair (I[0-3] and J[0-3])

Unless using an instruction which honors precision, the operation is carried out on ALL of I and J
and placed in R. In this way, it's possible to do the same calculation across sets of
operands. For instance, ADDS would do:

I0 + J0 = R0
I1 + J1 = R1
I2 + J2 = R2
I3 + J3 = R3

These are individual operations run together, a bit like MMX/SSE/Vector instructions.

The instructions which honor precision only operate on the values up to the precision. So setting
the WIDTH register to 2 and issuing ADD would do:

I[0-1] + J[0-1] = R[0-1]

It would not touch the values in I[2-3], J[2-3] or R[2-3]

(Point of discussion: if/how to handle carry or not)

# Add: Add numbers in I and J, place result in R
ADD
ADDS
ADDD
ADDQ

# Subtract: Subtract numbers in I from J, place result in R
SUB
SUBS
SUBD
SUBQ

# Multiply: Multiply numbers in I and J, place result in R
MUL
MULS
MULD
MULQ

# Divide: Divide numbers in I by J, place result in R
DIV
DIVS
DIVD
DIVQ

# Swap Upper/Lower: Swaps the top half bits with the bottom half
SWAP
SWAPS
SWAPD
SWAPQ

# And: Bitwise AND between I and J, placing the result in R
AND
ANDS
ANDD
ANDQ

# Or: Bitwise OR between I and J, placing the result in R
OR
ORS
ORD
ORQ

# Floating Point IEEE-754
(TBD)
These would each be separate commands in the above (so each gets its own COMMAND value), but the WIDTH register could be used to select the variant instead, which makes the instruction decoding more complicated but frees up 4x more individual commands (instructions).

Not sure which operations would be useful since of course the 6502 can already do ANDs, ORs, etc., but being able to do them across multiple bytes at once is an interesting thought.
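To pin down the packed semantics above, here's a small reference model in Python; it assumes lanes wrap modulo 256 (one possible answer to the open carry question) and that bytes beyond WIDTH pass through untouched:

```python
def add_s(i, j):
    """ADDS: four independent 8-bit adds, one per byte lane. Assumes lanes
    wrap modulo 256 (carry handling is an open question in the post)."""
    return bytes((a + b) & 0xFF for a, b in zip(i, j))

def add_w(i, j, r, width):
    """ADD honoring WIDTH: one little-endian add over the first `width`
    bytes; R's bytes beyond the width are left untouched."""
    iv = int.from_bytes(i[:width], "little")
    jv = int.from_bytes(j[:width], "little")
    total = (iv + jv) % (1 << (8 * width))
    return total.to_bytes(width, "little") + r[width:]
```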
LovelyA72
Posts: 1
Joined: Thu Aug 03, 2023 6:48 am

Re: Proposal for a hardware-agnostic math accel API

Post by LovelyA72 »

I think the main "selling point" of this solution is that it can be embedded onto a game cartridge for cheap (e.g. the RP2040 costs 70 cents in volume!) and enables game-specific enhancements, just like Nintendo's SA-1 in some SNES game cartridges.
m00dawg

Re: Proposal for a hardware-agnostic math accel API

Post by m00dawg »

Yep, IO7 is earmarked to be used for cartridges first, for things just like this.

Spent some time transcribing my ramblings into a Markdown which can be found here: https://www.dreamtracker.org/#ideas/apu/
BruceRMcF
Posts: 222
Joined: Sat Jan 07, 2023 10:33 pm

Re: Proposal for a hardware-agnostic math accel API

Post by BruceRMcF »

m00dawg wrote: Thu Feb 15, 2024 2:27 am Spent a bit of time after posting the above mapping out how instructions might work. In FPGA (or perhaps some RISCV extensions maybe) I can do multiple operations at the same time if I'm simply building logical elements. That made me think of doing MMX style operations where an operation can be carried out at the same time. That means one could add 4 set of numbers all at once.

[...]

Not sure which operations would be useful since of course the 6502 can already do ANDs, ORs, etc. but being able to do it across multiple bytes is an interesting thought.
An important thing is to be able to accumulate for later use in the API. Perhaps something like an internal bank of 8 registers: the first four eligible as either the "left" or "right" operand, all of them eligible as the target of the result or as the "right" operand, which show up in the I/O space register bank according to the operand register setting: aaa.ll.rrr.
m00dawg

Re: Proposal for a hardware-agnostic math accel API

Post by m00dawg »

That's an interesting thought. I'll have to ponder that! There are 16 entire bytes I'm not using in the I/O space, though I suspect you mean within the internal address space of the FPGA. Something like ring or bank registers, and/or an option for something like VERA's stride where you can keep writing to the same address. That would indeed allow operations to happen on large swaths of data efficiently.

In a similar but simpler vein, I was thinking while trying to go to sleep last night of having a transfer op which moves R into I or J, making repeated ops quicker (though that might only be faster than using the 6502 in certain situations). But something like:

LDA #APU_MULT
STA APU_INSTRUCTION
...
LDA #APU_TRJ
STA APU_INSTRUCTION
...
LDA #APU_MULT
STA APU_INSTRUCTION

Given "instructions" can really be anything (which feels kinda CISC-like, eeww :P), I could see having a complex instruction do multiple things (so a multiply where R gets placed into I, so you can re-run the multiply over and over). Or perhaps a flag that does this, which could be set via the control register. I originally rather liked the idea that I and J were for writing (from the perspective of the 6502) and R was always for reading; this kind of violates that, but perhaps not in a way that matters.
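As a sketch of the transfer-op idea (the TRJ mnemonic and opcode behavior here are hypothetical), a little Python model shows how chaining it lets repeated ops reuse the last result:

```python
# Hypothetical model: MUL multiplies I and J into R; TRJ copies R back
# into J so the next MUL picks up where the last one left off.
def run(program, i, j):
    r = 0
    for op in program:
        if op == "MUL":
            r = (i * j) & 0xFFFFFFFF  # result registers are 4 bytes
        elif op == "TRJ":
            j = r                     # transfer R into J
    return r

# MUL: 3*3=9, TRJ: J becomes 9, MUL: 3*9=27
powers = run(["MUL", "TRJ", "MUL"], 3, 3)
```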
BruceRMcF

Re: Proposal for a hardware-agnostic math accel API

Post by BruceRMcF »

I think you were reading my comment as being smarter than it was.

I was thinking 8 slots, and a byte that says for Result = op(Data1,Data2), which slot Data1 and Data2 come from and which slot Result goes into.

But it works better if, instead of fixed slots, what you have is four rows of operand specifications, and the relation byte specifies which of the rows to look in for Data1, Data2 and Result.

But thinking about the A_1 = op(A_0,Data1) actual "accumulator" pattern, and the "square operation" A_0 = op(Data1,Data1) pattern, and the issue of stable versus striding references, I've got a hunch that this might work:

Row_n (0-3): width, stride, byte0 Address of Result/Data2; byte0 Address of Data1

Then the operator relationship byte is ss.aa.ll.rr, where:
ss: which row does the stride come from
aa: which row is the result address (the first of the two) taken from
rr: which row is the operand1 address (the second of the two) taken from -- the only operand for monadic operations
ll: for dyadic operations, which row is the operand2 address (the first of the two) taken from.
On the bus, there are two operand windows, which via the relationship byte gives a window to the address referred to. Writing the first one gives the "left" operand, or initial value of an accumulator in a pure accumulator relationship, writing the second one gives the "right" operand, or sole operand for a monadic operation, and reading the first one gives the result value.

One thing this gives you is that MOVE can be seen as a monadic ":=" operation, so you can move from %rr to %aa using the stride %ss with a single implicit MOVE operation.

These sets of four operand addressing rows can be seen as akin to a sprite or a tile: you can set up the relationship patterns you need for various operations outside of a tight loop, and inside the tight loop you are just writing one or more of an operand byte, a relationship set index byte, and a relationship specification byte, and the two operand windows are set up the way you need them.
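As a quick sketch of one reading of that relationship byte (two bits per field, ss in the high bits — the exact packing is an assumption), in Python:

```python
def decode_relationship(rel):
    """Split the ss.aa.ll.rr byte into four 2-bit row indices.
    Field order (ss highest, rr lowest) is assumed from the notation."""
    return {
        "stride_row": (rel >> 6) & 0b11,  # ss: row the stride comes from
        "result_row": (rel >> 4) & 0b11,  # aa: row for the result address
        "left_row":   (rel >> 2) & 0b11,  # ll: operand2 row (dyadic ops)
        "right_row":   rel       & 0b11,  # rr: operand1 row (sole operand
                                          #     for monadic operations)
    }
```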
m00dawg

Re: Proposal for a hardware-agnostic math accel API

Post by m00dawg »

"I think you were reading my comment as being smarter than it was."

Haha, well if I did, your last comment was definitely smarter than I might be :) But I think I finally got it, at least part of it. To save cycles, one can load registers for the operations sequentially via a stride-like approach, so something like this:

Code:

lda #$01
sta APU_WIDTH
lda #APU_ADD
sta APU_INSTRUCTION
lda #$12
sta APU_REGISTER   ; first operand
lda #$34
sta APU_REGISTER   ; second operand
; some potential amount of waiting (loop to check a flag, etc.)
lda APU_RESULT
Here I used APU_RESULT, but if I read your explanation right, it could be the same address as APU_REGISTER. In the case of either multi-byte or SIMD, APU_WIDTH determines the order of loading the data.

Am I somewhat close to home plate here?
m00dawg

Re: Proposal for a hardware-agnostic math accel API

Post by m00dawg »

Been thinking about the stride stuff up above, though I've spent the last week down with a hurt back and then food poisoning :P So instead of doing some hard thinking, I looked at what my own implementation of the API might look like. While I'd actually like to design something discrete/purpose-built from LUTs in the FPGA, I'm thinking of using FemtoRV to write the instructions in RISCV first. RISCV won't be directly exposed on the bus like in my crazy plans which take over the bus. Instead I'll have a program which implements the various APU instructions as subroutines, similar to what it might look like on an RP2040.

This might force a change to a design point I wanted to keep flexible: APU_INSTRUCTION should be sent last, as that write is what the RISCV program can trigger on. So the previous example, assuming stride is being used, would be:

Code:

lda #$01
sta APU_WIDTH
lda #$12
sta APU_REGISTER
lda #$34
sta APU_REGISTER
lda #APU_ADD
sta APU_INSTRUCTION
; some potential amount of waiting (loop to check a flag, etc.)
lda APU_RESULT
This kinda makes sense for any instruction that would have otherwise needed a clock anyway, so I think it's a reasonable requirement to put in place.
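In model form (Python, register names as in the earlier sketches, opcode value assumed), "instruction last" just means the write handler for that one register is the trigger:

```python
# Sketch: writes to data registers only latch values; the write to the
# instruction register is what kicks off execution, so it must come last.
APU_ADD = 0x01  # assumed opcode

class TriggeredApu:
    def __init__(self):
        self.regs = {"WIDTH": 0, "I": 0, "J": 0, "R": 0}

    def write(self, reg, value):
        if reg == "INSTRUCTION":
            self.dispatch(value)    # execution starts on this write alone
        else:
            self.regs[reg] = value  # everything else just latches

    def dispatch(self, opcode):
        if opcode == APU_ADD:
            self.regs["R"] = self.regs["I"] + self.regs["J"]

apu = TriggeredApu()
apu.write("I", 0x12)
apu.write("J", 0x34)
apu.write("INSTRUCTION", APU_ADD)  # R is valid only after this write
```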

Nice thing about using the FemtoRV core is I can make use of the serial support already built in, both for troubleshooting bus connections and for debugging the operations. Once I have things working, I can then assess performance. If it's fast enough, that might be all that's needed. If not, the next step is to pull operations out into optimized logic blocks (akin to CISC, alas). The ICE40 is too small to use the RISCV vector extensions, and those don't seem to work quite how I was thinking anyway, which means for the SIMD-style stuff the RISCV program has to do some looping, whereas I could build a parallel adder in Verilog directly to avoid that entirely.

That's the plan anyway! I've ordered some of Kevin's prototype expansion cards, some level shifters, and some other odds and ends so I can mount the Upduino onto the card and actually start playing around with things pretty soon (in between working on DreamTracker anyway).
m00dawg

Re: Proposal for a hardware-agnostic math accel API

Post by m00dawg »

Been waiting all damn day for DHL: they got to my house in the morning, I realized they didn't have my (expensive) package in the van, and then they drove all around town stringing me along >:( But that gave me some time to work on some X16 projects. So I spent time squashing some DreamTracker bugs and also had a real bee in my bonnet, as it were, to see what the RISCV program might look like. Here's what it is so far, with just the ADD instruction:

Code:

// Register Aliases
#define INSTRUCTION_ADDR	s0
#define WIDTH_ADDR			s1
#define CONTROL_ADDR		s2
#define STATUS_ADDR			s3
#define I_ADDR				s4
#define J_ADDR				s5
#define R_ADDR				s6
#define I					a0
#define J					a1
#define R					a2

// RISC-V Addresses of our X16 APU registers
.equ    APU_INSTRUCTION,	0x02000000
.equ    APU_WIDTH,			0x02000001
.equ    APU_CONTROL,		0x02000002
.equ    APU_STATUS,			0x02000003
.equ    APU_I,				0x02000004
.equ    APU_J,				0x02000008
.equ    APU_R,				0x0200000C

// 24-bit math mask (keep the low three bytes of a little-endian word)
.equ	MASK_24BIT,			0x00FFFFFF

// Literals
.equ	INST_ADD,			0x01

// Width Literals
.equ    WIDTH_8BIT, 			0x01
.equ    WIDTH_16BIT, 			0x02
.equ    WIDTH_24BIT, 			0x03
.equ    WIDTH_32BIT, 			0x04
.equ    WIDTH_SIMD8, 			0x11
.equ    WIDTH_SIMD16, 		0x12

// Status Literals
.equ    STATUS_COMPLETE, 	0b01000000

.global _start
_start:
main:

setup:
	la INSTRUCTION_ADDR, APU_INSTRUCTION
	la WIDTH_ADDR, APU_WIDTH
	la CONTROL_ADDR, APU_CONTROL
	la STATUS_ADDR, APU_STATUS
	la I_ADDR, APU_I
	la J_ADDR, APU_J
	la R_ADDR, APU_R

// Loop until instruction is not zero
idle_loop:
	lbu a3, 0(INSTRUCTION_ADDR)
	c.beqz a3, idle_loop
lookup_instruction:
	li t1, INST_ADD
	beq a3, t1, add_instruction

post_instruction_cleanup:
	// Set instruction back to 0
	sb x0, 0(INSTRUCTION_ADDR)
	// Set the operation complete status
	lbu t0, 0(STATUS_ADDR)
	li t1, STATUS_COMPLETE
	or t0, t0, t1
	sb t0, 0(STATUS_ADDR)
	c.j idle_loop


add_instruction:

check_width:
	lbu t0, 0(WIDTH_ADDR)
	// Check Width Operation
	li t1, WIDTH_8BIT
	li t2, WIDTH_16BIT
	li t3, WIDTH_24BIT
	li t4, WIDTH_32BIT
	li t5, WIDTH_SIMD8
	li t6, WIDTH_SIMD16
	beq t0, t1, add_single_8bit
	beq t0, t2, add_single_16bit
	beq t0, t3, add_single_24bit
	beq t0, t4, add_single_32bit
	beq t0, t5, add_simd8
	beq t0, t6, add_simd16
	// If we're here, bad state, so don't do anything
	c.j post_instruction_cleanup
	
add_single_8bit:
	lbu I, 0(I_ADDR)
	lbu J, 0(J_ADDR)
	c.add I, J
	sb I, 0(R_ADDR)
	j post_instruction_cleanup

add_single_16bit:
	lhu I, 0(I_ADDR)
	lhu J, 0(J_ADDR)
	c.add I, J
	sh I, 0(R_ADDR)
	j post_instruction_cleanup

/*
 Odd one out since we don't have a 3-byte load
*/
add_single_24bit:
	lw I, 0(I_ADDR)
	lw J, 0(J_ADDR)
	li a3, MASK_24BIT
	c.and I, a3
	c.and J, a3
	add R, I, J

	// To preserve R's top byte, grab the 4th byte of R (offset 3),
	// store the full 32-bit sum, then put that byte back.
	lb t0, 3(R_ADDR)
	sw R, 0(R_ADDR)
	sb t0, 3(R_ADDR)
	j post_instruction_cleanup

add_single_32bit:
	lw I, 0(I_ADDR)
	lw J, 0(J_ADDR)
	add R, I, J
	sw R, 0(R_ADDR)
	j post_instruction_cleanup

/* Add each 8-bit pair of I and J into R */
// Unrolled vs Loop
add_simd8:
	lbu I, 0(I_ADDR)
	lbu J, 0(J_ADDR)
	c.add I, J
	sb I, 0(R_ADDR)

	lbu I, 1(I_ADDR)
	lbu J, 1(J_ADDR)
	c.add I, J
	sb I, 1(R_ADDR)

	lbu I, 2(I_ADDR)
	lbu J, 2(J_ADDR)
	c.add I, J
	sb I, 2(R_ADDR)

	lbu I, 3(I_ADDR)
	lbu J, 3(J_ADDR)
	c.add I, J
	sb I, 3(R_ADDR)

	j post_instruction_cleanup

/* Add each 16-bit pair of I and J into R */
// Unrolled
add_simd16:
	lhu I, 0(I_ADDR)
	lhu J, 0(J_ADDR)
	c.add I, J
	sh I, 0(R_ADDR)
	lhu I, 2(I_ADDR)
	lhu J, 2(J_ADDR)
	c.add I, J
	sh I, 2(R_ADDR)
	j post_instruction_cleanup

Compared to 6502, there are some things I like and some things I don't. I miss being able to reference a literal value with #THING, and I haven't sorted out when to make something a routine versus when to just jump. Since the address space is much larger, jumping is what I ended up doing, but that likely won't scale the whole way up.

On that note, using GNU gas, I haven't sorted out how to segment things as well as I do in ca65 using scopes, procs, and the like. Being a single-threaded RISCV program, the "SIMD" is a lie: the assumption is the FPGA will be clocked much higher, so it has the appearance of SIMD from the X16 side. Building a 4x8-bit adder in logic would be much faster.

Also missing here is the register bank Bruce got me thinking about. Really it's just adding, so this is about as simple as it gets, but I wanted to get an idea of how it might work.
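As a cross-check on the width dispatch, here's the same behavior in Python (a reference model, not the actual implementation; note the 24-bit case here masks the low three bytes, which is where a little-endian 24-bit value lives, and preserves R's top byte):

```python
def apu_add(i, j, r, width):
    """Model of the ADD routine's width handling over 4-byte I, J, R."""
    iv = int.from_bytes(i, "little")
    jv = int.from_bytes(j, "little")
    if width == 3:  # 24-bit: mask operands, keep R's 4th byte intact
        total = ((iv & 0xFFFFFF) + (jv & 0xFFFFFF)) & 0xFFFFFF
        return total.to_bytes(3, "little") + r[3:]
    n = 1 << (8 * width)
    total = ((iv % n) + (jv % n)) % n
    return total.to_bytes(width, "little") + r[width:]
```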