8-Bit Smackdown!



Interesting vlog. I think it is a lot like comparing apples and oranges. The 6502 family had a smaller instruction set to be sure, but it was easier to remember the whole thing. An opcode byte was always an opcode byte, and the value of the byte didn't depend on the context as much. But certain things that were easy to do on Z80 were harder to do on 6502.

As far as the difference between immediate and memory addressing, that's more an "accident" of the different syntax. One could just as easily write an assembler that just used a modified syntax for the 6502:

"LDA #$42" could be "LDA $42" then
"LDA $42" could be "LDA ($42)"

Years of x86 actually led me to prefer the alternative syntax. Using something like that for LDA, the alternatives for the 65C02 addressing modes come out to something like:

"LDA $42"
"LDA [$42]"
"LDA [$42+X]"
"LDA [$42+Y]" (note: there is no zero page indexed by y, so it would be assembled as absolute indexed by y)
"LDA [$4242]"
"LDA [$4242+X]"
"LDA [$4242+Y]"
"LDA [[$42]]"
"LDA [[$42+X]]"
"LDA [[$42]+Y]"

I personally like that (though it probably wouldn't gain much traction, given its unfamiliarity to anyone used to traditional 6502 syntax) because it looks like memory addressing, and more levels of brackets are just more levels of indirection.
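To make the mapping concrete, here is a minimal sketch of a translator from the bracket notation above back to traditional 65C02 syntax. The rule set, function name, and coverage (LDA only, upper-case hex) are all illustrative assumptions, not a real assembler.

```python
import re

# Hypothetical rules mapping the bracket syntax to traditional 65C02 syntax.
# Order matters: more deeply nested (more indirect) forms are tried first.
RULES = [
    (re.compile(r"^LDA \[\[(\$[0-9A-F]+)\]\+Y\]$"), r"LDA (\1),Y"),  # zp indirect indexed
    (re.compile(r"^LDA \[\[(\$[0-9A-F]+)\+X\]\]$"), r"LDA (\1,X)"),  # zp indexed indirect
    (re.compile(r"^LDA \[\[(\$[0-9A-F]+)\]\]$"),    r"LDA (\1)"),    # zp indirect (65C02)
    (re.compile(r"^LDA \[(\$[0-9A-F]+)\+([XY])\]$"), r"LDA \1,\2"),  # indexed
    (re.compile(r"^LDA \[(\$[0-9A-F]+)\]$"),        r"LDA \1"),      # direct
    (re.compile(r"^LDA (\$[0-9A-F]+)$"),            r"LDA #\1"),     # bare operand = immediate
]

def to_traditional(line):
    """Translate one bracket-syntax line to traditional 65C02 syntax."""
    for pat, repl in RULES:
        if pat.match(line):
            return pat.sub(repl, line)
    raise ValueError("unrecognized: " + line)
```

Note how the immediate mode falls out naturally: an operand with no brackets at all is a literal value, which is the whole point of the notation.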

As for the argument that the traditional indexed indirect isn't very useful, I think it can be incredibly useful. It might not be so useful in a Commodore or general purpose computer, but I can easily imagine it being used in some embedded application where a range of ZP is used to hold addresses to parallel arrays, and X can be used to select which array you want. Or ZP is used as a pointer stack to arguments being processed by a virtual stack machine. The biggest problem with this on legacy Commodore systems (I am not as familiar with Atari, Apple, etc.) is the seemingly random use of ZP by the KERNAL and BASIC, which leaves very few ZP locations free for user applications.
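The parallel-array idea can be modeled in a few lines of Python. This is a sketch of what LDA (zp,X) does, not real hardware emulation; the memory layout, addresses, and function name are all hypothetical.

```python
# Model of 6502 indexed indirect addressing: LDA (zp,X).
# Zero page holds 16-bit little-endian pointers to parallel arrays;
# X (a multiple of 2) selects which pointer gets dereferenced.
memory = bytearray(65536)

# Hypothetical layout: three parallel arrays with ZP pointers at $10.
arrays = {0x2000: b"abc", 0x3000: b"def", 0x4000: b"ghi"}
for addr, data in arrays.items():
    memory[addr:addr + len(data)] = data
for i, addr in enumerate(arrays):
    memory[0x10 + 2 * i] = addr & 0xFF        # pointer low byte
    memory[0x10 + 2 * i + 1] = addr >> 8      # pointer high byte

def lda_indexed_indirect(zp, x):
    """LDA (zp,X): fetch the pointer at zp+X in zero page, load through it."""
    ptr = memory[(zp + x) & 0xFF] | (memory[(zp + x + 1) & 0xFF] << 8)
    return memory[ptr]
```

With X = 0, 2, or 4, the same instruction reads the first element of the first, second, or third array respectively, which is exactly the "X selects the array" usage.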

Edited by Scott Robison

It's almost worth creating my own syntax with a simple transpiler, given the amount of time I've spent hunting bugs where I missed out the '#' in a 'lda $xx'.

If it wasn't for losing the syntax colouring in VSC, I'd probably have done it by now!

Asm keywords are a bit archaic. Made sense back in the day, as you didn't want big source files. Not so much nowadays.


That remark about (zp,X) reminded me of Forth using the zero page for the return stack -- the "Rack". Then LDA (zp,X) is fetching the xt.

EXIT: DEX : DEX

NEXT: INC RS,X : BEQ + : -- LDA (RS,X) : STA V

  INC RS,X : BEQ ++ : - LDA (RS,X) : STA V+1 : JMP (V)

+ INC RS+1,X : BRA --

++ INC RS+1,X : BRA -

 

ENTER: ; JSR ENTER : .word xt1, xt2, ... EXIT

  INX : INX : PLY : PLA : INY : BNE + : INC : + STY RS,X : STA RS+1,X : BRA --

___________________

I glitched and wrote that with JMP (RS,X) but that would be jumping into the compiled Forth word, rather than the address of the Forth primitive.

As far as why it is not INC RS,X : BNE + : INC RS+1,X ... the overflow happens on average once every 128 xt's, so saving a clock when not fixing overflow easily pays for branching back in on the rare event that overflow is fixed. Aligning on even addresses would save two clocks in every NEXT, but it's a pain to do aligned addresses when it's not required by the CPU.
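That tradeoff can be checked with plain arithmetic on the standard 65C02 cycle counts (INC zp,X = 6, branch taken = 3, branch not taken = 2, BRA = 3, per the WDC datasheet). This sketch just compares the expected cost per increment of the two variants; the probability model (a wrap once per 256 increments) is the assumption stated above.

```python
p = 1 / 256  # a zero-page byte wraps to zero once every 256 increments

# Variant used above: BEQ branches out of line only on the rare overflow.
# No overflow: INC(6) + BEQ not taken(2). Overflow: INC(6) + BEQ taken(3)
# + INC high byte(6) + BRA back(3).
out_of_line = (1 - p) * (6 + 2) + p * (6 + 3 + 6 + 3)

# Alternative: INC RS,X : BNE + : INC RS+1,X with the fix inline.
# No overflow: INC(6) + BNE taken(3). Overflow: INC(6) + BNE not taken(2)
# + INC high byte(6).
inline = (1 - p) * (6 + 3) + p * (6 + 2 + 6)

assert out_of_line < inline  # branching out wins on average
```

The out-of-line version averages about 8.04 cycles per increment against about 9.02 inline, so the saved clock on the common path easily pays for the extra branch on overflow.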

Edited by BruceMcF

On 5/9/2021 at 4:57 PM, BruceMcF said:

I glitched and wrote that with JMP (RS,X) but that would be jumping into the compiled Forth word, rather than the address of the Forth primitive.

JMP (a,X) can be used, but only with self-modifying code. And using X as an instruction pointer index means using ,Y for one of the two stacks ... which may as well be the Rack at $9FE0,Y, because there is no zp,Y mode for accumulator addressing. To claw a couple of clocks back, it can be a pair of byte stacks, so ,Y only requires adjusting by one.

IP = NEXT1+1

EXIT: DEY : LDA RL,Y : STA IP : LDA RH,Y : STA IP+1 : LDX #2 : BRA NEXT1

NEXT: INX : INX : BMI +

NEXT1: JMP (0,X)

+ TXA : CLC : ADC IP : STA IP : LDX #0 : BCC NEXT1 : INC IP+1 : BRA NEXT1

ENTER: INY : TXA : CLC : ADC IP : STA RL,Y :  LDA IP+1 : ADC #0 : STA RH,Y

 PLA : STA IP : PLA : STA IP+1 : LDX #1 : BRA NEXT1

So a 12 clock cycle NEXT can be obtained with JMP (a,X), but at the cost of a more complex ENTER/EXIT, and by using self modifying code.
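For reference, the 12-clock figure comes from tallying the fast path of NEXT above (INX : INX : BMI not taken : JMP (0,X)) with the standard 65C02 cycle counts; this is just that sum written out.

```python
# 65C02 cycle counts (WDC datasheet): INX = 2, branch not taken = 2,
# JMP (abs,X) = 6. The fast path of NEXT runs all four in sequence.
INX, BMI_NOT_TAKEN, JMP_ABS_X_INDIRECT = 2, 2, 6
next_fast_path = INX + INX + BMI_NOT_TAKEN + JMP_ABS_X_INDIRECT
assert next_fast_path == 12
```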

___________________________

The CLEAN model that works with (zp,X) is actually a bit-threaded model. In this model, the start of a code word is aligned onto an even address, but primitives are tokens into a vector table of up to 127 addresses based at ODD addresses, so if the bottom bit of the low byte is "1", that is the token for a primitive, while if the bottom bit is "0", that is the first byte of the address of compiled code. So words compile to LESS than two bytes per word, and there is no need for a leading "JSR ENTER", saving three more bytes per definition -- more than paying back the extra byte needed (roughly half the time) to align the start of each compiled definition.

 

EXIT: DEX : DEX

NEXT: INC RS,X : BNE + : INC RS+1,X

NEXT1: LDA (RS,X) : BIT #1 : BEQ ENTER : STA PRIM : JMP (PRIM)

ENTER: INX : INX : STA RS,X : INC RS-2,X : BNE + : INC RS-1,X : + LDA (RS-2,X) : STA RS+1,X : BRA NEXT1
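The low-bit dispatch in NEXT1 (LDA (RS,X) : BIT #1 : BEQ ENTER) can be sketched as a tiny Python decision function. The table contents and names here are hypothetical; the point is only the odd/even discrimination between primitive tokens and compiled-code addresses.

```python
# Hypothetical vector table: primitives live at the odd token values.
PRIMTABLE = {t: "prim%02x" % t for t in range(1, 256, 2)}

def dispatch(byte):
    """Mirror the 'BIT #1 : BEQ ENTER' test in NEXT1 above: an odd byte
    is a primitive token, an even byte starts a compiled-code address
    (compiled words are aligned to even addresses)."""
    if byte & 1:
        return ("primitive", PRIMTABLE[byte])  # odd: token into the table
    return ("enter", byte)                     # even: first byte of an address
```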

 

 

Edited by BruceMcF


I looked at this more closely: I found one model where using LDA (R,X) is an actual optimization. This is using (IP),Y as the main Forth program counter, and an X-indexed return stack in the zero page, and up to 64 primitives that are even bytes from $80 to $FE. A pointer to high level compiled code is a big-endian (high byte first) integer, and with compiled Forth code in Low RAM, this is an address <= $7FFF, so the high byte is positive. (This might be flipped around for a core in $8000-$9EFF and modules in $A000-$BFFF High RAM banks).

Where (zp,X) mode is used is in fetching the second byte of the compiled code address; it gets around the need to store the first byte in a temporary location somewhere while juggling.

; IP is a two byte location in ZP with the low byte zero, PRIM is a three byte location in ZP with "JMP (PRIMTABLE)", R is a 64+ byte return stack in zero page.

EXIT: DEX : DEX : LDY R,X : LDA R+1,X : STA IP+1
NEXT: INY : BEQ +
NEXT0: LDA (IP),Y : BPL ENTER : STA PRIM+1 : JMP PRIM
+ INC IP+1 : BRA NEXT0
ENTER: INY : BEQ +
- INX : INX : STY R,X : LDY IP+1 : STY R+1,X : STA IP+1 : LDA (R,X) : TAY : BRA NEXT0
+ INC IP+1 : BRA -

If Exit and Enter are executed on average 1/8th of the time each, that is a 22 cycle NEXT, additional 34 cycles for the EXIT, and NET additional 40 cycles for the ENTER (because execution of ENTER shortcuts the call through the primitive table), for 22+(34+40)/8 = 31.25 cycles overhead. So nowhere near as fast as the ~12.4 cycles of subroutine threading, but with single byte primitives, probably very compact compiled code ... especially if it is a separated dictionary system. So the word density of individual HighRAM modules would likely be fairly high.
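Checking that average with the figures quoted (22-cycle NEXT, 34 extra for EXIT, 40 net extra for ENTER, each running once per eight words):

```python
next_cycles, exit_extra, enter_extra = 22, 34, 40
freq = 1 / 8  # EXIT and ENTER each assumed to run once per eight words
average = next_cycles + freq * (exit_extra + enter_extra)
assert average == 31.25
```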

 

Edited by BruceMcF
