Custom BASIC interpreter for X16


VincentF

Hello Everybody,

A few days ago I started thinking about making a custom BASIC interpreter for the X16, one that would take advantage of the entire memory map and the capabilities of the machine.

It's still "on paper": right now I have a very small prototype that can read a number from the prompt, and that's all. I started writing the code into banked RAM, but I ran into issues because I want to use the other banks, so I need to find out how to get my program into one of the ROM banks instead.

 

To explain how I see this project: I'd like to use the RAM banks to store the variables (and, why not, the BASIC programs too). Also, considering the speed of the X16, I'd like to see if we can tokenize the lines instead of storing plain text. This would remove the need to optimize every last byte of a BASIC line (which makes programs unreadable), though it does require extra processing to tokenize each line and "untokenize" it when listing. Running pre-tokenized lines should make the BASIC runtime faster.

For variable names, I picture a "VARTAB" section that contains the two characters of the variable name followed by its address. Lines of BASIC that use the variable would hold a pointer to a VARTAB entry. If the variable's address changes (as happens with strings), only the VARTAB entry is updated.
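To make the indirection concrete, here is a minimal sketch in Python. The structure and names are hypothetical, purely to illustrate how a tokenized line can hold a stable VARTAB index while the address behind it moves:

```python
# Hypothetical VARTAB: each entry pairs a two-character variable name with
# its current address. Tokenized lines store an *index* into this table,
# never the address itself, so when a string variable is reallocated only
# the VARTAB entry changes and every line referencing it stays valid.

class VarTab:
    def __init__(self):
        self.entries = []   # list of [name, address]
        self.index = {}     # name -> entry index

    def ref(self, name):
        """Return a stable index for a variable, creating it if needed."""
        name = name[:2]     # only two significant characters
        if name not in self.index:
            self.index[name] = len(self.entries)
            self.entries.append([name, None])
        return self.index[name]

    def set_address(self, idx, addr):
        self.entries[idx][1] = addr   # relocation touches one entry only

    def address_of(self, idx):
        return self.entries[idx][1]
```

A tokenized line holding index `i` keeps resolving correctly after the string moves, since only `set_address` is called on relocation.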

Constant values will be tokenized as well, so they'll be stored in variable space (or in a dedicated constant space) without any entries in VARTAB.
I imagine variable types like INT16, INT32, FLOAT16, STRING, ARRAY, etc., and even a special type that represents a segment of free space.

 

I think I'm getting a bit messy in my explanations, so I'll stop for now. Nothing's fixed; I'm just trying to gather ideas and feedback, and maybe some of you would be interested in participating?

@rje If I remember correctly, you already worked on a tokenizer? Do you think this project is viable?

I'm doing this just to learn how it all works, so if it turns out to be a bad idea, at least I had some fun imagining the thing and building the prototype. 🙂

Edited by VincentF
title change

In addition to these things, I'd like to add a few more items to the specification:

 

Reserved words to handle sprite frames and tiles, and syntax (and possibly garbage collection) to handle (re)loading of CLUTs without having to resort to a whole bunch of VPEEKs and VPOKEs.

Reserved words to handle access to the PSG and the Yamaha YM2151's sound channels, and dynamic file linking to handle direct access to the YM3012 DAC and the PCM channel.


No need to abandon your idea. Speaking as one who has released a language commercially through a previous employer: you will understand a language so much more by actually implementing one yourself!

I've got ideas for an environment I'd like to create ... just need more time to be able to focus on it.


I think that, specifically, tokenizing from an editor, line by line, is plenty fast on the X16.

Tokenizing an entire file all at once would probably be a bit laggy.  The Python model.  But, it might be OK?  You'd have to try it and see.

Banks ... just spitballing here ... banks could serially hold tokenized programs.  The "heap" would (probably?) have to be in main RAM, i.e. common to all the banks.  But the banks are a serial run of "memory hunks", so when you hit a bank boundary, you just skip lightly into the next bank and keep going.  That seems OK to me.
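As a sketch of that "serial run of memory hunks" idea: banked RAM on the X16 is an 8 KB window at $A000-$BFFF, so a program pointer is really a (bank, address) pair, and crossing a bank boundary is just arithmetic. Illustrative Python, not an implementation:

```python
# Banked RAM on the X16 appears as an 8 KB window at $A000-$BFFF; which
# 8 KB you see is selected by the bank register. Treating the banks as one
# serial run means a program pointer is a (bank, address) pair that wraps
# into the next bank when it passes the end of the window.

BANK_BASE = 0xA000
BANK_TOP  = 0xC000   # first address past the window
BANK_SIZE = BANK_TOP - BANK_BASE   # 8192 bytes

def advance(bank, addr, n=1):
    """Advance a banked pointer by n bytes, crossing bank boundaries."""
    offset = (addr - BANK_BASE) + n
    bank += offset // BANK_SIZE
    addr  = BANK_BASE + offset % BANK_SIZE
    return bank, addr
```

Reading one byte past $BFFF in bank 0 lands you at $A000 in bank 1, so "skip lightly into the next bank and keep going" is a divide and a modulo.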

* *

And for BASIC it's probably OK... although... I suspect programs need more data space than program space (I mean, a 40K BASIC program is a monster.  I've been there.). So banks really are ideal for data.

Now if that data just happens to be program fragments, that's okay too.  But ... banks are best for data.  Not that you can't make it work -- and it might work really well -- but I'm just thinking this. 

 

Edited by rje

I think that tokenizing an entire program at run time, where "tokenizing" is simply replacing multiple character words with one byte tokens, probably wouldn't be *too* bad, but the longer the program, the more obvious it would be. That's what makes the traditional BASIC feedback loop so powerful. The language is dog slow, but it quickly gives you feedback. You don't notice how much time is being spent tokenizing individual lines as you go, but it would become obvious for a long enough program. Plus the whole program might not fit in memory if it wasn't tokenized a line at a time (for a sufficiently long program, anyway).

Python does far more than tokenize in its startup process. It is actually compiling, not just interpreting, the code, though it does compile to pcode instead of machine language. So Python has the benefits of compiling and the drawbacks of interpreting, and the drawbacks of compiling and the benefits of interpreting. It gives you a faster feedback loop than say C++, but still takes more time to compile at the beginning. But then, unlike BASIC, it doesn't have to re-parse every line every time. It is able to detect syntax errors and report them early rather than having to wait until it tries to execute that line.


Love, love love these projects. 

If I can posit any advice from the cheap seats (i.e., from someone like me who has absolutely no ability to make such a thing, but who is a user of BASIC and enjoys it), it would be this: there are things that BASIC needs, but not everything from 'gen 3' BASICs like Visual Basic needs to be included or is feasible. Remember, at bottom, this platform is still an 8-bit machine with no prefetch, no branch prediction, no modern pipelining, no cache memory, almost no maths, and a 16-bit address space. It's running at 8 MHz, maybe, unless (a) they can't get it to work at that speed, in which case it will be a 4 MHz machine, or (b) they release the faster X8 FPGA version (which won't have the banked memory, ZP space, or the space in the pages below $0800 that the X16 has), and that probably won't be able to fit all your features anyway.

Just take a look at the discussion near the end of the "BASIC 2 vs BASIC 7" thread about the impact of just a few more instructions in the byte-fetching/parsing core between the C64 and the later Plus/4 and C128, and the negative effect on performance for the later machines with better BASICs. If you're having to bank-switch, for example, it surely takes a hit.

Tokenizing a lot of stuff inline (e.g., constants, jumps, variable memory locations) is a great idea, but I suggest a simple escape-code structure using byte codes. The parser finds, say, the PETSCII code for '@' not inside quotes, and it knows the next two bytes are a small int in 16-bit signed format; it finds the PETSCII code for the English pound sign (which looks a bit like a mutant F), and it knows the next 5 bytes are the exponent and mantissa of a float; it finds the token for 'GOTO' or 'GOSUB', and it knows the next two bytes are the actual 16-bit address of the jump destination in the code, instead of the PETSCII numeric representation of a line number; it finds the PETSCII code for '%', and it knows the next two bytes are the 16-bit address of the value, followed by the name of an int-style variable in memory. (At execution it just needs to fetch the value; during a LIST operation it grabs the variable name at that address+2 until it hits the terminator.) Yeah, OK, the modern way to do many of these things would be with a hash table, but I caution you to consider the performance impact on an 8-bit machine.
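A sketch of the escape-code idea for the 16-bit integer case, in Python. Using $40 ('@') as the escape byte follows the suggestion above, but the exact codes are an assumption; the point is the encode/decode round trip:

```python
import struct

# Hypothetical escape byte, standing in for the PETSCII '@' suggested
# above. Other escapes (5-byte float, inlined GOTO address, variable
# address) would follow the same prefix-plus-payload pattern.
ESC_INT16 = 0x40

def encode_int16(value):
    """Inline a signed 16-bit constant as escape byte + little-endian value."""
    return bytes([ESC_INT16]) + struct.pack("<h", value)

def decode_int16(stream, pos):
    """At execution, the interpreter sees the escape and grabs two bytes."""
    assert stream[pos] == ESC_INT16
    (value,) = struct.unpack_from("<h", stream, pos + 1)
    return value, pos + 3   # value, and the new program-counter position
```

The win is that the runtime never re-parses ASCII digits: the value is already in machine format, three bytes long regardless of how many digits the literal had.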

If you use the idea of inlining 16-bit addresses for jump locations to speed up execution, of course, then there are other issues. With line numbers, your LIST routine needs only to follow the 16-bit address, grab the line number at that address, and put it on the screen during a LIST; but with labels, you will need to set up a data structure (probably a linked list) that can be consulted by the interpreter during LIST operations to regurgitate the labels or line numbers, and that metadata has to get saved with the program. That's actually a better place to use banked memory: the performance cost of swapping banks is not as important when listing the code. I don't think it's feasible to tokenize at runtime; it needs to happen as you enter things.


30 minutes ago, Snickers11001001 said:

If you use the idea of inlining 16-bit addresses for jump locations to speed up execution, of course, then there are other issues. With line numbers, your LIST routine needs only to follow the 16-bit address, grab the line number at that address, and put it on the screen during a LIST; but with labels, you will need to set up a data structure (probably a linked list) that can be consulted by the interpreter during LIST operations to regurgitate the labels or line numbers, and that metadata has to get saved with the program. That's actually a better place to use banked memory: the performance cost of swapping banks is not as important when listing the code. I don't think it's feasible to tokenize at runtime; it needs to happen as you enter things.

I like what you wrote, and I wrote much the same myself months ago in another thread. I think an improved tokenizer that tokenized the entire line would help BASIC runtime a lot. But even if you have a tokenizer that knows to create two tokens for "GOTO 1234" (one for GOTO, one for the 16-bit value), all you've got is the next line number. You still have to find where that line is in memory in the linked list of BASIC lines.

But in general I am in complete agreement. A better tokenization scheme could go a long way to improving BASIC performance.

Edit: Sorry, that doesn't read well. I don't mean "I had a great idea and good for you"; I meant "you have a great idea and I agree". It came off more self-congratulatory than I intended.

Edited by Scott Robison

 

2 hours ago, Snickers11001001 said:

Love, love love these projects. [...] I don't think its feasible to tokenize at runtime, it needs to be as you enter things.

This is not far off from what I suggested a while back. Tokenize not just the keywords (PRINT, GOTO, etc) but also variable names and numeric literals. Assuming 01-12 are available as token codes, we could use:
01 - 8 bit byte (an integer literal between 0 and 255)
02 - 16 bit integer  (any integer literal between -32768 and 65535)
03 - 40 bit float  (any numeric literal with a decimal point, eg: 3.14 or 1.0)
04 - byte variable (# sigil or DIM x AS BYTE)
05 - integer variable (% sigil or DIM x AS INT)
06 - float variable (! sigil or DIM x AS FLOAT)
07 - string variable ($ sigil or DIM x AS STRING)
08 - label
09 - start of subroutine 
10 - start of function
11 - end of function or subroutine

PRINT 1234 gets changed to 
94 02 34 12

PRINT A$ might get converted to
94 07 01

and A$="HELLO" becomes
07 01 B2 "HELLO"

You could also change types on the fly by referencing a variable with a different sigil. So 
X = $1234
could be referenced with the byte sigil and would act like a 2 byte array:
PRINT X#  
34
PRINT X#(1) 
12
(Remember that arrays are zero-based)

This implies that arrays are nothing special: array variables would simply reserve more than 1 space in the heap, so:
DIM NAMES$(25) AS STRING would reserve 50 bytes on the heap, and if you recalled NAMES%(x), you would get back a 2-byte value, which is actually the pointer to the string. 

Where this comes in useful is creating large, arbitrary data arrays. For example, rooms in an adventure game. 
DIM ROOM#(1024) creates a 1K chunk of memory that can be used for any purpose. You could then load rooms in on-demand from disk, every time the player moves from one room to another. 

Labels and subroutine names would simply be more entries on the variable table.

The variable table itself is super simple:
01-02: data/code address
03: length of variable name
04-?? text of variable name

There are no types on the variable table, because the type is determined at runtime based on the token code. The token code is determined at compile time based on the sigil or a DEF <BYTE | INT | FLOAT> statement.

There are a ton of advantages to this system. Right now, the BASIC routines all have to parse their own data. Doing it this way means the data is pre-parsed. The routines simply read the parameters directly out of the program stream.

The actual program text can be more compact, too. You don't store spaces. You don't store commas in parameter sequences. Those are just discarded and re-created if the program is detokenized (listed).
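A toy Python version of this scheme, using the $94 PRINT code and the 02 integer token from the byte streams above (everything else here is illustrative, not a spec). One detail worth calling out: decimal 1234 is $04D2, so its little-endian payload comes out as D2 04:

```python
# Toy tokenizer for the scheme above: keyword tokens plus typed literal
# tokens. The $94 code for PRINT is taken from the example byte streams;
# TOK_INT16 is the proposed 02 "16-bit integer literal" token.

KEYWORDS = {"PRINT": 0x94}
TOK_INT16 = 0x02

def tokenize(line):
    """Tokenize a (very restricted) line: keywords and integer literals."""
    out = bytearray()
    for word in line.split():
        if word in KEYWORDS:
            out.append(KEYWORDS[word])
        elif word.lstrip("-").isdigit():
            n = int(word) & 0xFFFF            # wrap into 16 bits
            out += bytes([TOK_INT16, n & 0xFF, n >> 8])   # little-endian
        else:
            raise ValueError("unhandled word: " + word)
    return bytes(out)
```

Spaces vanish during tokenization, exactly as described: the runtime just reads a token byte, knows how many payload bytes follow, and never re-parses digits.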


 


2 hours ago, TomXP411 said:

 

This is not far off from what I suggested a while back. Tokenize not just the keywords (PRINT, GOTO, etc) but also variable names and numeric literals. [...]

Yes, this is very similar to what I did for PCBoard, except I was not as concerned as maybe I should have been about token size. I created two-byte tokens (it was for 16-bit DOS systems, after all). Every statement the language supported was a two-byte positive number (no zero). Every function or operator that the expression engine supported was a two-byte negative number. Variables were just an index into an array of dynamically typed values. Zero was used to mark the end of an expression wherever one was valid. So a PRINT statement that looked something like this:

PRINT a, b*2, c+3

Would be tokenized as:

$0001 $0003 $0001 $0000 $0002 $0003 $FFFF $0000 $0004 $0005 $FFFE $0000

Token meanings are:

  1. PRINT statement
  2. Count of expressions to evaluate for PRINT
  3. Index of variable a in vartab (pushed on eval stack)
  4. End of first expression (pop result from eval stack and print it)
  5. Index of variable b in vartab (pushed on eval stack)
  6. Index of constant 2 in vartab (pushed on eval stack; I stored constants as variables with impossible to generate names for runtime simplicity and stack evaluation consistency)
  7. Multiplication operator (pop two values, multiply them, push result)
  8. End of second expression (pop result from eval stack and print it)
  9. Index of variable c in vartab (pushed on eval stack)
  10. Index of constant 3 in vartab (pushed on eval stack)
  11. Addition operator (pop two values, add them, push result)
  12. End of third expression (pop result from eval stack and print it)

The token values are not exact, but they illustrate the idea. My tokenized / pcode scripts could use an entire 64 KB for tokens, or 32,768 tokens max, plus extra memory for the vartab. Each token was stored at a given offset in the token array, so GOTO/JMP (whether used explicitly or as part of a structured construct) would just update the PC to the new token index.
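That walk can be re-created in a few lines of Python. The token values are the illustrative ones from the list above, not the real PCBoard opcodes, and the vartab here is just a dict of index-to-value:

```python
# Toy PCBoard-style pcode walk: positive tokens after the statement header
# are vartab indices, negative tokens are operators, and 0 pops the
# finished expression off the eval stack and prints it.

STMT_PRINT = 1
OP_MUL, OP_ADD = -1, -2

def run_print(tokens, vartab, out):
    """Execute one PRINT statement starting at tokens[0]; return new PC."""
    assert tokens[0] == STMT_PRINT
    count = tokens[1]              # number of expressions to evaluate
    pc, stack, printed = 2, [], 0
    while printed < count:
        t = tokens[pc]; pc += 1
        if t > 0:                  # vartab index: push its value
            stack.append(vartab[t])
        elif t == OP_MUL:
            b, a = stack.pop(), stack.pop(); stack.append(a * b)
        elif t == OP_ADD:
            b, a = stack.pop(), stack.pop(); stack.append(a + b)
        else:                      # 0: expression done, print the result
            out.append(stack.pop()); printed += 1
    return pc
```

With a vartab of {1: a, 2: b, 3: constant 2, 4: c, 5: constant 3}, the twelve-token stream for `PRINT a, b*2, c+3` evaluates each expression on the stack and prints three results, just as the numbered list describes.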

Obviously, on an 8-bit system, you'd want more compact tokens. And for an interactive interpreter, you'd have to store them more like you suggested, so that as lines were added, edited, and deleted, the program would still be runnable. My language had a script compiler that translated a text file into the tokenized form, which meant I didn't have to worry about LISTing it back out later (though some enterprising third party did reverse engineer it and create a decompiler for it, which was able to do a great job, as I did not attempt to write an optimizing compiler). I remember the README file that came with it being critical of how simplistic it was and how it should have been done differently, generating machine code, but that wasn't the point. We wanted to let people customize the system. And later, when we ported PCBoard to OS/2, most scripts continued to run unmodified because we didn't tie them to a particular machine representation. The only exceptions were when people used PEEK, POKE, and similar statements, because... I was able to do it, so why not? 🙂

https://www.99-bottles-of-beer.net/language-ppl-562.html gives a longer if not completely useful example.

Edit: Note that I started off with one byte tokens. The initial ambition was to create a simple scripting language. As things happen at times, as people saw it able to do more and more, a lot of "what if we added..." features crept in. Enough to overflow a byte. A more complex encoding would have been possible, but I was still a fairly young developer (I was about 25 when I did this project) and either didn't consider it and/or didn't have the time to rework the portions I had already written.

Edited by Scott Robison

4 hours ago, Scott Robison said:

Edit: Note that I started off with one byte tokens. The initial ambition was to create a simple scripting language. As things happen at times, as people saw it able to do more and more, a lot of "what if we added..." features crept in. Enough to overflow a byte. A more complex encoding would have been possible, but I was still a fairly young developer (I was about 25 when I did this project) and either didn't consider it and/or didn't have the time to rework the portions I had already written.

Yeah, it would help to be a little more dense on an 8 bit computer. Even 4 escape tokens would allow for 1024 commands, on top of 124 "prime" commands (ones with a single byte token), while keeping 32-127 as ASCII characters. 
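The arithmetic behind those counts, as a quick sketch:

```python
# Byte values 32-127 pass through as plain ASCII text; 128-255 form the
# token space. Reserving 4 of those as escape prefixes (each followed by
# a second byte) leaves 124 single-byte "prime" commands and adds
# 4 * 256 = 1024 two-byte escaped commands.

ASCII_RANGE = range(32, 128)   # plain characters, untokenized
TOKEN_RANGE = range(128, 256)  # single-byte token space
NUM_ESCAPES = 4                # escape prefixes within the token space

prime_commands = len(TOKEN_RANGE) - NUM_ESCAPES   # single-byte tokens
escaped_commands = NUM_ESCAPES * 256              # two-byte tokens
```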

Edited by TomXP411

Bank switching is workable. The 6502 BASIC is designed to be bank-switched in 8k modules in any arrangement you fancy (or not at all), and there's only a slight speed hit, slightly worse on an original 6502, which doesn't have jmp (aaaa,x). The modules are:

  • the inline 65C02 assembler
  • error handling
  • device-specific stuff: I/O, files, etc. Some of it is standardised, like text I/O; some is system specific, like sprite and sound functionality.
  • system specific extensions - Sprites, Sounds, VPoke, that sort of stuff.
  • floating point (not actually written but the hooks and conversion stuff is all there)
  • Input/Output/Command Line/Program editing stuff
  • String handling & Management
  • Tokeniser/Detokeniser
  • Variable Management
  • Core (everything else)

Most of the time it lives in the core, which contains most of the actual program execution code and the 32-bit integer arithmetic. For the other commonly called internal things (variable management, strings and FP) the overhead of the switching is minimal by comparison, and the rest of it doesn't really affect the speed of the programmes. There is a small amount of code duplication where I decided it was worth duplicating functionality rather than switching back to another module, but very little.

 


7 hours ago, paulscottrobson said:

Bank switching is workable. The 6502 BASIC is designed to be bank switched in 8k in any arrangement you fancy (or not at all) and there's only a slight speed hit [...]

 

Wow, that's actually quite mature already. Is there a 'howto' to try it out on the X16 emulator? Is it a custom ROM or something?

