
New community dev tool uploaded: BASIC PREPROCESSOR


Recommended Posts

BASIC PREPROCESSOR


BASIC PREPROCESSOR allows one to create Commodore BASIC programs with a normal text editor without line numbers. Features:

  1. Much as strings begin and end with a quotation mark ("), macro constructs begin and end with a commercial at sign (@). This means that you cannot include @ in a macro, but otherwise any character may be used.
  2. A label can be defined on a line by itself as @+LABEL@.
  3. A label can be referenced after a GOTO or GOSUB as @-LABEL@ (including ON statements).
  4. A long variable name can be used as @!NAME@.
  5. A preprocessed comment can be used as @' whatever text you want '@. These comments are not written to the PRG file.
  6. Any leading whitespace on a line is removed before writing the code to the PRG file.
  7. The preprocessor (probably) requires an emulator built from the master github branch.

The program is written almost completely in BASIC. The one exception has to do with tokenization. Normally as you enter lines of BASIC the computer will translate them into a compressed tokenized form, and this is necessary for the programs to be usable. In order for BPP.PRG to create tokenized BASIC programs, it has a small machine language routine in golden RAM that converts from plain text to tokenized form. The tokenized form is written to the output PRG file.

Here is a super simple example called SIMPLE.BPP.


@' THIS IS A COMMENT '@

@'
   THIS IS
   A MULTILINE
   COMMENT
'@

@+LOOP@
  IF @!COUNT-VAR@ > 10 THEN PRINT "DONE COUNTING": END
  GOSUB @-INC-VAR@
  GOTO @-LOOP@

@+INC-VAR@
  @!COUNT-VAR@ = @!COUNT-VAR@ + 1
  RETURN

An animated GIF demonstrates the process of using the program.


1 hour ago, mobluse said:

I think this is great. Does it handle labels in ON...GOTO and ON...GOSUB?

How do you enter the BPP files? since the built in editor requires line numbers.

It should, though I've honestly not tried it yet. Just the simplest stuff. Let me check! {time passes} Yes!

The program works by looking for @ symbols. Just as a normal string is delimited with quotation marks, labels are delimited with @ symbols in my preprocessor. So when it encounters a label outside a string, it replaces it with the line number (or something approximating the line number).
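The scanning scheme described above can be sketched in Python. This is a hypothetical illustration of the delimiter logic, not the actual BASIC code:

```python
def substitute_labels(line, labels):
    """Replace @-NAME@ references with their line numbers, leaving
    anything inside a quoted string untouched."""
    out = []
    in_string = False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == '"':
            in_string = not in_string
            out.append(ch)
            i += 1
        elif ch == '@' and not in_string:
            end = line.index('@', i + 1)  # labels are self-delimited by @
            out.append(str(labels[line[i + 2:end]]))  # skip the '-' marker
            i = end + 1
        else:
            out.append(ch)
            i += 1
    return ''.join(out)

print(substitute_labels('GOTO @-LOOP@', {'LOOP': 10}))     # GOTO 10
print(substitute_labels('PRINT "@-LOOP@"', {'LOOP': 10}))  # string left alone
```

The same trusting assumption applies here as in the real scanner: every @ outside a string is expected to have a matching @ on the same line.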

As for how to enter a BPP file, yes, you'd need a text editor like x16 edit or some such. This is the one place I cheated. x16 edit isn't working for me (I suspect because I'm running a bleeding-edge emulator I built after cloning the GitHub repo). So I used a text editor on my Windows machine, saved the file into an SD card image, and then ran it from there.

So this is probably not ready for prime time if one wants to use it in r38. It is just barely ready if you are using a bleeding-edge emulator. Or at least it is for me.


Obviously there are better ways to do this, but back when I had my C=64, I didn't have the benefit of an extra computer on which to do dev work. I'm already cheating a little bit there as I admit above using an external text editor, but I could have created the text file without it, so I'm allowing it. Text editing is the only external task I'm allowing myself thus far.

So my first substantive program is BASIC PREPROCESSOR. It is pure BASIC plus one ML routine, which I wrote entirely with the MONITOR in the emulator and then copied the bytes into DATA statements.

My second program will be BASIC EDIT. Not a competitor to x16 edit, but something very simple that can edit small text files. I will write it in BASIC PREPROCESSOR syntax so it will serve as the first "big" example of BASIC PREPROCESSOR. My intent is to write the smallest possible editor I can that allows me to add text, remove text, save files, and load files. Once I have that done I will try to do all my dev "natively" in the emulator (which is a contradiction in terms, but it is hopefully clear enough in context).

I do this not because it is the "best way" ... just because if I'm going to go retro, let's go retro!


I think this is an interesting solution to the shortcomings of the built-in BASIC.

I haven't had time to try it out yet, but I will.

As to the source code file format of any programming language that is developed for the X16 - be it BASIC, FORTH or assembly - it would be great if we used a common standard so that the source code may be edited in any present or future editor available on the platform.

The plain text PETSCII or ASCII file is the reasonable solution in my mind.

Plain text file formats are, however, not exactly the same on different computer platforms. This is most evident when thinking about line break encoding. We have at least the LF (ASCII 10) used in today's Linux/macOS, the CR (ASCII 13) used by classic Mac OS, and of course the CRLF (ASCII 13+10) used by Windows.

Commodore 8 bit computers did not recognize ASCII control character 10. Even though there were a lot of custom solutions, the closest we have to a standard for line break encoding on Commodore machines is a single CR. That is also used by the VolksForth compiler that is available to us.


6 hours ago, Stefan said:

I think this is an interesting solution to the shortcomings of the built-in BASIC.

I haven't had time to try it out yet, but I will.

As to the source code file format of any programming language that is developed for the X16 - be it BASIC, FORTH or assembly - it would be great if we used a common standard so that the source code may be edited in any present or future editor available on the platform.

The plain text PETSCII or ASCII file is the reasonable solution in my mind.

Plain text file formats are, however, not exactly the same on different computer platforms. This is most evident when thinking about line break encoding. We have at least the LF (ASCII 10) used in today's Linux/macOS, the CR (ASCII 13) used by classic Mac OS, and of course the CRLF (ASCII 13+10) used by Windows.

Commodore 8 bit computers did not recognize ASCII control character 10. Even though there were a lot of custom solutions, the closest we have to a standard for line break encoding on Commodore machines is a single CR. That is also used by the VolksForth compiler that is available to us.

I agree that a standard would be good. If you download my plain text BPP.BAS file and look at it, you'll see it is "ASCII/PETSCII" compatible, with the one exception that it uses CRLF line endings because I edited it on Windows. My solution, since line endings weren't a huge consideration for me, is to treat CR *or* LF as a line ending character. So when preprocessing BASIC text that originated on Windows, I wind up with line numbers 0, 2, 4, 6, etc. Had the file used just one or the other, the line numbers would have been consecutive (0, 1, 2, 3, etc.), except for empty lines, which I do not emit.
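A minimal Python sketch (hypothetical, not the BASIC implementation) shows why CRLF input produces even line numbers under this scheme:

```python
def number_lines(text):
    """Treat CR *or* LF as a line terminator, number every line,
    and emit only the non-empty ones."""
    numbered = []
    for n, line in enumerate(text.replace('\r', '\n').split('\n')):
        if line:  # empty lines consume a number but are not emitted
            numbered.append((n, line))
    return numbered

# Each CRLF pair ends one real line and one empty line, so the
# surviving lines land on even numbers.
print(number_lines('A\r\nB\r\nC\r\n'))  # [(0, 'A'), (2, 'B'), (4, 'C')]
print(number_lines('A\rB\rC'))          # [(0, 'A'), (1, 'B'), (2, 'C')]
```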

Because I've not tested it extensively, I am not sure what it would do with some corner cases, such as a non-empty line that contains only space characters. I think, because of the way I leverage the BASIC crunch routine to tokenize the line, it might skip that, but it wasn't important to my proof of concept so I didn't dig deeper, especially since the program depends on the bleeding-edge GitHub build, I think.


7 hours ago, Stefan said:

Plain text file formats are, however, not exactly the same on different computer platforms. This is most evident when thinking about line break encoding. We have at least the LF (ASCII 10) used in today's Linux/macOS, the CR (ASCII 13) used by classic Mac OS, and of course the CRLF (ASCII 13+10) used by Windows.

Commodore 8 bit computers did not recognize ASCII control character 10. Even though there were a lot of custom solutions, the closest we have to a standard for line break encoding on Commodore machines is a single CR. That is also used by the VolksForth compiler that is available to us.

Given that the platform is embracing more characters than the 8-bits of old, I think the right "text file source code EOL standard" should be "one or more characters in the set CR or LF". So it could handle plain old 8-bit CR terminated lines, Unix-y LF terminated lines, and DOS/Windows CRLF lines. Then tokenizers / parsers could easily skip blank lines as meaningless (unless of course someone decided that a blank line should be a syntactic construct, in which case they'd want to be more judicious).

As for ASCII vs PETSCII, it would be nice if there was some sort of a BOM character like exists for Unicode that could be used as the first character in a file to identify the encoding.

For those who do not know (I'm not trying to talk down to anyone, we just all approach this with different backgrounds), original Unicode was a strictly two byte per character encoding. There was no UTF-8. The problem presents itself: Are my characters in little endian or big endian order? U+FEFF was defined as a "Zero Width No-Break Space" character which means it is just white space, so easily ignored by most language processing software. U+FFFE (the reversed form of U+FEFF) was defined at some point as "noncharacter" that should not appear in unicode text. So U+FEFF became the simple way to determine which character encoding was in use.
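As a concrete illustration, BOM sniffing for UTF-16 amounts to checking the first two bytes. This is a hypothetical helper, shown only to clarify the mechanism:

```python
def utf16_byte_order(data):
    """Sniff UTF-16 byte order from a leading U+FEFF BOM."""
    if data[:2] == b'\xfe\xff':
        return 'big-endian'     # FE FF: U+FEFF stored high byte first
    if data[:2] == b'\xff\xfe':
        return 'little-endian'  # FF FE: U+FEFF stored low byte first
    return 'unknown'

print(utf16_byte_order(b'\xfe\xff\x00A'))  # big-endian "A"
print(utf16_byte_order(b'\xff\xfeA\x00'))  # little-endian "A"
```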

With PETSCII vs ASCII, we don't have the byte ordering issue, but sniffing the encoding would still be useful. According to https://www.pagetable.com/c64ref/charset/ we have several flavors of SPACE in PETSCII:

$20: Normal Space Character (SP in either ASCII or PETSCII)
$A0: No-Break Space (NBSP in either IEC-8859-15 or PETSCII, the two native encodings on x16)
$E0: No-Break Space (NBSP in PETSCII but a-grave in 8859-15)

None of those are particularly useful for differentiating between ASCII vs PETSCII.

Another solution is what many editors support, which is to include a magic comment as the first line of source code that encodes metadata about the file. I think this is our best bet. In BASIC source code like my BPP.BAS file, I could include a first line like:

REM ENC=PETSCII EOL=CR

To signal the compiler that my file is in PETSCII encoding and uses CR as the end-of-line marker. In C one might create a line like:

/* ENC=8859-15 EOL=LF */

In ASM code maybe:

; ENC=ASCII EOL=CRLF

And so on. I would suggest that the "de facto" standard for x16 source:

1. Looks at the beginning sequence of characters up to the first CR or LF character.
2. Uses unshifted alphabetic characters, so that uppercase ASCII and uppercase PETSCII (in the graphics charset) map to the same character codes $41-$5A. In mixed-case PETSCII these would be lowercase letters.
3. Recognizes ENC=PETSCII, ENC=ASCII, and ENC=8859-15 as valid encodings in all x16-compatible software.
4. Recognizes EOL=CR, EOL=LF, and EOL=CRLF as valid end-of-line types in all x16-compatible software.
5. Limits the valid character set of these NAME=VALUE pairs to the alphabet (codes $41-$5A), digits, the equal sign, and the hyphen, with spaces before and after each pair.
6. Allows for easy extension to include new attributes we might not consider now that would be generally useful, or for individual software to define its own custom NAME=VALUE pairs.
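A sniffer for such a magic first line could be as simple as this Python sketch (the function name and regex are my own, not part of any proposed standard):

```python
import re

def parse_magic_comment(first_line):
    """Extract NAME=VALUE pairs (letters, digits, '=', '-') from a
    magic first-line comment such as 'REM ENC=PETSCII EOL=CR'."""
    return dict(re.findall(r'([A-Z][A-Z0-9-]*)=([A-Z0-9-]+)', first_line))

print(parse_magic_comment('REM ENC=PETSCII EOL=CR'))
# {'ENC': 'PETSCII', 'EOL': 'CR'}
print(parse_magic_comment('/* ENC=8859-15 EOL=LF */'))
# {'ENC': '8859-15', 'EOL': 'LF'}
```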

These are just stream-of-consciousness ideas that do not obligate anyone to define a rigidly enforced standard. But it could be useful.


Yes, it's not necessarily easy to get this right.

I think a magic comment would work fine for your BASIC preprocessor. That is, if you decide to support different options.

I'm sure there are valid historical reasons for the double byte CRLF on Windows, but it's hard to see the benefit of that encoding today. It only makes parsing the file more complicated in my opinion.


4 hours ago, Stefan said:

Yes, it's not necessarily easy to get this right.

I think a magic comment would work fine for your BASIC preprocessor. That is, if you decide to support different options.

I'm sure there are valid historical reasons for the double byte CRLF on Windows, but it's hard to see the benefit of that encoding today. It only makes parsing the file more complicated in my opinion.

History. Mechanical printers / teletypes often required a CR to return the carriage (the print head did not necessarily move; the paper did, as in an ancient typewriter), then an LF to scroll the paper one more line. Windows does it because DOS did it. DOS did it because CP/M did it. CP/M didn't translate control codes from an abstract set of commands to a device-specific set, which is why many printers in the day had jumpers you could set to customize how they would respond to things like ESC sequences, CR, LF, etc.

I actually still use stand alone CR frequently when writing console mode / terminal mode programs to continuously update the same line with updated status information in a long running process.


30 minutes ago, Ed Minchau said:

You're right, that is an ugly-but-simple syntax.  It could be prettier and simpler though, perhaps something like this:

GOSUB [QUERYNAME]

[+LOOP]

PRINT "HELLO ";N$;"! HOW ARE YOU?       ";

GOTO [LOOP]

[+QUERYNAME]

INPUT "WHAT IS YOUR NAME";N$

RETURN

That is also another way it could be done. I was looking for a "self delimiting character" so I didn't want to use brackets, though I agree they are more readable. My scanning code is so simple and it trusts that @ is matched with @, " is matched with ", and that anything between the open and close character is part of the label. I could use apostrophe instead of @ as a "more readable self delimiting character".


1 hour ago, Scott Robison said:

That is also another way it could be done. I was looking for a "self delimiting character" so I didn't want to use brackets, though I agree they are more readable. My scanning code is so simple and it trusts that @ is matched with @, " is matched with ", and that anything between the open and close character is part of the label. I could use apostrophe instead of @ as a "more readable self delimiting character".

Maybe just use the + sign when declaring the label, and not use the minus? The easier it is to use, the more it will be used.


Perhaps. It's my own brand of OCD: I feel like the first character should define the type of the "name" (definition or expansion). I was thinking about possibly other ones as well; this is just as far as I got when I decided to share it. "+" to add to the symbol table, "-" to look up in the symbol table. Maybe "!" to declare a long variable name that can be mapped to a short two-character name.

It's very much a thought exercise seeing just how far I can get with pure BASIC, which I already violated to call the crunch routine in ML. 🙂


On 4/21/2021 at 12:05 PM, Scott Robison said:

My second program will be BASIC EDIT. Not a competitor to x16 edit, but something very simple that can edit small text files. I will write it in BASIC PREPROCESSOR syntax so it will serve as the first "big" example of BASIC PREPROCESSOR. My intent is to write the smallest possible editor I can that allows me to add text, remove text, save files, and load files. Once I have that done I will try to do all my dev "natively" in the emulator (which is a contradiction in terms, but it is hopefully clear enough in context).

I actually started a vi-like editor, in BASIC, in February.  I dinked with it for two days then put it on the shelf.

 

It uses the left-arrow as the escape to command mode.  Which I had forgotten, and therefore is probably a bad idea.

 

On the cutesy side, I apparently named it "xvi", which is therefore a pun both on vi and on the x16.

 

 

 


49 minutes ago, Scott Robison said:

Perhaps. It's my own brand of OCD feeling like the first character defines the type of the "name" (definition or expansion).

Maybe you're an LL(0) kind of guy.

That's my preference as well.  Simplifies the parsing logic, without actual pain for the developer.


10 hours ago, Scott Robison said:

As for ASCII vs PETSCII, it would be nice if there was some sort of a BOM character like exists for Unicode that could be used as the first character in a file to identify the encoding.

I think..... just off the top of my head..... there are characters that are non-printable codes in one but printable in the other.... if the file is on the X16 then, surely there's a character that renders "wrong" (for some value of "wrong") if it's ASCII..... and therefore would never have a right to exist in a text file... hrmmmmmm.....

Establish guardrails, such as "this must be a text file", and then you'll be able to detect whether or not it's PETSCII by whether or not that first character is actually PETSCII printable.  If it ain't, then assume that's a flag indicating the rest of the file is ASCII...? 

 

For example, ASCII 15, 16, 21, 22, 23, 25, 26 appear to be printable, but not in PETSCII.


32 minutes ago, rje said:

A thoughtful feature I like about this are the REM statements that show where the labels were.

 

As it turned out, that was a lazy hack on my part that wound up being useful. Since a label may be defined on a line by itself (or not), there needs to be a line number to associate with the label, and I don't really know at parse time what the next generated line number will be. So rather than making the system more complex, I just had it generate a REM so that I could target that line.
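The hack can be sketched as the first pass of a hypothetical Python rendering (the names here are invented; the real preprocessor is BASIC):

```python
def assign_line_numbers(source_lines):
    """Pass 1: emit a REM placeholder for each @+NAME@ label so the
    label resolves to a concrete line number."""
    labels, output = {}, []
    number = 0
    for line in source_lines:
        line = line.strip()
        if not line:
            continue  # empty lines are not emitted
        if line.startswith('@+') and line.endswith('@'):
            name = line[2:-1]
            labels[name] = number
            output.append((number, 'REM ' + name))  # placeholder target line
        else:
            output.append((number, line))
        number += 1
    return labels, output

labels, out = assign_line_numbers(['@+LOOP@', 'PRINT "HI"', 'GOTO @-LOOP@'])
print(labels)  # {'LOOP': 0}
print(out)     # [(0, 'REM LOOP'), (1, 'PRINT "HI"'), (2, 'GOTO @-LOOP@')]
```

A second pass would then rewrite the @-LOOP@ references using the collected line numbers.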


13 minutes ago, rje said:

I think..... just off the top of my head..... there are characters that are non-printable codes in one but printable in the other.... if the file is on the X16 then, surely there's a character that renders "wrong" (for some value of "wrong") if it's ASCII..... and therefore would never have a right to exist in a text file... hrmmmmmm.....

Establish guardrails, such as "this must be a text file", and then you'll be able to detect whether or not it's PETSCII by whether or not that first character is actually PETSCII printable.  If it ain't, then assume that's a flag indicating the rest of the file is ASCII...? 

 

For example, ASCII 15, 16, 21, 22, 23, 25, 26 appear to be printable, but not in PETSCII.

There might be some terminals or systems that render ASCII or 8859-15 control codes, but technically by the standard they have no visible representation. $00-$1F and $7F in ASCII (and $80-$9F in 8859-15) are not printable in that context. Even though many of them aren't valid C source characters, for example, people will often include them anyway because compilers will not complain (it is either undefined or implementation-defined behavior; I can't remember which off the top of my head).

That's not to be confused with the screen codes, which are distinct from the XSCII encoding.


37 minutes ago, rje said:

I actually started a vi-like editor, in BASIC, in February.  I dinked with it for two days then put it on the shelf.

It uses the left-arrow as the escape to command mode.  Which I had forgotten, and therefore is probably a bad idea.

On the cutesy side, I apparently named it "xvi", which is therefore a pun both on vi and on the x16.

You and I are kindred spirits, because I was thinking of a pseudo vi-inspired editor. I'm not a fan of vi, really, so I'm not sure, but it was my thought. I was thinking of using the back arrow or the British pound character for ESC if I went that route.

And I do enjoy the punny name. It's not so punny that I have to groan, and it is cutesy.


Yeah, that's how I approached it, as well.  I liked the tidy separation of "colon-command" mode from edit mode, but I wasn't trying to be slavishly recreating a vi thing.

My code doesn't let the user cursor around the screen, so for instance I can't go back and erase part of a line or insert a new line, etc.  I'd have to add that in...

I'm not sure, but I think the X16 can't currently write and read SEQ files in BASIC.  So I didn't get to the point of reading and writing.  Unless that's been fixed.

I see I have a 256-element string array as a buffer.  One string per line.  Not sure if that's the best way to do it.  Would an array of words be better?  

Alternately, I think using RAM banks as a buffer would be stupendous.  I love those RAM banks.  Memory management would be a pain, though.

 


1 hour ago, rje said:

Yeah, that's how I approached it, as well.  I liked the tidy separation of "colon-command" mode from edit mode, but I wasn't trying to be slavishly recreating a vi thing.

My code doesn't let the user cursor around the screen, so for instance I can't go back and erase part of a line or insert a new line, etc.  I'd have to add that in...

I'm not sure, but I think the X16 can't currently write and read SEQ files in BASIC.  So I didn't get to the point of reading and writing.  Unless that's been fixed.

I see I have a 256-element string array as a buffer.  One string per line.  Not sure if that's the best way to do it.  Would an array of words be better?  

Alternately, I think using RAM banks as a buffer would be stupendous.  I love those RAM banks.  Memory management would be a pain, though.

My limited experience with it so far agrees with you that SEQ doesn't seem to be a thing yet, and that's using the bleeding edge github source.

My thought had been to use the RAM banks for text. For simplicity, I would probably divide them into fixed slots of the maximum line length I support, and avoid strings. I fear garbage collection in this version of BASIC if most keystrokes wind up destroying and creating multiple strings on the heap. If I were going to use actual BASIC strings, though, I think I would use one array element per line, not per word.


You're right about the RAM bank thing: the data just represents 80 characters per line, all lines accounted for.  The editor itself can track "actual" line lengths for when the thing is written to disk.

Then you allocate one bank at a time, with each bank representing approximately 100 lines.
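With 8 KB banks (the X16's banked RAM window at $A000) and fixed 80-byte line records, the arithmetic works out roughly like this. The constants are assumptions for illustration, not rje's actual layout:

```python
BANK_SIZE = 8192   # one X16 banked RAM page ($A000-$BFFF)
LINE_LEN = 80      # fixed-size record per text line
LINES_PER_BANK = BANK_SIZE // LINE_LEN  # 102 full lines fit per bank

def locate(line_number):
    """Map a text line to (bank, address within the $A000 window)."""
    bank, slot = divmod(line_number, LINES_PER_BANK)
    return bank, 0xA000 + slot * LINE_LEN

print(LINES_PER_BANK)  # 102
print(locate(0))       # (0, 40960) i.e. bank 0, $A000
print(locate(150))     # (1, 44800) i.e. bank 1, $AF00
```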

 

 

