Jump to content
  • 0

Small Compiler Design?



I'm currently writing a tokenizer, parser and bytecode interpreter, in C, using cc65.  

The target is a shell (almost just a REPL) with pipes, structured control flow scripting, untyped variables, and (probably) user functions.  With an ALGOL-descended syntax.

This is the first one I've written, and the programmer's motto is "plan to throw the first one away".  In other words, I will learn how things are done, and how not to do them for 8 bit machines.

I learned quite a lot from the tokenizer, and made some design modifications to use the X16's banks.  I LOVE those RAM banks.

I'm building up the interpreter as the compiler grows.  The interpreter seems to be rather straightforward.  Making it smaller *typically* means removing opcodes, which makes the VM slower.

So that leaves the compiler.  I suspect there are tricks I can do which can shrink the compiler.  For example, designing the grammar for semantic analysis... of course I don't want it to be LISP or FORTH, so my constraints involve ALGOL-descended scripting languages.  And, perhaps, shoving appropriate work into non-sparse tables.  I could also remove language features, but that gets painful fast.

But your experience, suggestions and comments are welcome.


  • Pipes |, >, and <.
  • Two data types: long integers and strings.  No floats.
  • I haven't decided how to handle arrays.  I like the CSH approach of giving *every* variable an indexed array of values.
  • Sigils to indicate variables ($) and perhaps function references (&).
  • Minimal control flow IF, FOR, WHILE, maybe UNTIL.  These need block indicators.  Maybe just END.
  • Logicals && || ! 
  • Long integers and integer maths + - * / % ^ += -=
  • Arithmetic comparison > >= == != <= <.  Careful distinguishing these from pipes!
  • String comparison eq, ne, gt, gte, lt, lte.
  • String catenation.
Edited by rje
Link to comment
Share on other sites

3 answers to this question

Recommended Posts

  • 0

Hi Rje,

I have nowhere near as much knowledge in this area as you, and I have never dealt with a lot of the things you have mentioned (e.g. Algol and Forth). I find it interesting to follow your project.

There are two things that came to my mind.

1. Use hashes instead of string literals.
When tokenizing an input line of code, one has to compare if there are any matches between keywords of your language and the input text. Instead of storing long string literals, one could simply store a hash value of that string and compare it to hashes of input words. This is a trick I have seen in the PC-demoscene, where they replace long library names/paths by hashes and thus save valuable space. Whether or not this approach pays off, may depend on how much literal string data you will end up having in your program. One additional pitfall is, that there *could* be duplicate hashes. So you would have to check that among all valid keywords, there won't be any duplicate hashes.

2. Jump tables
I'm sorry for writing about assembly stuff, when you are doing C. At least there is the possibility to do inline asm, but might be tricky for what I'm suggesting. Anyway, this technique has been very useful to me so far, and enables extremely compact 6502 code.

Suppose you have a number from 0 to N-1 which tells us, we should do one of N possible things. A bit like a "switch" statement in C.

   lda my_switch  ; load the variable which tells us what to do
   asl            ; multiply with 2, so we can index 16 bit addresses
   tax            ; put it into indexing register
   jmp (jmp_table, x)
   .word subroutine_1
   .word subroutine_2
.word subroutine_3
   ; continue your program
   ; ...

   ; do something 1
   jmp return_here  

   ; do something 2
   jmp return_here
   ; do something 3
   jmp return_here

I was wondering if this could be beneficial for your interpreter of bytecodes, where such situations could happen a lot.

(There is even a way to do this kind of thing for "proper" subroutines (notice how, instead of using RTS, I explicitely jump back to the desired address). Before doing the indirect JMP, one has to push the return address minus 1 to the stack. This emulates a JSR, which enables the use of RTS to do the return)

EDIT: sorry for messy code ... don't know why it appears differently in the final view than in the editor.

Edited by kliepatsch
  • Like 1
Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Answer this question...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.


  • Create New...

Important Information

Please review our Terms of Use