
BASIC: Converting/Optimizing a simple BASIC program from another Commodore platform to the X16...



Hi folks.   I thought it might be fun to do a sort of "walk-through" post as I go about converting a short and simple BASIC program from another Commodore machine to the X16.   

Before I start, let me be really clear:   This is going to be very entry level stuff, and probably over-explained.    I'm trying to go for something that will be accessible to folks who are not experts but want to look into the X16.   If people find this useful, I'll consider walking through a more difficult conversion.    But the point of this initial write-up is to show how easy it is to convert quite a lot of the BASIC software out there, while also addressing some typical and important ways to optimize BASIC for performance.  

EDITED TO ADD:   I also want to make a more general disclaimer.   This entire exercise is occurring using the X16 emulator as a stand in for the platform.   Also, the ROM and architecture are not yet completely final.   Things could change.   The emulator could be structured in such a way that it does not represent real world performance of what the X16 hardware would do.  Don't expect the 'time' values from our optimizations to be exactly how this same BASIC program performs on the final X16 or any future emulator release.  

For this exercise I thought it would be fun to convert something from the Commodore Plus/4, which (like the X16) had some added bitmap graphics commands so you did not have to use POKEs as on the C64.  Browsing the (very, very good) 'Plus/4 World' web site, I saw this screenshot of what seems like a nifty plotting routine that should work on the X16:   


So as a first step, let's fire up a Plus/4 emulator (VICE or PLUS4EMU are fine) and run the program on the original intended platform to make sure it works.    Wait:  What's this? 


Oh.   OK, it's apparently from the DANISH version of RUN magazine, circa 1984.   My mediocre powers of deduction (i.e., simply seeing the characters '3D') were sufficient to ascertain that option 3 was the correct one, and I then went with what appeared to be the recommended parameters in a couple of follow-up 'INPUT's put up by the program after that.   

Voilà!  The Plus/4 emulator started plotting the same graphic depicted above.   

And.....  wow  is it SLOW on the Plus/4!    I mean  really, REALLY slow.   As in just a few pixels plotted per second.       

After crying uncle and hitting RUN/STOP, I took a look at the program listing and found that only lines 0-9 and 1000 are used for the routine that plots the output in the first screenshot above.  That's the part we'll convert to the X16, if we can.   (The other parts of the original program are simple 2D coordinate plotting based on the same formula shown in that first screenshot.)    I deleted the program lines for the other routines and menus, added a 'GOTO' to flow things directly to the routine we're converting without having to pick it from the (now deleted) foreign-language menu, and replaced a couple of 'INPUT' statements asking for parameters with code that just sets those variables to constants using the recommended values from the original listing (see lines 3 and 4 below).  Finally, I added commands to reset the system 'TIME$' (i.e., TI$) variable just before the start of the main output loops, and to print the elapsed time held in that variable at the end of the drawing.  This will show us exactly how painful this was to run on the original Plus/4 hardware and, more importantly, it will greatly assist us in benchmarking our performance optimizations on the X16 version later.     
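In skeleton form, the timing harness amounts to the following (the line numbers here are hypothetical, just to show the placement; the real listing appears in the screenshot below):

	10 TI$="000000"  :REM RESET THE SYSTEM CLOCK JUST BEFORE THE MAIN LOOPS
	20 REM ... MAIN DRAWING LOOPS GO HERE ...
	30 PRINT TI$     :REM ELAPSED TIME IN HHMMSS FORMAT

The TI$ variable ticks up automatically while the program runs, so bracketing the loops this way gives an elapsed-time reading with no extra bookkeeping.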

Here's what that BASIC listing for the Plus/4 version (after chopping it down to the essentials as described above) winds up looking like so far on the Plus/4 emulator:


OK, let's just do a quick feasibility / sanity check at this point.    In other words, CAN this be easily converted to the X16?    Looking at the listing, I see the Plus/4's 'DRAW' command (and its 'DRAW ... TO ...' variant) used to make pixels and lines.   We can do that on the X16 with 'PSET' and 'LINE'.    I see the 'GRAPHIC' and 'SCNCLR' commands, which set up and clear the Plus/4 bitmap screen, and we have 'SCREEN $80' to accomplish the same on the X16.   Finally, I see the 'COLOR' command used by the Plus/4 to set background, text, bitmap pixel 'color source', and border colors. Although the X16 handles bitmap colors differently, there's nothing so far that lacks directly parallel (or close enough) functionality on the X16.   

More important, of course, are the things we DON'T see in the listing above.   I don't see any real headaches or showstoppers in the Plus/4 listing.   There are no 'SYS' commands implying machine code or ROM routines; there's nothing poking a bunch of 'DATA' into memory or poking at platform-specific hardware registers (e.g., the TED chip on the Plus/4), either.   There aren't graphics commands (such as circle or flood fill) that might be infeasible or slow/painful to implement on the X16 using just BASIC commands and without getting deep into some VERA/API expertise and lore.  

Looks doable.    Just for fun, I turned on the turbo mode of the Plus/4 emulator and let the routine listed above proceed all the way to the end.   Even with TURBO on I had time to go get a cup of coffee AND a refill.  But in the end, I can report that:

(a) the paring down of the program to the listing reflected above didn't break things on the Plus/4 emulator (I got the same output as the screenshot up top of this post); and, 

(b) from the printout of the 'TI$' variable at the end (showing hours, minutes, seconds, in the HHMMSS format), it appears it would take a genuine Plus/4 (or an emulator not running in TURBO mode) over TWO HOURS AND FORTY-TWO MINUTES to plot that screen!   


Yikes!   Luckily, the X16 will be at least 8 times faster just on its CPU clock speed, and very likely even faster than that given the inherent speed of VERA and the skills of our X16 devs in implementing the X16 BASIC graphics command additions.   Also, just glancing at the original Plus/4 listing, I already see a TON of optimizations we can do to speed up this program once we get it working on the X16. 

OK, the next step before starting to adapt this little routine is to actually get the program onto the X16 emulator without having to type it all in.    To do so, I first saved the program as listed above from the Plus/4 emulator in '.prg' format.   Now we have a .PRG we know we can load into the X16 emulator! 


Well, that's it for this opening/introduction post.   In the write ups / replies that follow, I will try to do the following things (in the following order):

1.   Make it work;


2.   Make it fast. 


Edited by Snickers11001001

-- continued -- 

In the Introduction, we identified a short and simple Commodore Plus/4 program for conversion to the X16, tried it out on an emulator for the original platform, and decided it would be feasible for adaptation.   We pared the listing down to just the routine we want, and loaded it up into the X16 emulator.   

To review, the Plus/4 listing looked like this as displayed on the Plus/4 emulator:    


For reasons that will be apparent in a second, we'll need to keep this handy to look at once we're at work on the X16 emulator. I like to code BASIC on the X16 using its 40-column mode.   I could blame my old-dude eyes, but really it's my old-dude nostalgia.    The command to switch to 40-column mode is 'SCREEN $00'.  

Let's do that and then 'LIST' the program as it came into the X16 after being saved from the Plus/4, and see how it looks...


Hmmm... something's.... ODD.  

Compare lines 2 and 6 of the X16 listing with the screen shot from the Plus/4 emulator listing, above.  Weird, right?   Is this data corruption?   Does this mean we need to give up on this approach and simply type the whole listing into the X16 from the Plus/4 screenshot?   

No.   It doesn't.  

Here's what happened:  The X16's BASIC is based on BASIC 2.0, i.e., from the Commodore 64.    Although the X16 devs have added X16-specific graphics extensions, these do not use the same BASIC tokens as those used for 'DRAW' and 'GRAPHIC' and 'SCNCLR' and 'COLOR' on the Plus/4.   See, in BASIC the 'keywords' (i.e., BASIC commands) are not stored as whole words, but are held in program listing memory as 'tokens' -- byte codes -- that the interpreter uses to refer to the commands.   When you 'LIST' a program, it looks up any tokens it finds and prints out their text representations within your listing.   

But since the X16 does not recognize the Plus/4 tokens, its LIST routine sort of coughed up other BASIC commands in the screen output.   It's not a big deal, since we already know we need to replace the Plus/4 commands with X16 versions to get our program working anyway. 

Looking through the rest of the program, it appears that only lines 2 and 6 need to be modified (at least at the outset) to deal with obvious differences between the platforms.  The rest of the program does not appear to contain anything that's not 'vanilla' Commodore BASIC.   By that, I mean to say that these lines are simple 'FOR/NEXT' loops, calculations and program flow control, and should work the same on the C64, the Plus/4, the C128 and, yes, the Commander X16.     As you gain experience, you'll learn to recognize what things are part of the core of BASIC and consistent among the Commodore versions, and what things are modified, added, or handled differently between versions.    This is a good link to help with that:   https://www.c64-wiki.com/wiki/BASIC

Moving on:   In line 2 we need to substitute the X16 BASIC extensions equivalent to the Plus/4 'DRAW' commands along with the comma separated parameters set up correctly for what the X16 is expecting. 

The original Plus/4 code is as follows (I'm adding spaces here just for easy reading, although BASIC still works -- and is faster -- without them):

(Plus/4 Version)
	2  DRAW 1, X1, Y1: DRAW 0, X1, Y1+1 TO X1, 199: RETURN

If you know the Plus/4 (or look up its user manual) you'll find that the first command - 'DRAW 1, X1, Y1' - is set up to plot a single pixel.  The first parameter '1' tells Plus/4's BASIC to use its  'color source 1' (foreground color) to draw that pixel, and the next two parameters are the horizontal (x) and vertical (y) coordinates, which the original author passes to the command using variables called 'X1' and 'Y1'.   

Next we see a slight variant of the Plus/4's 'DRAW' command.  After the parameters already listed (only, this time using the Plus/4's color source '0' -- the background color) we see the 'TO' operator and a couple more parameters.  Again consulting the Plus/4 manual, we find that this causes the command to draw a line using the specified color source, and with the line running from the set of coordinates before the 'TO' operator to the set of coordinates after it.   

The X16 command to draw a single pixel is 'PSET.'   The X16 command to draw a line is, aptly, 'LINE'.    

According to the X16 reference guide (see the DOCS on this webpage), PSET takes three parameters in this order:  x coordinate, y coordinate, color. The X16 'LINE' command takes 5 parameters, like so:  x1,y1,x2,y2,color, where the first x/y pair is the starting position of the line, and the second pair are the line's endpoint.  
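Summarized side by side (the parameter names here are just placeholders; double-check the current X16 docs, since the BASIC extensions are still evolving):

	PSET X,Y,C          :REM PLOT A PIXEL AT X,Y IN PALETTE COLOR C
	LINE X1,Y1,X2,Y2,C  :REM DRAW A LINE FROM X1,Y1 TO X2,Y2 IN COLOR C

So the Plus/4's leading 'color source' parameter moves to the end on the X16, and becomes a palette index rather than a source number.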

A quick note about that X16 'color' parameter in the 'PSET' and 'LINE' commands.   Instead of referring to a 'color source' (with a value set elsewhere) as on the Plus/4, the X16 just expects a number from 0 to 255 (or a variable containing such a number) here, and it uses that number to index into the X16 color palette, performing the 'PSET' or 'LINE' using the color it finds there.   The default X16 color palette looks like this:


The first 16 colors (indexed as values 0 to 15) are intended to sort of approximate the C64/128 palette.  The next 16 colors (values 16 to 31) are a nice greyscale range; and the rest of the values up to 255 are as shown. 
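For instance (a sketch typed from the READY prompt; entry 0 is black, entry 1 is white, and -- assuming the C64-style mapping holds for the first 16 entries -- entry 6 should be a blue):

	SCREEN $80
	PSET 160,100,0       :REM BLACK PIXEL AT SCREEN CENTER, PALETTE ENTRY 0
	LINE 0,0,319,199,6   :REM DIAGONAL LINE IN PALETTE ENTRY 6

Changing only that last parameter walks you through the whole 256-entry palette.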

In order to convert line 2 from the Plus/4 to the X16 we simply put the variables and expressions used by the Plus/4 version for the x and y parameters of its 'DRAW' commands in the right places for 'PSET' and 'LINE' commands on the X16.   And since we've seen the Plus/4 output of this program we know it draws black foreground pixels on a white background.  That means we will use an X16 color parameter value of '0' (black on the X16 palette) for the command where the Plus/4 uses foreground color source, and an X16 color parameter value of '1' (white from the X16 palette) for the command where the Plus/4 listing uses the background color source.  Finally, the 'RETURN' from subroutine command is the same on the X16 as on the Plus/4.  

Thus, our converted line 2 looks like this:

(X16 Version)
	2 PSET X1,Y1,0: LINE X1,Y1+1,X1,199,1: RETURN


The only other line we need to mess with before saving and running this for the first time is Line 6, which on the Plus/4 looks like this:

(Plus/4 Version)
	6  GRAPHIC 2: SCNCLR: COLOR 0,2: COLOR 1,1: TI$="000000"

As before, the X16 was confuzzled by the Plus/4 tokens, since we loaded this directly from a saved Plus/4 PRG file instead of typing it in.  We'll just retype line 6 and the new version will replace the old one.   And this one's super easy:  the Plus/4 reference materials tell us that 'GRAPHIC 2' sets that machine to display its 320x200 hi-res bitmap with a split screen for text (in this mode, the Plus/4 lets you print text output on the bottom five lines, and that many lines' worth of space for text is displayed instead of whatever would be on the bitmap at that part of the screen).  The Plus/4 command 'SCNCLR' clears the screen.

The X16's BASIC 320x200 bitmap mode is activated with 'SCREEN $80'  and, thanks to VERA's layers, the BASIC bitmap can already display graphics *and* 'standard out' text simultaneously.  Also, the 'SCREEN $80' command automatically clears the X16 screen.  We don't need the Plus/4 'COLOR' commands at all because the X16 does colors differently (as noted above) and we already told our X16 'PSET' and 'LINE' routines what to do, above.   Finally, the command I added to reset the 'TI$' variable needs no changes between the two machines. 

So the converted line 6 looks like this:

(X16 Version)
	6  SCREEN$80: TI$="000000"

 Let's take our X16 program and save it now before we give it our first 'RUN' to test things.   Here's how things look: 


Alright, time to 'RUN' this sucker and see if we were able to 'make it work'...  here goes: 


Well, uh...., dang.  That didn't go well.   

The first thing we see is that the X16's combined text and bitmap screen (mode $80) was indeed activated, as indicated by the white background and light blue text.   But what's this "illegal quantity" error message?  

Let's 'LIST' line 2, since that's where the error message says the problem occurred.   We can do that while still in the graphics screen, since it also handles a text layer overlay.   Also, let's use a 'PRINT' from the command line to display the last values of the variables utilized in our line 2.  (NOTE:  When BASIC errors out, any variables that were set by the program up to that point are still in memory.  This is the case until you (a) change, add or re-enter a program line; (b) use the 'CLR' command to clear the variables; (c) use 'NEW' to scratch the BASIC program from memory; or (d) type 'RUN' again, which always starts with a clean slate.)
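Concretely, that's just this, typed at the READY prompt after the error:

	LIST 2
	PRINT X1,Y1

(Those last values survive until one of the events in the NOTE above clears them, which is exactly why this kind of post-mortem works.)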

So here's what we see:


And..... there it is.   The X1 variable is negative, and it appears the X16 will not let you plot off of the bitmap screen.  

Let's back up a second.   Just like on the Plus/4, the coordinates of the 320x200 X16 BASIC bitmap are arranged so that the x axis runs from 0 to 319 horizontally from left to right, and the y axis runs from 0 to 199 vertically from top to bottom.  Thus, coordinates 0,0 are the upper-left-most pixel, and coordinates 319,199 are the lower-right-most pixel.  

On a Plus/4 (and C128, if memory serves) the BASIC bitmap graphics commands were very forgiving.  If you told BASIC to draw something where all or part of your bitmap command would try to put pixels outside the valid bitmap coordinates, the routines would simply not do anything with those out of bounds pixels and would output only the part(s) (if any) of the command that would place pixels within the valid bitmap field coordinates.   So if you tell a Plus/4 to draw a line from coordinates 100,100 to coordinates 1000,1000 it will plot a line from 100,100 right to the edge of the screen at coordinate 199,199 which is the last pixel of that line that is actually in the displayed bitmap field.   But it will just ignore the part of the line you told it to draw beyond that.  

Some other BASIC interpreters with dedicated graphics commands error-out instead, and THAT is clearly what the X16's 'PSET' and 'LINE' commands do when you tell them to put pixels outside of the valid range of bitmap coordinates.  

By the way:  I tend to think the way the X16 does it is probably better for performance, while the Plus/4 / C128 approach is better for beginners.  BOTH are better than the C64, because on that platform you 'POKE' values directly to memory to plot bitmap graphics, and if your program attempted (without bounds checking) to plot a pixel outside the bitmap, you could wind up altering a memory address someplace OTHER than the screen memory, and this might crash or freeze up the computer.

Back to our error:   Because the 'X1' variable was negative, it turns out that the very first command in line 2 asked the X16 to plot a negative x coordinate and the X16 didn't like that.   That negative coordinate was an 'illegal quantity' - thus the error message. 

We'll fix this with some bounds checking.   

Taking a look at line 1 (the line just prior to the instructions that caused our error) we can see something interesting:  


It turns out the author of the original Plus/4 program did try to put in some bounds checking, in the form of that 'IF/THEN' statement.  The intention was clearly that 'IF' the values were out of bounds, 'THEN' it would execute the 'RETURN' command to jump out of the plotting subroutine (lines 1 and 2) and would not proceed to the drawing commands in line 2.   

Interestingly, this was a wasted effort so far as I can tell, and for several reasons.  First, as we have learned, the Plus/4 graphics commands simply ignore out-of-bounds values; second, the author only checked the 'Y' variable (tied to the vertical position); and third, that's the WRONG variable, because the variables that get plotted are actually the ones the author called 'X1' and 'Y1', and those variables are derived from (but not identical to) the variables the author named 'X' and 'Y'.    

We DO need to do bounds checking -- but on the x axis -- to avoid that error in our X16 version, but we've got another problem:    Line 1 is already packed and just about at the 80-character limit.   Ugh.  Ideally, a full 'really safe' bounds check would have the 'IF/THEN' statement evaluate whether 'X1<0 OR X1>319 OR Y1<0 OR Y1>199'.   But there's not sufficient room for all that; and, worse, the original author's numbering scheme doesn't let us simply insert a new line to move that IF/THEN statement to its own line number just prior to the line with the 'PSET' / 'LINE' commands. 

But, look at the screenshot of the output from the Plus/4 once again (keeping in mind that machine was using the same 320x200 sized bitmap as the X16):


That output demonstrates that over the full range of the function this program is plotting, the output doesn't come anywhere near the far right of the screen (i.e., its empirical output does not appear to approach a point where it would try to use an x coordinate greater than 319), nor does it come close to going out of bounds at the top (i.e., a y coordinate less than 0).   Unfortunately, the Plus/4 screenshot doesn't tell us whether or not there's a problem at the bottom of the bitmap, because the Plus/4 split-screen hi-res mode has 5 lines of text displayed instead of whatever happened at the bottom of its bitmap. (Remember, on the Plus/4 the mode invoked by its 'GRAPHIC 2' bitmap initiation command results in a split screen, not a text overlay.) 

We do know FOR SURE, however, that there's a problem on the left (when variable X1 is less than 0).  

With all this in mind, I think we can safely limit bounds checking (at least for now) to just the cases where variable 'X1' would be less than 0 (trying to plot further left than the left side of the screen) or 'Y1' would be greater than 199 (trying to plot below the bottom of the screen).   We can also fix the original author's mistaken reference to the variable named 'Y' by substituting the 'Y1' variable that is actually passed to the graphics commands.   

Here's a screen shot of the fixed line 1 (the little arrows show what I changed):
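In text form (in case the screenshot is hard to read), the fixed line 1 reads:

	1 X=BB+H/B+E:Y=DD-H/B+F:X1=INT(.85*X):Y1=INT(.9*(G-Y)):IFX1<0ORY1>199THENRETURN

The 'X1<0' term catches the off-the-left-edge case that caused our error, and 'Y1>199' guards the bottom of the bitmap.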


Alright.   I know I really over-explained all that, but this is for the newbies and I just hope you rock-star machine-code and C guys can just bear with me and maybe nod and smirk to yourselves rather than flaming me, ok? 

OK, let's 'RUN' it again and see if we got rid of the error!


Well, LOOK at that!   It worked.  

We've gotten the same neat looking output as the original Plus/4 screenshot.   Of course, in the X16's 'SCREEN $80' mode, the VERA can print text anywhere over the bitmap as an overlay (whereas the split screen mode used in the Plus/4 starts text output at the bottom five lines).  As a result, our text display of the elapsed 'TI$' variable was output at the top instead of the bottom...  

Even so, I'd call this a success in meeting our first milestone:     We made it work. 

And, by the way, look at that time (remember, it's HHMMSS format).   It 'only' took 11 minutes and 4 seconds.   But compared to 2 hours and 42 minutes on the original platform, we can see that the X16 really screams in performance compared to the Plus/4.   In fact, we achieved much better performance than can be explained just by the processor being 8 MHz as compared to the 1 MHz unit on the Plus/4.    This was more than 8 times faster.  

Still, I see lots of opportunities in this code to make it even faster.   I don't know if we'll be able to cut that 11 minutes in half, but I do think we'll be able to get it under 10 minutes and, who knows, maybe we'll do even better....  

But the optimization question is something we will work on in the next post.   Stay tuned!



In the previous posts, we selected a simple candidate program (a graphics routine written for the Commodore Plus/4) for conversion to the X16; decided it would be an easy one; got it loaded into the X16 emulator; and adapted it into a working program listing by substituting X16 bitmap graphics commands (with the right parameters) in place of those from the Plus/4.  And it worked: the X16 produced the same output on its BASIC bitmap. 

By setting the special system 'TI$' variable to "000000" at the start of the main loops and printing its value to the screen at the end, we learned that the X16 completes its drawing in 11 minutes, 4 seconds (whereas the Plus/4 would take over 2 hours and 42 minutes!).  The X16 is even faster than its 8 MHz vs. 1 MHz CPU advantage over the Plus/4 can account for standing alone.  And that was even before starting to optimize the BASIC code.  

So now, let's 'make it fast.'     In my last post, I set a modest goal:  To get the elapsed time to less than 10 minutes.   On further reflection, it occurs to me that this HOWTO will be far more persuasive if we can do BETTER than that.   So let's shoot for getting to an elapsed time of less than NINE minutes, and in particular to around 8 minutes, 50 seconds or less.  That would mean successfully shaving more than two minutes, approximately 20%, off the current runtime, and that's worthwhile.

Here's a copy/paste of the program so far:

	0 GOTO 1000
	1 X=BB+H/B+E:Y=DD-H/B+F:X1=INT(.85*X):Y1=INT(.9*(G-Y)):IFX1<0ORY1>199THENRETURN
	2 PSET X1,Y1,0:LINE X1,Y1+1,X1,199,1:RETURN
	4 N2=90:REM _  ''    ''    ''    ''
	5 A=144:B=2.25:C=N1:D=.0327:E=160:F=N2:G=199
	6 SCREEN $80:TI$="000000"
	1000 DEF FNR(Q)=COS(Q)+COS(2*Q)+COS(5*Q):GOTO3


Even if we don't care to figure out how they did a facsimile of axonometry on an 8-bit machine in 1984, it helps to try to understand this program -- not in the order it's listed (since there's a little spaghetti here), but structurally/functionally:

INIT:   Initialization consists of the jump from line 0 to 1000, where 'FN R()' is defined (stacked cosines); followed by a jump back to lines 3 to 6, which set variables; followed by graphics screen initialization and our command to reset the TI$ system variable.  Notably, the variables in line 5 do not appear on the left side of an equals sign anywhere else in the program.  They're never changed, so we'll regard them as 'constants' intended always to hold their initial values.

LOOPS:   There are two loops, with the second nested inside the first.   The outer loop is indexed by variable 'H' (for now).  A calculation based in part on the value of that indexing variable affects the range of the inner loop indexed by 'BB' (again, for now).  The inner loop calls the function 'FN R()'; does some tweaking of the results; jumps (using a 'GOSUB') to a 'plotting' subroutine located at line 1; and upon 'RETURN' from that subroutine, execution reaches the 'NEXT:NEXT' that triggers the next iteration of the loops.  Architecturally, the BASIC parser only reaches the second 'NEXT' when it falls through the first 'NEXT' due to the inner loop completing an entire run through its range.   (And, a fun fact the original author obviously knew:  'NEXT' without a reference to an indexing variable is faster than 'NEXT V' with an indexing variable.   If you're looping two things to where you need that double NEXT, the 'NEXT:NEXT' construction is materially faster than a single 'NEXT' naming both variables.)
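A minimal illustration of that construction (a throwaway pair of loops, not from our program):

	10 FORA=1TO100:FORB=1TO100
	20 NEXT:NEXT   :REM FASTER THAN 'NEXTB,A' -- NO VARIABLE LOOKUP NEEDED

A bare 'NEXT' always closes the innermost open loop, so the first 'NEXT' here closes the B loop and the second closes the A loop, with no name matching for the interpreter to do.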

PLOTTING:   The subroutine at lines 1 and 2 calculates a set of x,y coordinates, tweaks them to become 'X1' and 'Y1', and plots the latter.   There are calculations that scale x and y by multiplying by .85 and .9, respectively; and we see a subtraction of the raw y from variable 'G' (set as constant '199' during initialization) in the process, which means the original author inverted the vertical coordinate to get the plot to take the desired orientation on the screen.   The end result is actually plotting coordinates 'X1' and 'Y1', which you will recall we proceed to do only after using the 'IF/THEN' statement in line 1 to do our bounds checking as to the left and bottom of the screen.   Line 2 plots a foreground (black) pixel at X1,Y1, and then draws a straight vertical background (white) line from just below that pixel all the way to the bottom of the screen (vertical coordinate 199).   This 'erasure' part of the plotting process clears out pixels below the current curve that were put there by prior curves, and contributes to the 3D effect by making it appear as if earlier-plotted curves are further in the background, while each subsequent curve appears to be located more and more toward the foreground. 

WHEW.    That's a lot of verbal description for such a short program, but I believe it's helpful to understand what's going on here.    

Quick observation:  Since we know the constants that represent the outer loop's starting point (-A, or '-144'), ending point (A, or '144') and increment (variable 'B', i.e., '2.25'), we can do the math and surmise that the outer loop runs 128 times.   But the endpoint of the inner loop depends on a calculation keyed to the then-current value of the outer loop indexing variable, so we don't know at a glance how many times the inner loop iterates in total through a full run of the program.   We do know that for every outer loop iteration, there is a full run of the range of the inner loop.   That makes sense from watching the program draw its output.   The outer loop sets the start point and parameters of a varying sinusoidal curve. The inner loop then walks along that curve and plots the points sampled at regular intervals onto the bitmap screen.    

Let's count inner loop iterations.  To do so, we'll perform a quick 'RUN' of the program with three temporary changes:

  • We replace everything in line 8 up to and including the 'GOSUB1' within the inner loop with the command 'CT=CT+1'  (Note:  'CT' is a new variable solely for this temporary purpose and stands for 'count').   
  • Instead of printing our 'TI$' elapsed time at the end, we will have it  'PRINT CT' to display our final count.
  • We 'REM'  out the line that sets the bitmap screen since we don't need it for this test. 

This won't plot anything to the screen.    Instead, it's just going to increase variable 'CT' by 1 for every single iteration of the inner loop.  That will tell us how many times the inner loop and all the commands, calculations, variable fetches/stores, etc., within the inner loop are executed in a complete run of this program.    Let's run it:


So there it is.   The things inside the inner loop are executed 33,175 times over the full course of the program.   That's where the most value from optimizations will occur.   That inner loop.   

Anyhow, after reverting back to the actual program now that we have our inner loop count, I'm going to do four initial categories of things in a first optimization pass:

A.    Organizing.  I'm reordering the program to get rid of some spaghetti.  Let's just put all the initialization stuff at the start, followed by the nested loops.   And instead of the 'plotting' sequence getting jumped to via a 'GOSUB' from the inner loop, I'll just inline that sequence within the inner loop (between the 'FOR' and 'NEXT' commands).   A 'GOSUB' to one of the very first lines of a program is about as fast as 'GOSUB/GOTO' can be accomplished, but we'll still save some time avoiding it.   Since this means our bounds-checking 'IF/THEN' statement can't simply execute a 'RETURN' to skip the plotting commands, we'll do a forward 'GOTO' instead to jump over the 'PSET' and 'LINE' commands if the bounds check determines the current coordinates are off the left or bottom of the screen. 
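Schematically, the inlined sequence looks something like this (the line numbers are hypothetical placeholders, not the final renumbering, and the coordinate math is elided):

	30 FOR ... :REM INNER LOOP STARTS
	40 REM ... COORDINATE CALCULATIONS SETTING X1 AND Y1 ...
	50 IFX1<0ORY1>199THEN70  :REM OUT OF BOUNDS? SKIP THE PLOTTING
	60 PSETX1,Y1,0:LINEX1,Y1+1,X1,199,1
	70 NEXT:NEXT

The forward 'GOTO' baked into the THEN clause replaces the old 'RETURN', and the loop's 'NEXT:NEXT' becomes the natural landing spot.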

B.    Renaming Variables & Switching to 1-Character Variable Names.  As a point of personal privilege, I'm going to rename some variables.   To help you all follow along, I'm going to change the outer loop indexing variable to 'O' (for 'outer') and the inner loop indexing variable to 'I' (for 'inner').   I then replace 'H' with 'O' and 'BB' with 'I' throughout, wherever they're used.  Any variable that has a two-letter name can be sped up a hair by changing it to a single-letter name, and I'll do that as well (except for 'X1' and 'Y1', because I have some thoughts about those and a possible improvement to the program I want to deal with later in the optimization process, probably in the next post).   Again, any renamed variable has to be replaced wherever it's mentioned.   And, since the variables originally 'INPUT' as 'N1' and 'N2' (which you'll recall I just set to the recommended values from the original when I converted this) are directly placed into variables 'C' and 'F' (respectively) in the initialization, and thereafter unaltered (treated as constants), I'm taking out 'N1' and 'N2' and sticking their values directly into 'C' and 'F'.   No need to keep 'N1' and 'N2' as middlemen in that process.   

C.   A couple 'EASY' math / operations tweaks

        i)   In performing the work just above, I noticed one of the variables, 'AA', was set by taking the 'INT()' (integer) result of a calculation.  I'm not only renaming that variable, I'm changing its type to an integer by adding the '%' signifier at the end of the variable name.  We'll call it 'A%' -- still two characters, but this change lets us remove the 'INT()' function and its parentheses.  In BASIC, assigning a floating point number to an integer variable rounds it down to a whole number, which is exactly what the 'INT()' function does anyway (both round toward negative infinity), except that 'INT()' is a separate function call that takes longer than this approach.

CAREFUL!  Whereas regular 'float' variables in BASIC can store extremely large and extremely small numbers, an integer variable (named with the '%' type suffix) can store only a signed 16-bit value.  That means if the value destined for the integer variable falls outside the range −32768 through 32767, this trick will break things (on a C64 -- and thus the X16 -- it generates an 'ILLEGAL QUANTITY' error) and we'd have to keep the 'INT()'.  But looking at the program and the values/operations that feed the calculation, that doesn't look like it will be a problem in this instance.

While we're getting rid of 'INT()' functions, look at that plotting routine originally at line 1.  The author scales the raw 'X' and 'Y' by doing a calculation inside an 'INT()' in each case.  But it turns out that neither Plus/4 nor X16 BASIC graphics commands (at least the ones our program uses) care at all if you pass them a floating point number as a pixel coordinate parameter.  The commands simply disregard everything after the decimal point!  That means if 'X1=10' and 'Y1=10', the command 'PSET X1,Y1,0' plots a pixel at coordinates 10,10.  And if 'X1=10.771' and 'Y1=10.345901', well, 'PSET X1,Y1,0' will STILL plot a pixel at coordinates 10,10.  So here we can simply REMOVE the 'INT()' from those 'X1' and 'Y1' calculations.  It was never necessary.
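A quick sketch of both tricks (the values here are hypothetical, just to show the behavior):

```basic
10 REM INTEGER ASSIGNMENT ROUNDS DOWN, SO NO INT() IS NEEDED
20 A%=10.771:PRINT A%:REM PRINTS 10
30 REM THE GRAPHICS COMMANDS DISCARD THE FRACTION THEMSELVES
40 X1=10.771:Y1=10.345901
50 PSET X1,Y1,0:REM PLOTS AT 10,10 (ASSUMES BITMAP MODE IS ACTIVE)
```

(Line 50 assumes you've already switched to the bitmap screen; it's just an illustration of the coordinate truncation, not part of our program.)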

        ii)   Notice, in the stack of cosines from the function we moved inline, that the second cosine operation is 'COS(2*Q)'.  Multiplication is much slower in BASIC than addition, so we can tweak out a little speed increase by replacing that with 'COS(Q+Q)' quite easily.  'Q+Q' is of course the same thing as '2*Q', but internally the addition takes fewer CPU cycles than a multiplication.  This trick only works for replacing a '2*n' operation with 'n+n', because transforming a single multiplication into two or more additions (with the accompanying extra variable fetches) -- i.e., trying to convert '3*n' to 'n+n+n' or anything more ambitious -- would take longer than leaving the original expression as a multiply.
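If you want to see the difference for yourself, a quick-and-dirty timing harness along these lines will show it (the iteration count and values are arbitrary; 'TI' counts jiffies, i.e. 60ths of a second):

```basic
10 Q=1.234:TI$="000000"
20 FORI=1TO10000:A=2*Q:NEXT:PRINT"2*Q: ";TI;" JIFFIES"
30 TI$="000000"
40 FORI=1TO10000:A=Q+Q:NEXT:PRINT"Q+Q: ";TI;" JIFFIES"
```

The second loop should consistently report fewer jiffies than the first.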

D.   Pleasing the Parser Gods.   Switching to single-character variables is itself a parser trick and will be very slightly faster in terms of BASIC running the program through its interpreter.  I'm also removing the spaces I included in the original conversion of the PSET and LINE commands.  Every space character has to be read, determined to be a space, and thrown out by the parser.  That takes time, so it matters.  Moving on, there are a few more parser tricks to apply at this time:

     i)   The original author's plotting routine uses the number '199' inline at the graphics command (and we carried that over when we converted to the X16 graphics commands).  There's also a '199' spelled out in the bounds-check IF/THEN.  But look!  The original author's variable 'G' is already set to '199' as a constant during initialization.  Let's use it!  We can substitute the variable 'G' into our 'IF/THEN' and 'LINE' commands in place of the spelled-out '199'.  Now, instead of the BASIC parser having (in each of those cases, and in every inner-loop iteration) to fetch three PETSCII characters ('1', '9', '9') from that part of the program and then convert that sequence into a numerical (rather than character) representation in memory, it will just fetch the contents of variable 'G', which already holds that value in numerical form.
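Here's a small harness, along the same lines as before (counts arbitrary), that isolates just this effect:

```basic
10 G=199:TI$="000000"
20 FORI=1TO10000:A=199:NEXT:PRINT"LITERAL 199: ";TI;" JIFFIES"
30 TI$="000000"
40 FORI=1TO10000:A=G:NEXT:PRINT"VARIABLE G:  ";TI;" JIFFIES"
```

The variable fetch wins because the three-character text-to-float conversion happens only once, at line 10, instead of on every pass.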

     ii)   Did you know that, when you define a function in BASIC with 'DEF FN', all that happens 'under the hood' is that BASIC uses a data structure in the place it keeps variables to hold a jump address, plus another as scratch space?  It's true.  When your program later makes the function 'call', BASIC simply redirects the interpreter's parser to the part of the BASIC listing where the function was originally defined; parses the entire sequence of characters comprising the function there; uses the scratch space to supply the value passed to the function so it can evaluate the expression and get a result; and then ultimately jumps back to the calling location.  It's really just a hidden sort of 'GOSUB/RETURN' structure (but with arguably more overhead), which means it can take more time than simply including the full expression at the place where you want to call it.  Defining a function IS useful if several different branches of your program (different locations or routines scattered through the listing) all need the exact same function -- and at the very least it saves typing and reduces overall program length.  But if the function is just called over and over from inside a loop (as is the case here), the defined function is most often slower than just setting out the formula 'inline'.  For that reason, I'm going to dispense with the defined function in line 1000 and just place the whole formula at the location where we set variable 'D1' (which has, thus far, been the variable assignment where that function was called).
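To make the trade-off concrete, here's a hypothetical side-by-side (NOT our program's actual function -- just a two-cosine stand-in; counts arbitrary):

```basic
10 DEF FNA(Q)=COS(Q)+COS(Q+Q):TI$="000000"
20 FORI=1TO2000:R=FNA(I):NEXT:PRINT"DEF FN: ";TI;" JIFFIES"
30 TI$="000000"
40 FORI=1TO2000:R=COS(I)+COS(I+I):NEXT:PRINT"INLINE: ";TI;" JIFFIES"
```

The inline version skips the jump-out/jump-back and the scratch-variable shuffle on every single call.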

     iii)  This last parser trick originated, I believe, with the late, great Jim Butterfield -- GURU of all things Commodore in the 80s and beyond.  It appeared in a newsletter or one of the many magazines that regularly ran Jim's columns.  OK, so: notice that in the plotting commands we have a value of zero ('0') in the 'PSET' command (at the 'color' parameter) and another zero in the part of the 'IF/THEN' bounds check that tests the 'X1' coordinate?  We're going to change those zero characters ('0') to decimal points ('.').  It turns out that both '0' and '.' are parsed by BASIC to the zero value in memory; however, due to something technical about the way the parser works, the decimal-point approach is measurably faster.  I think it has something to do with the way BASIC populates its floating point accumulator internally, but who knows.  Is this small change worth it?  Well, take a look at the following two command-line exercises.  Both clear ('CLR') variable memory to a clean slate; reset the system TI$ variable; and then perform a 'FOR/NEXT' loop that sets two variables to the zero value 33,175 times.  When it's done, each writes the elapsed time from the 'TI$' variable to the screen in HHMMSS format.  They're identical in every respect except that the second exercise uses the '.' decimal point in place of the '0' character:
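(The screenshots of those two exercises haven't survived here, but from the description they would have looked something like these two direct-mode lines:)

```basic
CLR:TI$="000000":FORI=1TO33175:A=0:B=0:NEXT:PRINTTI$
CLR:TI$="000000":FORI=1TO33175:A=.:B=.:NEXT:PRINTTI$
```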


I want to get the elapsed time of this program from 11 minutes and 4 seconds to less than 8 minutes and 50 seconds.      That's a goal of cutting 134 seconds.    Just that one tiny change at two spots in the inner loop  (changing  '0' to  '.') -- when iterated over 33,175 operations --  looks like it can save approximately 5 seconds.  Not bad. 

Anyway, after the first pass with the optimizations / work from Parts  A, B, C and D, above, here is what our program looks like:


So you're not confused, here are the variable renames:

	'H' -------->   'O'
	'AA' -------->   'A%'
	'BB' -------->   'I'
	'CC' -------->   'Q'
	'D1' -------->   'R'
	'DD' -------->   'S'

Let's run it and see if any of that helped.


Fantastic!   We went from 11 minutes 4 seconds to 10 minutes 6 seconds, shaving 58 seconds off our original time.  And THAT is before doing any of the more difficult, but more profitable, optimizations still available to us!  I think we will succeed in reaching our goal of getting below 8 minutes 50 seconds in the end, but that is for the LATER POSTS in this thread, where we will:

--  Make an improvement that should both speed up the program and get rid of an annoying visual glitch I see in both the original PLUS/4 and current output.

--  Simplify math operations further, if possible;

-- Move calculations (and parts of calculations) OUT of the inner loop;

-- Retool our bounds check so as to branch early in one of the two 'out-of-bounds' scenarios, avoiding a bunch of unnecessary calculations that we currently perform (because we conduct bounds checking in a single combined conditional branch).

-- AND, finally, we will dive head first into some technical details of the poop-pile that is Commodore style BASIC variable storage and look-ups and do that (clickbait:) "one little trick" that makes a major speed improvement by taking this knowledge of variable fetching/storage routines into account!

Stay tuned for the next post, hopefully by the 4th of July.   Thanks for reading.




Edited by Snickers11001001


In the first post, we picked a little BASIC bitmap graphics program to convert from the Plus/4 to the X16.  In the second post, we accomplished the 'functional' adaptation. 

Using turbo mode on the Plus/4 emulator, we found it took hours (2h, 42m+) to run the full plotting sequence.  But even before any optimization, our newly minted X16 version got things done in 'just' 11 minutes and 4 seconds.  

In the third post (just above), we did a first optimization pass with some 'easy' things (organization, shortening variables, simple efficiency tweaks).  We also paid homage to the gods of the BASIC parsing engine in a number of ways.  The result was that we reduced the runtime to 10 minutes and 6 seconds. Nearly a minute faster!   We're well on the way to my aspiration of a time under 8 minutes, 50 seconds. 

Now we are going to turn to the more profitable optimizations.  Although they're more effective when available, we do these last because the changes involve shuffling things around and (probably) multi-purposing certain variables by treating them almost like registers.  The resulting BASIC program will be more difficult to understand, and a lot harder to debug if it's not already working perfectly.  

Before we get started, let's implement a quality-of-life improvement.   When you switch from the bitmap (mode $80) BACK to the 40- or 80-column text-only screen (modes $00 or $02), the current ROM doesn't re-initialize the layers to the background color, and the result is sort of unreadable.  I'd guess that will change in some future update.  For now, we can do what's necessary manually from the command line when we switch to 40 columns, with the following combination: 


Let's add that to our program listing.  We'll want to pause the output before that switch, so let's put in a line that will 'GET' the current keypress (we will stick it in a string variable called 'KEY$' -- which is really just 'KE$', since Commodore BASIC only cares about the first two characters of a variable name), and use an 'IF/THEN' to force a re-run of that line until a key is pressed.  Then execution can fall through to our little clean-up command (let's also 'LIST' the program there) once the user presses something.   
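A sketch of that pause-then-cleanup sequence (the line numbers are placeholders, and the exact cleanup commands on line 110 depend on the ROM version as discussed above):

```basic
100 GETKE$:IFKE$=""THEN100:REM SPIN HERE UNTIL A KEY IS PRESSED
110 SCREEN0:REM BACK TO 40-COLUMN TEXT, PLUS WHATEVER COLOR/CLS CLEANUP IS NEEDED
120 LIST
```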


I think I'll also add a bunch of down cursor commands in the quote where we 'PRINT' the time at the end so it appears near the bottom left like the Plus/4 version.  Now, let's finally dig in to some more optimizations, continuing our fancy internet 'lettered list'.  

E. Making it faster WHILE making it look better.

Our first optimization is for both speed and a visual improvement.  Take a look at this portion of a screen shot of the program's output:


See that vertical banding/glitching?   I've stuck in a bunch of gaudy, ugly arrows to line up with the vertical irregularities in the pixel plots below them, so you can see what I'm talking about.  That 'banding' is present in both our X16 adaptation and the original Plus/4 version.   And it's not due to flaws in the underlying curve calculations.  (The sine/cosine functions in C64, Plus/4 and C128 -- and thus X16 -- BASIC are accurate to many more bits of precision than could possibly make any difference when translating the results to a relatively low-res 320x200 plotting area.)  No, those glitches are what happen when you re-scale in a very crude way: i.e., by starting with pixel locations that were based on directly sampling specific intervals along the underlying raw curve function, and multiplying them by a coefficient that doesn't divide evenly into the sample intervals.   

Those of you who have ever tried to use scaling settings in DOSBOX to resize a game's display from the original MSDOS resolution to a bigger window -- but where the scaling factor isn't an even divisor into both the original and destination resolutions -- will have seen something similar.  

The 'scaling down' multiplications here are the ones where the original author introduced variables 'X1' and 'Y1', which were assigned by multiplying '.85' and '.9', respectively, against the 'X' and 'Y' variables previously generated directly from sampling intervals at points on the underlying functions.  For several reasons, I believe this scaling was a last-minute addition by the original author (or, more likely, the editors of the magazine where the program listing appeared), but that's beside the point.  You're probably not reading this thread for speculative software anthropology.  

Bottom line: (1) the vertical banding looks bad; and (2) the extra multiplications in the inner loop make things slower (they cause extra parsing, and the introduction of variables 'X1' and 'Y1' is bound to result in serious performance losses -- maybe even in part due to variable management issues we'll talk about at some point before I'm done with this excessive odyssey about a very short program).

Let's remove those crude scalers.  The relevant program lines go from this:

5 X=I+O/B+E:Y=S-O/B+F:X1=.85*X:Y1=.9*(G-Y):IFX1<.ORY1>GTHEN7
6 PSETX1,Y1,.:LINEX1,Y1+1,X1,G,1

to this:
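(The screenshot of the revised lines is missing here, but based on the description that follows -- rolling the subtraction from 'G' into the 'Y' expression and passing 'X' and 'Y' directly -- they presumably read along these lines:)

```basic
5 X=I+O/B+E:Y=G-S+O/B-F:IFX<.ORY>GTHEN7
6 PSETX,Y,.:LINEX,Y+1,X,G,1
```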


As you can see, we just rolled the 'subtraction from variable G' into the original expression to calculate 'Y';  also, we had to change the 'PSET' and 'LINE' commands to pass them 'X' and 'Y' instead of passing the scaled 'X1' and 'Y1' variables, which we've eliminated.   

There's a risk here:  When we first got the program working, we did bounds checking only as to the left and bottom of the bitmap.   If this was scaled down by the original author to keep it from spilling off the right side of the screen when plotted at full size, we'll get an error.   Well, let's 'RUN' it and see:


Shazam!   In my view, that looks WAY better.  No vertical banding/glitching, and it fills the screen nicely.  It came close but didn't spill off the right side so we didn't pay for our omitted right side bounds checking with an error.    

And, wait a second, what's this?  OH.MY.WOW.... Look at that time!!!    8 minutes, 40 seconds.   I knew we would also be speeding things up, but this shaved a huge amount off the elapsed time from the prior version!    So much that we've already bested our goal of getting this sucker below 8 minutes and 50 seconds!   

I guess we're done...  right?   

Just kidding.  I've got more to do, and now I'm really curious (and, if I'm honest, tending toward unnecessarily manic) to see just how far we can take this.  Now, part of the speed-up here is for obvious reasons:  For each of the 33,175 iterations of our inner loop, we eliminated 24 byte parses, 4 variable fetches, and two multiplication operations.  But I suspect the biggest part of the gain was a consequence of that variable management issue I keep hinting at.  Sorry, that's still for the very last part of my write-up, but I'll get there (eventually). 

F.  EVICTING calculations, parts of expressions, and slow things from the Inner Loop:

I have to confess a math error.   In the prior post I said the outer loop indexed by 'O' runs 128 times.   I did the math in my head and overlooked the need to add 1 because 'FOR/NEXT' loops run from their starting point to ending point 'both inclusive.'   If the interpreter adds the interval value to the current indexing variable and the result is less than OR equal to the ending value specified in the 'FOR' statement, it will loop again.  So there are actually 129 outer loop iterations.   

As you will recall, we made a temporary alteration to the code listing in a previous post to add a counter, and this helped us figure out that the inner loop runs 33,175 iterations over the entire run of the program.  While not every iteration plots to the screen (if coordinates fail our limited bounds checking, the program skips the line with the 'PSET' and 'LINE' commands), every other operation inside that inner loop must be parsed and performed those same 33,175 times.  Do the math: the outer loop's own statements execute just 129 times, against 33,175 executions for everything in the inner loop, so the inner loop accounts for well over 99% of the work done by the program.  That means if we can move stuff out of the inner loop into the outer loop (or kick it out of the loops entirely), we will create some real time savings.  Let's do that now. 

I'm going to walk in detail through the first two of these, and then more briefly describe the rest.   Here's the program lines containing our current outer/inner loops so you can refer to it when reading what follows:


     i.     Let's start with that 'O*O' calculation we see up there.  Both our outer loop AND our inner loop calculate 'O squared', and we'll change both, but it's in the inner loop where it will help.  Remember, 'O' is now the indexing variable for our outer loop, so the result of that calculation changes every time the outer loop iterates.  But since the INNER loop runs nestled inside the outer loop, the value of 'O' remains the same for any full run of the inner loop (as part of any given iteration of the outer loop).  We know that 'O*O' is a discrete operation that can be plucked out because (a) its value does not depend on the inner-loop variable (or on some other variable altered earlier within the inner loop); and (b) BASIC's hierarchy of expression evaluation puts multiplication/division at a higher priority (done first) than addition/subtraction, so the 'O*O' expression can be taken out and replaced with a variable without altering the evaluation of the larger expression of which it is a part.  Thus, there is NO REASON to perform 'O*O' some 33,175 times in the inner loop!  Here's what we're going to do:

Let's introduce a new variable 'J' and, immediately after the start of the outer loop, set it with 'J=O*O'.  Locating that calculation right there means that every time the outer loop iterates and its indexing variable 'O' changes, our 'J' value gets updated before the rest of the outer loop proceeds and before the full inner-loop sequence commences.  So we can replace the 'O*O' calculations in both the outer AND inner loops with a reference to variable 'J'.  For every single iteration of the inner loop (that's 33,175 times!) we have now replaced two fetches of variable 'O' with a single fetch of variable 'J', and skipped that costly multiplication operation. 
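In skeleton form, the hoist looks like this (the loop bounds and the expression are placeholders, not our program's actual lines -- they're just here to show the shape of the change):

```basic
10 REM BEFORE: O*O RECOMPUTED ON EVERY INNER ITERATION
20 FORO=0TO8STEP.0625:FORI=0TO255
30 D=SQR(O*O+I*I)
40 NEXT:NEXT
50 REM AFTER: ONE MULTIPLY PER OUTER ITERATION, A CHEAP FETCH INSIDE
60 FORO=0TO8STEP.0625:J=O*O:FORI=0TO255
70 D=SQR(J+I*I)
80 NEXT:NEXT
```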

     ii.     Same thing with the calculation 'O/B'.  Division is even slower than multiplication.  We already know 'O' doesn't change during a run of the inner loop.  And we know from a few posts up that the variable 'B' is a constant set at program initialization and not altered anywhere else in the program.  So TWICE within every iteration (all 33,175 of 'em) of the inner loop, we have been doing 'expensive' division operations that don't need to be there -- every new iteration within the same run of the inner loop produces exactly the same result as the one before!  We'll evict these calculations by introducing another new variable, 'T' this time, set early in the outer loop as 'T=O/B'.  Then we replace the two 'O/B' operations in the inner loop with 'T'.  Again, we can be confident it's OK to do so because the hierarchy of operations means that, before our alteration, the 'O/B' divisions were evaluated at those locations before the interpreter proceeded with the neighboring additions/subtractions against the result.  Therefore, replacing the calculation with a variable fetch does not affect the evaluation of the larger expression.

Here's where we are in terms of updates so far to the lines containing our loops:


So hopefully you get the gist of what we're doing here.  I immediately see two more obvious 'kick it out completely' evictions, and I'll explain them only briefly before we try another 'RUN' to see how we're coming along:

      iii.    Swapping the order of the commands in the 'PSET' and 'LINE' part of the listing lets us eliminate a '+1' math operation that gets parsed and processed by the 'LINE' command 33,175 times in the inner loop.  It works because the 'LINE' plots the background color and the 'PSET' plots the foreground (black) pixel.  Instead of starting the line (erasure) operation one pixel below ('Y+1') the current coordinates, we just draw the background line first, starting right at the current 'Y', and then plot over that background pixel with the 'PSET' operation afterwards.  Same exact visual result, but we save the parse of two bytes ('+' and '1') and an addition operation for every run of the inner loop that results in a pixel plot (i.e., most of the iterations) by doing it this way: 
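(Reconstructed from the description -- compare with the earlier version of line 6:)

```basic
6 LINEX,Y,X,G,1:PSETX,Y,.
```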


     iv.     After the program sets variable 'R' to the result of that stack of cosine functions, it sets another variable 'S' to the value held in 'R' times the value of variable 'C' -- a constant set during initialization.  That 'R' variable is holding an intermediate value, and 'R' is not used anywhere else in the program except to be multiplied by the constant 'C' to put a value into 'S'.  By and large, you should not store intermediate values where it is not necessary to do so.  It turns out that parsing one set of 'added' parentheses during expression evaluation is still faster than the extra variable fetches and stores (and the net additional bytes of character parsing) needed to perform the 'S=R*C' assignment.  That means we can simply set 'S' to the result of the stack of cosines AND the multiplication by 'C', all evaluated as a single expression, without using variable 'R' to hold an intermediate value.  We do have to put the stack of cosines in parentheses, since that whole mess needs to be evaluated first and the result multiplied by 'C'.  Still: bye-bye, variable 'R' -- and here's the revised line: 


OK, so here's our current listing, which we should save:   


Now, let's 'RUN' it and see where we are on the question of speed:


Not bad.   8 minutes, 2 seconds elapsed.  Another good time savings, and by now it's safe to say that we've blown the doors off our initial optimization goals.  Let's move on.  It's like they say in that dumb party game: 'How low can you go?' 

G.   Transforming a Branch Intended to Avoid Errors into a Time Savings:

Let's start this particular optimization with an analogy that, believe it or not, is surprisingly useful in its parallels with our program:   

Suppose you've decided to make tacos for your grandchild, niece/nephew, or whoever is visiting you for the weekend.  You'll need to take the ground meat out; fry it up with taco seasonings; chop onions; shred lettuce; dice tomatoes; mince cilantro; grate cheese; heat taco shells; plate everything up; and sit down with the kiddo to partake in the feast.  But there are a couple of wrinkles: (i) you have no idea whether the kiddo even LIKES tacos, but you know the youngster is a spoiled brat who will throw a tantrum if (s)he doesn't like that type of food; and (ii) you have never actually prepared tacos, since you normally go for TV dinners or take-out when you don't have company -- so it's unclear whether your tacos will turn out tasty, or even edible.  

In our program, we KNOW that we likewise have two wrinkles within that inner loop.  The way the set of curve progressions works (i.e., in order for the image to develop correctly), the initial parts of several curves actually generate horizontal ('X') coordinates that are less than zero and thus off the left side of the bitmap.  Also (having watched this plot out too many times now), I can tell you that later during the run -- although not quite apparent from the final image on screen -- the formulas generate some vertical-axis ('Y') coordinates that are greater than 199 and would fall below the bottom of the screen (causing an error, were it not for the fact that our program does bounds-check whether 'Y' is greater than 199). 

Let's call the boundary issue with the X axis the question of whether the kiddo likes tacos; and the boundary issue with the Y axis the question of whether the food actually turns out to be any good once you're done cooking.  This is an apt comparison if you look at our program listing carefully: the final 'X' coordinate is generated within the inner loop by just a couple of simple additions to the value of the inner loop's indexing variable 'I'.  That's like the easy question of asking: "Hey, kiddo, do you like home-made tacos?"  Simple.  But the final 'Y' coordinate requires (within every iteration of the inner loop, mind you) a big string of slow calculations: there's a square root ('SQR()') function, which is slow indeed; a bunch of multiplications; a stack of cosines (each one quite slow); and some additions/subtractions at the end, all involving a significant number of variable fetches.  Sheesh.  That is a LOT of work.  And that's why deriving 'Y' is like the cooking/meal-prep process in my analogy: you don't know how THAT part turned out (whether the meal is any good, or the 'Y' coordinate is on or off the screen) until you've done all the hard work. 

Thus far, we have had our program test BOTH of those 'wrinkles' -- the 'X' and 'Y' boundary issues -- with a single 'IF/THEN' check that evaluates both conditions.  And we do this ONLY right before the plotting.  With the 'Y' coordinate, we don't have much of a choice: just like the question of whether the chef cooked up a decent-tasting taco plate, we only know what 'Y' looks like after doing all that work.  But with the 'X' coordinate, we have been doing our test VERY late in the game.  It's like getting all the food out, frying up the meat, cutting the veggies and getting everything plated, and only then asking the spoiled little brat if he/she even likes tacos.  That's the worst possible time to ask, because you've already done all the work.    

Hence this optimization.  Let's re-arrange things.  First, let's add yet another variable to the outer loop.  This time, we'll take the part of the 'X' derivation from inside the loop and simplify it further.  In the eviction process above, we already got rid of 'O/B' and replaced it with variable 'T' in the inner loop.  Now we go one step further with the 'X' part of the inner-loop calculations: we create variable 'U' directly after setting 'T' in the outer loop, as 'U=T+E'.  This has the effect of evicting variable 'E' from our inner loop too, because 'U' now substitutes for 'O/B+E' (previously, the inner loop summed the current value of its indexing variable 'I', the result of the already-evicted 'O/B' calculation, and 'E').     

Now we can split up our heretofore conjoined bounds-checking 'IF/THEN' statement.  Immediately after the beginning of our inner loop, we'll do the (very simple/cheap) calculation to generate 'X', so we can immediately see if 'X' is off the screen.  We know it happens some percentage of the time during a full run.  I'll confess that before taking the trouble to write up and test this part, I made it my business to KNOW the answer (I again made some temporary changes to the program listing to actually count this up).  The answer: during a full program sequence, the program skips plotting a pixel because the 'X' value is less than 0 a total of 1,176 times, or about 3.5% of the 33,175 inner-loop iterations.   

What a waste!  We have been 'cooking the entire meal' before checking this issue -- doing all those slow calculations to derive 'Y' even when the 'X' value would prevent a pixel plot anyway.  With the changes we just made, i.e., by evaluating 'X' at the outset of the inner loop, we know that if it fails the bounds check we can jump immediately to the next iteration of the loop.  If 'X' is bad, it doesn't matter what 'Y' is, which means this approach skips all those very slow calculations used to derive 'Y' in about 3.5 percent of the inner-loop iterations.

And that's the lesson of this subsection:   Wherever possible, consider whether you can avoid doing costly/slow work before a branch when  it is predictable  that a material percentage of the time the branch will in fact occur and make that work moot. 
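In skeleton form, the restructured loop looks something like this (the line numbers are placeholders, and the slow 'Y' math is elided with a REM -- this is schematic, not a runnable listing):

```basic
2 FORO=...:T=O/B:U=T+E:REM OUTER LOOP, HOISTED INVARIANTS
3 FORI=...:X=I+U:IFX<.THEN6:REM CHEAP TEST FIRST; SKIP ALL THE SLOW WORK
4 REM ...SQR/COSINE STACK DERIVES Y HERE... :IFY>GTHEN6:REM SECOND, LATE CHECK
5 LINEX,Y,X,G,1:PSETX,Y,.
6 NEXT:NEXT
```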

OK, here's the resulting listing:


As you can see, if the test of 'X' reveals the pixel is 'in bounds' (i.e., no branch because of a failed bounds check on that easy-to-prepare 'X' value), program execution falls through to all those slow calculations used to derive 'Y'.  Then we have a second 'IF/THEN' statement to bounds-check 'Y' and prevent plotting when it exceeds the bottom coordinate of the screen, but that's easy.  In the end, our code has grown: we have added a second 'IF/THEN', and that has a real cost.  But I am betting that those extra 'IF/THEN' executions (one for nearly every inner-loop iteration that passes the 'X' check) will cost less, in total, than the time we save by avoiding the work of that extremely expensive 'Y' derivation in 1,176 of the iterations of our inner loop.  In other words, the hope is that the 'net' result is more speed. 

Am I right?  Well, let's 'RUN' it and see!  Here we go:  


OK.   We're now at 7 minutes, 46 seconds.    It was worth it in my view.    We've saved another 16 seconds off the total time with just this change. 

Well, then.  So far, optimizing is going even better than expected, and already we are seeing the program  runtime clock in more than a minute faster than the goal we set at the beginning of the optimization process.      

That said, this post is getting really long, so I am sorry to say I'll have to save the write up about BASIC variable storage and fetching -- and the changes we can make to optimize things further in view of that topic -- for the next time.   

You'll have to scroll past a couple of replies below, but it will be in this same thread.  Sorry I couldn't get to it in this post, but it's sort of interesting stuff and I want to take the time and space to do it justice.  Stay tuned.  








In the prior posts we found a Plus/4 graphics program and made the changes to adapt it to the X16 BASIC bitmap graphics commands (ok, and yes, a bit more fiddling to get it working).  Our original X16 adaptation took 11 minutes, 4 seconds to plot its output.  In a first round of optimizations, we hit the low hanging fruit (organization, simple math/efficiency tweaks, and parser tricks), to get the plotting time down to 10 minutes, 6 seconds.  

In my last post, we did a second optimization pass.  This time we really pulled out the stops, cutting the plotting time down to 7 minutes, 46 seconds. To get there we did more aggressive things, like removing an ugly/slow scaling operation, evicting expressions (and parts of expressions) from the inner loop, and restructuring our bounds-check branching to account for the two very different scenarios we had previously folded into a single conditional branch.  

Now it's time to finally talk about how the (Commodore 64 derived) X16 BASIC interpreter handles simple variables like those used throughout our program.  This will reveal a couple of ways to shave still more runtime off our little program before giving it a final polish. 

For this discussion, we will focus on 'scalar' variables (sometimes called 'simple' or 'regular' variables).  I'll call them 'simple' variables here.  We'll skip arrays for now, since we don't have any in our program.  Simple variables in BASIC are named storage references such as 'A', 'X1', 'G$' or 'I%' that can hold a single value of the variable's designated type (floating point, integer, or string).  Floating point variables have no additional designator after the name; integer variables carry the percent sign ('%') type designator; and string variables use the dollar sign ('$').  

Under the hood, each simple variable that is 'in use' (i.e., it has been initialized by having its value set) gets its own 7 bytes of memory at a location starting just beyond where the BASIC program listing ends in memory.  The first two bytes of each 7 byte sequence hold that variable's name (with high bits added or not to designate the type).  Some or all of the next 5 bytes hold the variable's assigned value in the case of a numeric variable, or if it's a string variable, a pointer with the memory address for the actual string, and a byte holding the string's length.   

Each time the BASIC interpreter is asked to assign a value to a new variable as program flow proceeds, it allocates the new variable its own 7 bytes, stored in memory after all the 7-byte sequences previously assigned to other variables.  When the value of a variable changes, it continues to use those same 7 bytes of storage at the same memory location where they were first placed.   There's no 'deletion' of a variable.  Even if you assign one a null value, it will continue to use those 7 bytes right where they were first allocated.  (You can only use 'CLR' to clear ALL variables.)  As noted earlier in the thread, a 'DEF FN' function assignment uses regular variable storage.  The function itself gets 7 bytes in simple variable space to hold its pointer and some other housekeeping data; and the 'placeholder' (sometimes called the dependent variable) used in the 'DEF FN F([placeholder])' statement also gets its own additional 7-byte allocation within the simple variable space. 

When BASIC is asked to do something with a variable (such as 'POKE A, 127' or 'A=1.61803399' or 'PRINT A*A') it goes on a little scavenger hunt.  It begins at what it knows to be the 16-bit memory address marking the start of simple variable storage and checks whether the first two bytes it finds there are the variable name/type it's looking for.  If not, it skips forward to the next group of 7 bytes; reads the 'name/type' bytes there to see if this time it's found the one it wants ...  and so on, ...  and so on, ... and so forth.   The interpreter only knows it's done searching when it either finds the variable, or reaches the memory address it knows is supposed to be the start of the next category of information in memory (i.e., the beginning of array space).  
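Here's a toy Python model of that scavenger hunt (purely illustrative -- the real interpreter walks raw 7-byte records in memory, not a Python list), showing how the comparison count grows with a variable's position in the table:

```python
# Simple-variable table as a list of [name, value] records, searched
# linearly from the start -- a stand-in for BASIC's walk through the
# 7-byte records described above.
variables = []

def store(name, value):
    for rec in variables:            # "nope, nope, nope..."
        if rec[0] == name:
            rec[1] = value
            return
    variables.append([name, value])  # new variables go at the end

def fetch(name):
    steps = 0
    for rec in variables:
        steps += 1                   # one name comparison per record
        if rec[0] == name:
            return rec[1], steps
    raise NameError(name)

for n in ("A", "X1", "G", "T", "Q"):
    store(n, 0)

_, first = fetch("A")   # initialized first: found immediately
_, last = fetch("Q")    # initialized last: 5 comparisons
assert first == 1 and last == 5
```

The later a variable was initialized, the more "nope" comparisons every single fetch or store of it costs -- which is exactly the effect we'll exploit below.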

To appreciate the implications of this way of fetching and storing variables, I want you to imagine you purchased a really REALLY cheapo cell phone.  It has no alphabetization or search function for the contacts list.  Each contact you add just goes into its own entry with the contact's name on top, and the contacts can be displayed, one contact per screen, and paged through only in the exact order you put the contacts in.  So every time you want to call or text a contact, you begin at the start of the sequence and then flip through each and every entry, one-at-a-time,  saying to yourself  "nope, nope, nope, nope, nope, nope,..." until you see the name of the contact you're looking for (or get to the end of the series and decide you need to add the person as a new contact).    The more contacts you put into that clunky contact list, the more work it is to flip to the entries for the most recent additions. 

THAT is how BASIC's interpreter has to fetch and store variables.  Each new variable that gets added (in the order encountered during the flow of program execution) is slower to use (fetch/store) than all the variables initialized earlier in the run of the program.  Now, of course, the internal machine language routine the BASIC interpreter uses to flip through the variables storage area and find the one it's looking for is fast, and probably benefits a LOT from the extra CPU cycles on the X16.    But it is STILL slowing things down.  

With this in mind, is it any wonder that getting rid of the 'X1' and 'Y1' calculations from that problematic scaling operation in my last post gave such good results in terms of speeding things up?  Not only were those variables the slowest ones (introduced last in the program flow), but they were part of expressions that called other 'late in the game' variables; and then 'X1' and 'Y1' were used in the bitmap 'PSET' and 'LINE' commands, which required multiple fetch operations for each of them. 

Now that we know the problem, let's go forth and optimize some more.  The solution is easy and can lead to further opportunities:

H.  Initializing variables in order of frequency of use, particularly within the inner loop.

Ideally, you would do this after you optimize everything else (so you don't have to keep re-tallying as you change things).  But I often can't resist the temptation to do it earlier in the process, even if it means I'll have to further tweak the order of initializations again later. 

The tallying process is easy:  Here's a screenshot:  


As you can see, I went over the listing and changed the color of the variables to be tallied up, for emphasis.  Then I listed them out, and counted them up with tick marks.  By the way, if the inner loop is a 'FOR/NEXT' structure (as here), and if variables are passed to the 'FOR' statement for the 'STEP' and 'TO' parameters, then you should not count those instances as being in the inner loop.  The values of those parameters are just pushed onto the BASIC stack when the 'FOR' is executed, and variables used to pass those parameters are not fetched again (or altered) in the loop iterations (in contrast with the indexing variable, which is updated each time the loop runs). 

Now that we have the count, we will 'assign' something to (and therefore initialize) the variables in order of priority at the very beginning of our program.   Here, we can use line 1, which has plenty of room.   We just set all those variables to zero, sequencing them from most frequently used to least frequently used in the inner loop.  Since we previously learned that a decimal character '.' parses as the zero value, we'll do that here just for style points (and as something of an homage to the spirit of Jim Butterfield and the other Commodore wonks who figured this stuff out).  

If you have a 'tie' in any of your tallies (two variables with the same fetch/store count in the inner loop) you can break the tie by considering how many additional times, if any, each of the variables with the same inner loop tally get used elsewhere in the program.  

As for variables destined to act as constants in the original initialization, those that participate within the inner loop should be placed in the preliminary initialization sequence according to their inner-loop tally.  They will still be set to their intended constant values when the interpreter gets to that part of the program initialization, but they'll retain their 7-byte spot in memory based on the priority order we chose when we first initialized them. 
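If you'd rather not count tick marks by hand, the tally can be automated.  Here's a hedged Python sketch that counts variable references in our (pre-optimization) inner-loop line and emits a line-1 style initialization ordered by frequency -- the keyword set is just whatever BASIC functions appear in this particular line, so you'd extend it for a real program:

```python
import re
from collections import Counter

# Inner-loop text from the earlier listing; SQR and COS must be excluded
# or their letters would be miscounted as variable names.
inner_loop = "Q=SQR(J+I*I)*D: S=C*(COS(Q)+COS(Q+Q)+COS(5*Q)): Y=G-(S-T+F)"
keywords = {"SQR", "COS"}

tokens = re.findall(r"[A-Z][A-Z0-9]*", inner_loop)
tally = Counter(t for t in tokens if t not in keywords)

# Most-used first, using '.' for zero, Butterfield style.
init_line = "1 " + ":".join(f"{v}=." for v, _ in tally.most_common())
assert tally["Q"] == 5        # Q is referenced most in this line
assert init_line.startswith("1 Q=.")
```

Ties come out in order of first appearance here; per the tie-breaking note below, you'd want to weight those by usage elsewhere in the program.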

Here's our resulting listing: 


And the resulting runtime:


Nice!  Nearly 40 more seconds of time savings!  That 7 minute mark is so close! 

I.  Getting Wonky.

Now that we know our fastest variables (so far) we can consider whether there's some places where we can take advantage of one or more of the fastest variables as sort of pseudo scratchpad registers within the inner loop.  This will not be possible in every program and it takes some thinking (and maybe changing things in a way that might require you to tweak your tallies).     

As you can see from the last listing just above, our 'Q' variable is currently used extensively in the inner loop.  One expression sets the value of 'Q' and MUST be done separately to set an intermediate value.  The reason is that the resulting value is then used 4 times in the stack of cosine functions, and it would not be 'expression simplification' by any means to copy the expression that sets 'Q' into every other place 'Q' is later used in the loop.  

But we can speed this up a bit.  We know that after the value of 'Q' is set, its only further use within the inner loop is to be fetched (read) as part of the cosine stack.   'Q' is not further modified.   We also know that the 'S' variable gets assigned the result of that cosine stack (as modified by the '*C' operation we moved into that expression in a prior optimization).  Then, finally, 'S' is used in an addition/subtraction expression to derive and assign the final vertical pixel coordinate to variable 'Y'.   This means that 'Q' and 'S' are just temporary holding places for values on the way to getting the final 'Y' value in each iteration of the inner loop. Significantly, 'Y' is not used any earlier within the inner loop, and indeed 'Y' does not participate in any prior calculations.  'Y' is free until it is assigned near the end of the loop.  

Here's what we're going to do:   We will throw out the 'Q' and 'S' variables.   In the expression that currently assigns a value to 'Q', we will temporarily assign that value to 'Y' instead.   Then, in that stack of cosines, we will put 'Y' (with that temporary value) in place of 'Q' at all instances.  And, instead of assigning the result to 'S' and then doing one more calculation to derive the final vertical coordinate, it looks to me like we can also fold that final 'Y=G-(S-T+F)' expression right into the one with all the cosines.  We will assign the outcome of the resulting combined expression BACK into the 'Y' variable. 

We go from:

5 Q=SQR(J+I*I)*D: S=C*(COS(Q)+COS(Q+Q)+COS(5*Q)):Y=G-(S-T+F): IFY>GTHEN7

to:

5 Y=SQR(J+I*I)*D: Y=G-(C*(COS(Y)+COS(Y+Y)+COS(5*Y))-T+F): IFY>GTHEN7

Does that change and the reuse of 'Y' in such a way freak you out?  It shouldn't.  

REMEMBER!  It's OK to have the variable name on both sides of the '=' sign in a variable assignment (i.e., 'Y=Y+1').  The interpreter knows that what you mean is  "fully evaluate the expression on the right of the '=' sign using the CURRENT value of 'Y' and put the result of evaluating the expression back into the same variable, 'Y', at the end of the process."  The old/original value participates on the right side, and the assignment of the new value occurs only after the old value held by the variable is done being used. 
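If you want belt-and-suspenders assurance that the folded expression really computes the same 'Y', here's a quick numerical spot-check in Python (the constant values plugged in are arbitrary stand-ins for C, D, T, F and G, not the program's actual ones):

```python
import math

def two_step(i, j, c, d, t, f, g):
    # Original form: Q and S as separate intermediates
    q = math.sqrt(j + i * i) * d
    s = c * (math.cos(q) + math.cos(q + q) + math.cos(5 * q))
    return g - (s - t + f)

def folded(i, j, c, d, t, f, g):
    # New form: Y reused as its own scratch variable
    y = math.sqrt(j + i * i) * d
    return g - (c * (math.cos(y) + math.cos(y + y) + math.cos(5 * y)) - t + f)

# Arbitrary stand-in constants; sweep I and J over a range of values
for i in range(-144, 145, 9):
    for j in (0.0, 1234.5, 20736.0):
        a = two_step(i, j, 30.0, 0.0327, 50.0, 90.0, 199.0)
        b = folded(i, j, 30.0, 0.0327, 50.0, 90.0, 199.0)
        assert math.isclose(a, b)
```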

We eliminated 'Q' and 'S' from the entire program, so obviously we strike them out of the initialization sequence we previously put in line 1.   Redoing our inner loop variable tally, we see (unsurprisingly) that 'Y' is now the most frequently used variable, since we're treating it as something akin to a 'scratch' register on the way to deriving its final value.  So 'Y' now gets moved to the very front of the initialization sequence, making it the fastest variable.  All those times a fetch is necessary for 'Y', the interpreter will find it immediately without having to do any of the 'nope, nope, nope' stuff.  I am not sure I am explaining this very well, but I believe it will ultimately give us another time savings -- and of more than a few seconds.  (I really REALLY want to get this below 7 minutes runtime!) 

Let's 'LIST', 'SAVE' and 'RUN' to see what we accomplished.



And there we are!   We've done it.  A total time below 7 minutes.  Fairly impressive, considering we started at over 11 minutes and our goal at the start of optimization was to get down to just under 9 minutes.  And by the way, I can't emphasize this enough:   it took ever so much longer to 'write it up' than it took to actually come up with the changes made throughout this thread.

J.   Putting on a final polish!

Before wrapping this up, I'm doing a final polish, which should give just a hair more speed and really finish up the adaptation / conversion / optimization.   I'm getting tired of typing so let me briefly summarize the following listing:


 - I gave the program a final crunch. I squeezed as much into certain lines as possible, taking care to fix any line number references that needed to change.  The real goal here was to get the lines comprising our outer and inner loops down to 4 total lines instead of 5.  Even if I crunch it differently, the two conditional branches in the inner loop won't let us get the number of lines used by the loops any smaller.  Still, there should be a small time savings. 

 - I've assigned variables 'L' and 'K' in place of the '.5' and 'A*A' parts of expressions previously used in the outer loop.  This is not for speed (they're not in the inner loop), but for space savings so I could squeeze our 'X' bounds check in at the end of the crunched line 3.  

-  I've flip-flopped the branch condition and structure of the 'Y' bounds checking.  The reason is that the most common outcome of that evaluation will be "yeah, 'Y' is fine, now proceed to plotting the pixels," and it's ever so slightly faster to put the plotting commands immediately after the 'THEN' rather than (a) having program flow fall through to the pixel plot when the coordinate is OK; and (b) executing a jump 'over' the pixel plot routine in a separate line number when it's not. So instead of testing whether 'Y' is greater than 199 ('Y>G') and, if so, branching so as to skip the plotting, we now test the opposite (if 'Y<G') and, if so, plot.  Probably just a minor benefit if any, but that's what I did.  

-  Beyond printing elapsed time to the screen, I've had it display a name for our program when it's done plotting.  My college-aged daughter claimed the naming rights.  She has dubbed this routine 'The Proteus Oscillator' because it "looks like a water god is making waves..."  (She also challenged me to make a Python version.  This may present a problem, as I don't have expertise in Python.  But I'm going to give it a try in the next few weeks, I guess.  Why not?)   

-  Finally, I added a few lines with a 'blurb' at the end of the program listing.  Without even using 'REM' statements!  Since program flow never reaches these lines, it's not a problem.  Of course, there's the obligatory shout-out to the magazine that was the original source of the program.  They never credited the original author by name, so I can't either.  

OK, that's all folks.  Sorry this thread turned out so very verbose.  I have been trying to target a particular level of 'nascent enthusiast' who is just starting with X16 and playing with BASIC.  If anyone finds this of value, it will have been worth it.  

One final screenshot with the output of the program in its final form.  



Edited by Snickers11001001

  • 1 month later...

I figured the thread was due for a post-script of sorts....  call it something of a denouement to our little adventure.

(Also, I forgot to mention:  If anyone sees any typos, errors, or unclear/confusing things in any of the above posts, please message me so I can fix em).

On reflection, it seemed only right to take all those optimizations we made for our X16 conversion of this program and backport them to the original Plus/4 platform to see what they accomplish.   How much can we improve on the Plus/4's original plotting time of 2 hours and 42 minutes?!

Well, here's the listing and outcome, as reflected in a couple screen shots from 'plus4emu':



The result on the Plus/4 was actually really good ...   it went from 2 hours 42 minutes of plotting time down to 1 hour and 37 minutes!     Not too shabby. 

You might notice I added one more optimization on the Plus/4 version, which I've sort of called out in purple in the listing above.  Bonus points for anyone who can both (a) figure out what the heck I'm doing there; and (b) hazard a guess as to why the tweak provides a tremendous speed improvement on the Commodore Plus/4, but actually TAKES LONGER when added to the fastest X16 version posted in this thread above.     

Moving on, my last post mentioned that my college-aged daughter threw down the gauntlet and challenged me to try to convert this routine to Python, which is her preferred language for all her scientific / lab / analysis stuff at school.   Well, she kept pestering, and wasn't about to let me off the hook.    So I downloaded a fairly all-inclusive Python setup she suggested (Anaconda or some such, with an IDE called Spyder) and gave it a shot.   

Only problem:   I have no genuine expertise in Python.   I played with it a bit in 2006 or so, just because it was included to run some bloatware on a computer I bought.   To be sure, I can look at a Python program and know pretty much what it's doing...     but I don't 'know' Python by any stretch.  

But...  with the help of Google and the excellent reference guides from the official Python site and a few of the libraries I used,  I was ultimately able to cobble something together over the past few evenings.  The result was that I got Python to spit out what appears to me to be a pixel-for-pixel correct (comparing with the X16 and Plus/4 versions) output.  There aren't any Python-specific optimizations because I frankly would not even know where to begin.  And besides, generating the image below took only seconds on the OLD (circa 2012) Core i3 machine I used for the project anyway!  


For what it's worth, when I emailed a copy to my daughter to say "ok, I did it, quit bugging me about it," she responded by saying (of my Python code):   

"  um... that's not really how you are supposed to do Python..."

She also accused me of not having any 'class' or something!    😆  

Yes, I know she wrote 'classes' -- but I tend to dislike the 'abstraction for abstraction's sake' object-oriented mindset.   Also, this really didn't need it and, in my view, probably would not have benefitted.  And anyway,  'get the hell off my lawn!'      

For now, all I wanted was to cause Python to produce a 'proper' output.   So I'm calling this good enough. 






  • 2 weeks later...
Just now, ZeroByte said:

I'd guess that maybe you switched from using simple variables to using an array, with that array being the first variable initialized....

I've not benchmarked that, but it feels to me like that would only slow things down if you're still using it in BASIC computations. Math still has to be done to compute array index values on top of just getting at the data in the individual values.


It all depends on how much of a penalty it is for the parser to interpret the array references vs. a potentially faster (I have no idea - just a guess) means of accessing the memory once it's known that an array is being used.

An array can be referenced with a memory offset so you can jump straight to the right spot in memory, whereas the simple variables are located via a linear search, which is going to incur time loss in the various logic statements (all of the "is this it? Nope." stuff). Granted, using Y as a guaranteed one-shot hit for the most common operations is probably even better than using an array, as the array index decode is going to incur multiplication.

I'm not a BASIC guru by any stretch, so I really don't know what sort of optimization was available to nearly double the performance aside from things like embedding an assembly routine in DATA statements and POKEing those into memory at the get-go. But that really isn't in the spirit of this exercise, so I'm quite sure that's not it.

The only other idea I have is that there was a way to simplify the equations with faster math - kind of like the famous fast 1/sqrt() function in Quake3 Arena... except for BASIC. Honestly, this is more likely to be what he's done.


On 8/25/2021 at 1:34 AM, voltage said:

That's a huge speed boost.  Are you using fixed point math in place of the default rom routines?  That's what I would have attempted.

I've actually thought of something along those lines, but with a sine/cosine table in ML and a little USR() routine that coughs up a lookup.     But that's not quite on the menu (yet!).

On 8/25/2021 at 8:47 AM, Scott Robison said:

Replaced TI$ with "000347"? 😄


On 8/25/2021 at 8:48 AM, ZeroByte said:

I'd guess that maybe you switched from using simple variables to using an array, with that array being the first variable initialized....

I've tried this, and in game-type programs especially it actually helps.    Even the first array is slower than about the first 10 or 12 initialized scalar variables because of the interpreter's character fetch (parsing) overhead.   If you have a variable array "V(n)" and you refer to it in the program with V(1), it's got a bunch of extra fetches and a numeric-to-FAC conversion to do, plus some added overhead because even though BASIC finds your first array at ARYTAB, it also runs some extra tests to compare the index you supplied against the dimension of the variable as dimensioned.  Also, once there are more than 10 values 'V(0) to V(9)', there's a linear increase in time as you have additional digits of the array index value to parse, i.e., 'V(5)' is faster to parse than 'V(27)'.  Alternatively, if you use a regular simple/scalar variable to tell BASIC which element you want, e.g., N=A(V), you now incur an extra cost for the 'fetch' of the variable that contains the element number.  And it's maddening to program with!   But it does help create a rather more consistent variable fetch time than just using 24 or 30 scalar variables. 

But that's not what I did here.

1.  I did not use any assembly/machine code routines.   There are no SYS calls, no USR() calls, no pokes and peeks.     

2.  I did not add data statements.   

3.  The program does not load anything from the file system.   No external data is loaded from any source. 

4.   The previously fastest program version (the one that clocked in at 6 minutes, 52 seconds above) was 709 bytes long.  This version is 878 bytes long.  42 of the new bytes are a print statement and string;  and 23 more bytes are fluff to display a 'progress bar'...  

5.   It's not using turbo mode or any modification of the emulator, ROMs, host operating system or anything like that.  

I'll write it up tonight when I get back from some medical stuff today, but it comes down to looking hard at the precursor calculation that gets fed into that COSINES stack and which is based on 'Pythagorean Addition' of our outer and inner loop indexing variable values. 

Bottom line: the program spends about 1 minute and 39 seconds preparing a lookup table as a two-dimensional array, and the rest of the time plotting to the screen.   

More this evening.   Cheers!   


Edited by Snickers11001001

35 minutes ago, ZeroByte said:

It all depends on how much of a penalty it is for the parser to interpret the array references vs. a potentially faster (I have no idea - just a guess) means of accessing the memory once it's known that an array is being used.

An array can be referenced with a memory offset so you can jump straight to the right spot in memory, whereas the simple variables are located via a linear search, which is going to incur time loss in the various logic statements (all of the "is this it? Nope." stuff). Granted, using Y as a guaranteed one-shot hit for the most common operations is probably even better than using an array, as the array index decode is going to incur multiplication.

I'm not a BASIC guru by any stretch, so I really don't know what sort of optimization was available to nearly double the performance aside from things like embedding an assembly routine in DATA statements and POKEing those into memory at the get-go. But that really isn't in the spirit of this exercise, so I'm quite sure that's not it.

The only other idea I have is that there was a way to simplify the equations with faster math - kind of like the famous fast 1/sqrt() function in Quake3 Arena... except for BASIC. Honestly, this is more likely to be what he's done.

In Microsoft BASIC for 6502, as used in Commodore legacy machines, a variable reference is stored as its literal name, so an array of integers named IA% is stored as the three characters IA%.

To access a variable at an index, the reference is stored in its fully expanded form: IA%(index) (where index can be a literal or another variable).

So to access a scalar variable, you have to read the name and look for it in the variable table.

To access an element of an array, you have to read the name and look for it in the variable table. Then you need to get the size of each dimension of the array. Then you have to read the index expressions from the variable use. Then you have to do math to multiply and add each index (if a multi-dimensional array) to find the final address relative to the start of the array in memory.

If your index is a constant, BASIC has to interpret it at runtime (convert the string "10" to the number 10, for example). Or you can use a variable, which means looking up another variable in the vartab.

In any case: the multiple levels of indirection will make a subscripted array access slower than a single scalar variable access. When you really need an array, that's great, but a scalar is faster when you don't. The screenshot below shows 10 seconds accessing array indexes in a loop versus 3 seconds for scalars. I could replace 42 & 47 with variables to make them faster, but then they'd still require the lookup of an extra variable each.
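To put some arithmetic behind "multiply and add each index": here's a generic Python sketch of the offset computation an interpreter has to do on every subscripted access. (It uses row-major accumulation for clarity; I haven't verified which order CBM BASIC uses internally, so take the layout as illustrative of the per-access cost, not the exact memory map.)

```python
def element_offset(indices, dims, elem_size):
    # Horner-style accumulation: for each subscript, scale the running
    # offset by that dimension's size and add the index -- one multiply
    # and one add (plus a bounds check) per dimension, on every access.
    offset = 0
    for idx, dim in zip(indices, dims):
        assert 0 <= idx < dim, "bad subscript"
        offset = offset * dim + idx
    return offset * elem_size

# Element (3,4) of a 10x10 array of 2-byte integers:
assert element_offset((3, 4), (10, 10), 2) == 68
# A scalar, by contrast, needs none of this once its record is found.
```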



Here is a better benchmark (I think). First the program:


And the results:


The first number is just how much time it takes to do a for next loop 1000 times without anything in the body. Used to subtract out "overhead"

The second number is how long it takes to interpret two constants, use them to lookup locations by index in an array, add the two locations, and assign to x (1000 times).

The third number is how long it takes to lookup locations by variable index in an array, add the two locations, and assign to x (1000 times).

The fourth is how long it takes to interpret two constants, add them together, and assign to x (1000 times).

The fifth is how long it takes to add two variables and assign to x (1000 times).

Working backward, we know it takes 156 jiffies to look up three variables (b, c, and x), read two of them, add them, and assign one. Just for simplicity, let's call that 7 operations per iteration, 7000 operations total. That gives us an average of 371 microseconds per operation.

The previous step (adding constants) took 170 jiffies longer to do 2000 conversions from digits to floating point, which it did in lieu of 2000 variable lookups. This will vary based on the size of the number, but we'll just call it 1416 extra microseconds to convert vs. lookup.

Before that, it took 248 jiffies longer to do math with array lookups vs. simple variables. Two extra array-index lookups per loop iteration equates to 2066 microseconds per array lookup. This makes sense on 6502 because each array lookup involves a multiplication plus an extra variable lookup.

The worst case is the first one, which takes 168 jiffies longer to do 2000 lookups with constant index values.

The first case takes almost the same amount of extra time over the second as the third does over the fourth, so converting constants to floating point is consistent.
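For anyone wanting to double-check the arithmetic above, here's the jiffy-to-microsecond conversion worked out in Python (a jiffy being 1/60 second on Commodore machines; integer truncation matches the figures quoted):

```python
US_PER_JIFFY = 1_000_000 / 60   # one jiffy = 1/60 second

# 156 jiffies spread over 1000 iterations * 7 operations:
per_op = 156 * US_PER_JIFFY / 7000
# 170 extra jiffies for 2000 constant-to-float conversions:
per_conv = 170 * US_PER_JIFFY / 2000
# 248 extra jiffies for 2000 extra array-index lookups:
per_index = 248 * US_PER_JIFFY / 2000

assert int(per_op) == 371      # ~371 us per operation
assert int(per_conv) == 1416   # ~1416 us per conversion
assert int(per_index) == 2066  # ~2066 us per array lookup
```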

Edited by Scott Robison

Well, yesterday was longer than I expected.   Sheesh.    I don't like medical stuff, but it's especially annoying when you have a thing that gets bumped several hours at a time because the docs are all in the ER working on a motor vehicle accident.   Oh, and "no, you still cannot eat or drink anything but water, sir."    Geez, just cancel my thing and let's reschedule, but don't keep me in there extra hours just waiting.    Ah, well, I guess it's medicine in a medium-sized city in 'current year' USA.   I'm very sorry I didn't get back until late, and then had errands and other things today, so I had to delay writing this up.   

Before I start...  Scott:   Awesome work on some benchmarking for variable cost/overhead.  One possible addition would be to add a line (before everything else) where you initialize a bunch of dummy scalar variables (DA,DB,DC,DD,DE,DF,DG,DH, etc.).   You can use 'DIM' to initialize regular/scalar variables -- just call the command with the variables separated by commas, e.g., 'DIM DA,DB,DC,DD,DE' -- which should let you fit 20+ dummy variables in one line.     You can REM it out to duplicate the results you already have, and then un-REM the line and see how scalar variable performance changes when the ones in your benchmarking routines effectively become later-initialized scalars and have to do the 'nope, nope' stuff in view of all the others earlier in the pecking order....  

Anyway, now to the thought process and the optimization that got the time of the 'Proteus' demo down to 3 minutes and 47 seconds.... 

As I said above, the key was looking at the "precursor" expression that creates the value that goes into the COS() functions in the cosine stack (with the result of all that, in turn, getting tweaked into the 'y' pixel coordinate).   

EDITED:   Yes, I know what I call the 'precursor' is the 'angle' that gets fed to the cosine functions.   But it's not an angle, not really, because if the original author wanted to convert degrees to radians (s)he ought to have multiplied it by pi/180 (approx. 0.01745), and the coefficient used was 0.0327, which is not an even multiple of that or anything.   It's just a magic number the original author fiddled with until the output matched what was wanted.   So call it an angle if you wish, but it's just the 'precursor' to me!     At any rate...

In the "0652" time version of the demo above, that precursor was at the beginning of line 4:


We obfuscated it a bit in prior optimizations by kicking part of the calculation up into the outer loop (J=O*O),  but in essence the expression generates the square-root of the sum of two squares and then multiplies the result by a magic number the original author came up with to get everything to work on the screen the way (s)he wanted.   

The things that are squared are the values of our two loop indexing variables (outer 'O' and inner 'I').    

My brainwave was to look at this and think about a basic mathematical identity:     N squared is equal to  (-N) squared.   A positive times a positive is a positive; a negative times a negative is also a positive.... 

This is helpful because each of the main loop indexing variables swings between negative and positive.    The outer loop runs from -144 to 144 with a step of 2.25, so there are 129 different values of 'O' as it iterates.   Digging deeper, that step increment is an even divisor into 144, which means that the indexing variable will have 64 negative values, the 0 value, and 64 positive values that are the SAME (save for the change in sign) as the 64 negative values.    But that means the outer loop variable effectively makes only 65 possible different contributions to that square-root-of-the-sum-of-two-squares expression.   Whether 'O' is 144 or -144, -141.75 or 141.75, the value of 'O' participates in the 'precursor' expression only by being multiplied by itself.   The sign of 'O' is effectively rendered immaterial for THAT part of the calculations in this program.   

Likewise, the inner loop runs from -144 to some varied positive number calculated based on the value of the outer loop indexing variable, with an increment step of 1.  The way the endpoint is calculated, it cannot be higher than +144.    Once again, that means that (keeping in mind the loop also swings through the 0 value while iterating), although the inner loop runs up to 288 iterations, the inner loop indexing variable can only supply 145 different contributions to that 'square root of the sum of two squares' expression...   The sign of the value of 'I' is immaterial to the precursor-to-cosines stack combo, since (-I) * (-I) is the same as I * I.    

Synthesizing all this, we start to see something:   Our inner loop has been calculating the extremely expensive combination of what I've been calling the 'precursor' and that triple cosine stack (and a couple of multiplication operations into the bargain) to get the 'y' coordinate just under 32,000 times over the course of running the program -- i.e., whenever the more cheaply calculated 'x' coordinate lands within the valid coordinates of the bitmap screen.   HOWEVER, our examination above has just demonstrated that the calculations in the "evaluate precursor, feed it to the cosine stack" sequence are always performed on what can effectively be treated as a maximum of 9425 (65 x 145) combinations of inputs.   

It means we might be able to make a lookup table!    A 65x145 table.   
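To make the symmetry argument concrete, here is a Python sketch of building such a table. This is a reconstruction, not the author's BASIC: the constants (A=144, B=2.25, D=.0327, C=20) are taken from the listing later in this thread, and the loop bounds follow the 65 x 145 reasoning above.

```python
import math

# A reconstruction in Python, NOT the author's BASIC. Constants come from
# the listing later in this thread: A=144 (loop bound), B=2.25 (outer step),
# D=.0327 (angle scale), C=20 (amplitude).
A, B, D, C = 144, 2.25, 0.0327, 20

# The outer index 'O' takes 129 values (-144..144 step 2.25) but only 65
# distinct magnitudes; the inner index 'I' contributes only 145 distinct
# magnitudes (0..144). So the expensive expression need only be evaluated
# 65 * 145 = 9425 times.
table = [[0.0] * 145 for _ in range(65)]
for t in range(65):              # t = ABS(O)/B, i.e. O = t*B
    o = t * B
    for i in range(145):         # i = ABS(I)
        x = math.sqrt(o * o + i * i) * D        # the 'precursor'
        table[t][i] = C * (math.cos(x) + math.cos(x + x) + math.cos(5 * x))
```

At plot time the signed loop values are mapped back to these magnitudes, so the table is consulted up to 32,000 times while being computed only 9425 times.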

But whether we can implement such a table and what we put in it is limited by memory constraints.  

For example, if it turns out that we MUST store floating point values for this to work, we cannot use a regular array and this idea probably dies.  A float array takes 5 bytes of memory per entry.   So, um, yikes!  That's 9425 * 5 = 47,125 bytes, which is way WAY more than the 39K of memory available to BASIC.  I suppose we could make an artificial fixed-point array using POKEs/PEEKs, storing two bytes in memory for each value: one for the integer portion and one for the fractional portion.  But frankly, it seems to me that using BASIC to calculate, store and fetch both pieces might be too costly, since all our 'overhead' would have to be done in BASIC.    (It gives me an idea for a 'USR()' routine, however, for some future project.)

I didn't bother trying that, because it turns out we can store our 9425 values in an integer array. In C64-style BASIC, an integer variable holds values in the range -32768 to 32767 and takes only 2 bytes of memory per entry.   That's less than 20K of memory cost here.   We can do that.
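The memory arithmetic behind those two options, checked explicitly:

```python
entries = 65 * 145        # 9425 table entries
float_cost = entries * 5  # a float array costs 5 bytes per entry
int_cost = entries * 2    # an integer array costs 2 bytes per entry
basic_free = 39 * 1024    # roughly the 39K available to BASIC

print(float_cost)         # 47125 -- blows past the available memory
print(int_cost)           # 18850 -- under 20K, fits with room to spare
```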

But...  Does it actually help us to only be able to store integers?   To see about this,  let's reason it out.   Looking at the plotting sequence as a whole, it is apparent that the 'y' pixel coordinate is calculated and left (as one of my updates above mentions) as a float and fed to the graphics commands as-is, and the commands simply disregard the fractional portion.   

So can we throw out the fractional portion earlier?  

Let's see.   The final 'y' coordinate for a pixel location is evaluated as:

Y=G - ({n} - T + F)

{n} is the result of that 'precursor' and cosines stack.  'F' is a constant that acts like the "vertical position" knob on an old analog TV -- changing it moves the entire image up or down within the bitmap.   So far, I've kept it at the original author's suggested value of 90.   

'T' is the outer loop index value 'O' divided by the step increment 'B' (constant of 2.25).   Which is to say T ranges from -64 to +64.   OH! LOOK!   Now we see how, despite the sign of the outer loop indexing variable being discarded through calculation of the precursor expression, the sign (pos/neg) of 'O' nevertheless DOES matter and plays into the ultimate 'y' coordinate pixel generation, because 'T=O/B' gives a different result in terms of sign depending on whether 'O' is positive or negative.   Nice.

That also tells us that what I refer to as {n} above (the result of the precursor and cosines stack) is the thing we should store in our table, if possible.   

OK, so reasoning further:   The result from the precursor/cosines stack (including the multiplication by the original author's constant 'C') is a float.   It is a small float that we can be sure will not exceed BASIC's valid integer variable range of -32768 to 32767.   We know this because we can see that after the tweaks of subtracting 'T' (a value in the range -64 to 64) and adding 'F' (const. of 90), and then taking the result and subtracting it from 'G' (const. 199), it will, most of the time, yield a valid vertical pixel coordinate between 0 and 199.

That's great news.   It means we can throw that {n} into our integer table.   BUT.   Doing so discards the fractional portion earlier than normal.    Here's what that does and what we need to do to deal with it.   Suppose that prior to using our table (i.e., in the earlier program versions) some {n} value, after getting tweaked by 'T' and 'F', winds up yielding a result of 19.378.    That number then got subtracted from 199 to give a 'y' (vertical) pixel coordinate of 179.622.   Feeding that to our graphics commands resulted in the fractional component being disregarded, meaning a pixel would be plotted at vertical position 179.

But, now suppose we stick the result of {n} into an integer array variable.   It loses its fractional component right at that moment.    Which means that same value after adjustment by 'T' and 'F' winds up returning  '19' instead of 19.378.   And subtracting 19 from 199 yields a vertical pixel coordinate of 180.    It's off-by-one.   That's the only problem.   We need to subtract an additional '1' from 199 to get the 'correct' pixel coordinate due to the 'premature' removal of the fractional component when we use the table. 

Remember that 'F' constant I just mentioned (the one that's like a 'vertical position' setting)?   We will change it from 90 to 91, and that will do the trick. 
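To see the off-by-one in action, here's a small Python sketch. Here math.floor stands in for BASIC's integer-variable assignment (which floors via the same routine as INT), and the {n} value of -70.622 is a made-up example chosen so the intermediate result is the 19.378 discussed above.

```python
import math

G = 199       # bottom screen row
T = 0         # outer-loop derived offset (zero here for simplicity)
F_OLD, F_NEW = 90, 91

n = -70.622   # hypothetical {n} from the precursor/cosines stack

# Original program: the fraction survives until the plot command drops it.
y = G - (n - T + F_OLD)                   # about 179.622
plotted_before = math.floor(y)            # pixel row 179

# Integer table: the fraction is lost the moment {n} is stored in Q%().
n_int = math.floor(n)                     # -71
plotted_naive = G - (n_int - T + F_OLD)   # 180 -- off by one
plotted_fixed = G - (n_int - T + F_NEW)   # 179 -- matches again
```

Because the early truncation always shifts the result the same direction by less than one row, bumping 'F' by one compensates everywhere at once.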

Well, that's it.           

That's how it works.  I'll put up a screen shot of the listing tomorrow.   Also,  I'm going to put up a combined 'final' program in the 'demos' section, probably with a little menu that lets you pick between three versions of this sucker.  I'll try to include one of the very early versions from this thread (if I can find one I saved), the 1.0 version that was the best performer prior to this breakthrough, and then this version '2.0' update.   Having all three in one program will hopefully help folks make a close study of the listing and see how things evolved. 

[EDITED:   Ok, the 3-in-1 demo is in 'demos' in the downloads section]

One final word:     This latest optimization really improved performance, but only by eating literally HALF of all the memory available to BASIC.   That's often a thing... sometimes you can gain a lot of speed by throwing memory at a problem.   And for a free-standing demo like this, hey... whatever works.   Many famous C64 demo scene machine language programs pre-calculate all sorts of stuff using every possible available byte of memory.    But if this were just a small part of another program, it would clearly be a bridge too far.     That said, given what the values in the table look like, I have an idea for an alternative implementation that might be able to get away using only 10K of memory...  but there's no way it would be this fast.        



ETA:     Here's that listing....     




Edited by Snickers11001001

9 minutes ago, desertfish said:

Can you paste it in ascii here as well so i can run it easily myself? Thanks!

Here's the newest version 2.0 with the precalc.   Let me know if you want a copy/paste of what I stuck in the downloads -- i.e., the three-in-one combo with the slow, fast, and fastest versions.  Cheers.

1 X=.: Y=.:I=.:G=199:T=.:U=.:F=91:J=.:D=.0327:C=20:O=.:A%=.:A=144:B=2.25
2 E=160:K=A*A:L=.5:DIMQ%(64,A):SCREEN$80:TI$="000000":GOSUB8
6 PRINT"\X97\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11 TIME:"TI$"\X13\X11\X11"TAB(23)"THE PROTEUS DEMO"
9 FORI=.TOA:X=SQR(Y+I*I)*J:Q%(T,I)=U*(COS(X)+COS(X+X)+COS(5*X)):NEXT:X=4*T+27
10 RECTX,C,X+2,27,11:NEXT:SCREEN $80:      RETURN


8 hours ago, Snickers11001001 said:

Before I start..  Scott:   Awesome work on some benchmarking for variable cost/overhead.  One possible addition would be to add a line (before everything else) where you initialize a bunch of dummy scalar variables (DA,DB,DC,DD,DE,DF,DG,DH, etc).   You can use 'DIM' to initialize regular/scalar variables: just call the command with the variables separated by commas, e.g., 'DIM DA,DB,DC,DD,DE', which should let you fit 20+ dummy variables in one line.     You can REM it out to duplicate the results you already have, and then un-REM the line and see how scalar variable performance changes when the ones in your benchmarking routines effectively become later-initialized scalars and have to do the 'nope, nope' stuff in view of all the others earlier in the pecking order....

Thanks. The numbers from my benchmarking are approximate at best and could stand to do a lot more teasing out of details. The reality is that things like variable lookups are going to vary a lot depending on how many variables exist (as you allude to) since they have to be scanned for sequentially (earlier defs are faster than later defs).

I think the hard part of using DIM to try to measure definition overhead is going to be one of not being able to loop it easily a number of times. With FOR NEXT loops I can easily embed something in the middle to time it, then factor out the amount of overhead time from the FOR NEXT itself to come up with estimates. Once a variable is defined with DIM it can't be redefined, unless CLR is used. But at the point CLR is used I think it would destroy the context of the FOR NEXT loop. {time passes} Yes, that is exactly what happens.

Now, I don't have to use a FOR NEXT loop; I could use PEEK & POKE & GOTO. But I think (and I don't have time to test it right now) that FOR NEXT bookkeeping is able to quickly locate the beginning of the loop again, as evidenced by the fact that I can make the FOR statement the second part of a line:

10 TI$="":FOR I=1TO1000

To try to replace that with a GOTO means that any reverse GOTO has to begin searching from the first line of the program trying to find the intended destination, which means GOTO a higher line number takes more time than a lower line number. I fear the measurement of time (which is already less than perfect) becomes even less reliable at that point as too much overhead would be a part of the measurement.

Of course, this could all be addressed by writing simple little individual programs to measure each portion rather than one monolithic program that does it all. It would probably be better and more reliable. I might do that at some point if someone else doesn't do it before I can get around to it.

Edit: I didn't read carefully enough the first time. You were explicitly addressing the very topic I thought you alluded to. Sorry.

Edited by Scott Robison


I was browsing trigonometry tricks sites.    This was actually research for another project.  But in the course of reading through all that stuff, I learned there's a thing called the cosine double-angle identity.   And its possible application here was immediately apparent.

The identity provides that for any angle 'n', the value of cosine(2*n) is equivalent to the value of (2*cosine(n)^2)-1.

That's sort of interesting because the cosine stack in my demo has both COS(n) and COS(n+n).    

To implement an optimization based on this, the result of the expression I've called the precursor has to be prepared (call it 'n').   Then you get COS(n) and store it, let's say to 'X' [i.e., X=COS(n)].   Then the cosine stack becomes (X+2*X*X-1+COS(5*n)).    Now there are only 2 COS() operations per cosine stack evaluation, instead of three.

But the question is:   Is a single cosine operation in that stack so slow that it would potentially save time to replace it with a more convoluted expression that requires an extra variable store, two extra multiplications and several extra variable fetches?   Could all that additional work nevertheless be faster?  

Apparently, yes. 

Implementing the double-angle identity to get rid of one of the COS() calls actually did cut the 'precomputation' part of the revised Proteus Demo down from 1 minute and 39 seconds to 1 minute and 25 seconds, which brought overall time down to 3:32 from 3:47.     Despite increasing the number of 'BASIC' operations, the total cycles for all those added operations still wound up being less than the cycles incurred from a single COS() operation.   

So, sort of interesting, eh? 






