Jeffrey

New demo uploaded: STNICCC Commander X16 Demo Remake


STNICCC Commander X16 Demo Remake


This is the release of a STNICCC Demo Remake for the Commander X16!

I have been (silently) working on this for the last couple of weeks/months. It is time to release it :). Let's just say: the Commander X16 is far more powerful than I had thought! 😉

Here is a video of it running:

Enjoy!

Regards,

Jeffrey

 

---

 

PS. There was an earlier attempt to remake this demo on the X16 (done by Oziphanto on YouTube). Oziphanto did a very nice comparison video of the X16 with several other machines of the 8-bit and 16-bit era:

He also re-created this demo, but (in my opinion) did not do such a good job extracting everything out of the X16: his demo ran in 2:32. The remake I made does it in 1:39! 🙂 🙂 

His benchmark comparison should therefore be updated:

lap_times.png

Keep in mind the Commander X16 only has:
    - An 8-bit 6502 CPU (8 MHz)
    - No DMA
    - No Blitter

Yet it keeps up with 16-bit machines like the Amiga! (Actually, it's even faster right now.)

---

Extra notes:

- This only works on the X16 emulator with 2 MB of RAM
- It uses the original data (but it's split into 8 KB blocks, so it can fit into banked RAM)
- Waaaayyy too much time was spent on the core loop to make it perform *this* fast!
- My estimate is that it can be improved by another 10-15 seconds (I have a design ready, but it requires a rewrite of the core loop)
- It uses a "stream" of audio-file data and produces 24 kHz mono sound (this will not work on the real X16, since loading the files that fast is a feature of the emulator only)
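
To illustrate the note about splitting the data into 8 KB blocks, here is a minimal Python sketch of how such a split could be done offline. The function name and the zero-filled placeholder data are hypothetical; the 640 KB size is the polygon-data figure quoted later in this thread, and this is not the actual tooling used for the demo.

```python
BANK_SIZE = 8 * 1024  # one banked-RAM window on the X16 is 8 KB


def split_into_banks(data, bank_size=BANK_SIZE):
    """Cut a byte string into consecutive blocks of at most bank_size bytes."""
    return [data[i:i + bank_size] for i in range(0, len(data), bank_size)]


# Placeholder standing in for the original scene data (assumed size: 640 KB).
scene = bytes(640 * 1024)
banks = split_into_banks(scene)
print(len(banks))  # 80 blocks, one per RAM bank
```

Each block would then be loaded into its own RAM bank, which is why the demo needs far more banked RAM than a minimal configuration provides.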

Here is a version without audio (so this should run on a real x16):

And it runs even faster (1:36.90) 😉


 


Very nice!

I have a few questions:

Which graphics mode do you use? 16c or 256c? The 16c mode requires less memory and bandwidth, but having 2 pixels packed per byte is a bit of a PITA to handle.

I guess the VERA auto-increment feature and the chunky pixels help in outperforming the Atari bitplane mode?


I got it to run on my computer with dual core x86-64 and Windows 10 using: 

..\x16emu_win-r38\x16emu.exe -ram 2048 -run -prg STNICCC.PRG

 

Unfortunately my computer is not that fast, so it only runs at 50% speed.

 

It said the time was 1:39.96, but it was longer in reality.

28 minutes ago, DrTypo said:

Very nice!

I have a few questions:

Which graphics mode do you use? 16c or 256c? The 16c mode requires less memory and bandwidth, but having 2 pixels packed per byte is a bit of a PITA to handle.

I guess the VERA auto-increment feature and the chunky pixels help in outperforming the Atari bitplane mode?

This version uses the 16-color mode (4 bits per pixel). The 8-bit (256-color) version is a little slower, but not by much: the cost of packing 2 pixels per byte is quite high, and in 256c mode you can "re-use" colors (by "rotating" the palette) so you don't have to clear the screen that much.

Attached is my (technical) design for the version I just released. It probably requires extra explanation, but it might give you an idea about the structure of the polygon drawing routine.

BTW, as Oziphanto explained, this demo (both the original and this remake) draws 2D polygons. It's not doing any real 3D math. So the demo shows how fast you can draw on a machine, not how fast you can do 3D math.

stniccc x16-Inner loop v1.png

6 minutes ago, mobluse said:

I got it to run on my computer with dual core x86-64 and Windows 10 using: 

..\x16emu_win-r38\x16emu.exe -ram 2048 -run -prg STNICCC.PRG

 

Unfortunately my computer is not that fast, so it only runs at 50% speed.

 

It said the time was 1:39.96, but it was longer in reality.

Yeah. That time was recorded by hand and hardcoded in the demo 😉

It might help if you run it again: the loading of the audio files can slow it down too. If you run it again, the files are probably in cache.


When do you start measuring the time? Is it from the start of the program or, e.g., from when OXYGENE is shown?

 

It is a bit faster on second run: about 68% of full speed on my computer.


When not streaming audio, is the streaming of the polygon draw data realistic to run this fast on actual hardware?

Amazing results btw.

1 hour ago, desertfish said:

When not streaming audio, is the streaming of the polygon draw data realistic to run this fast on actual hardware?

Amazing results btw.

Yes. When audio is turned off, there is no "streaming" going on during the drawing of the polygons. A time of 1:36.90 is achievable on real hardware. In fact (if I can find the time to implement the newer version) I believe a time of 1:30 (and probably 1:25) is possible. That's 20 fps!
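
As a quick check of the "20 fps" figure, here is a small Python sketch that converts the run times mentioned in this thread into average frame rates. It assumes the scene data contains 1800 frames; that count is not stated in the thread, but it is consistent with 1:30 working out to 20 fps.

```python
FRAME_COUNT = 1800  # assumption: total frames in the scene data


def average_fps(minutes, seconds, frames=FRAME_COUNT):
    """Average frames per second for a run lasting minutes:seconds."""
    return frames / (minutes * 60 + seconds)


runs = {
    "earlier attempt (2:32)": (2, 32),
    "this release with audio (1:39)": (1, 39),
    "without audio (1:36.90)": (1, 36.9),
    "target (1:30)": (1, 30),
}
for label, (m, s) in runs.items():
    print(f"{label}: {average_fps(m, s):.1f} fps")
```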

In essence (when audio is off): the scene files are loaded at the beginning (into banked RAM). The loading right now still uses the kernal LOAD function and loads from the host (not from the simulated SD card). So that part is faster than on real hardware. This loader can, however, be replaced by an SD loader when you put it in a real X16. I didn't bother to do this (yet). But that's just initial loading time.

The playback is started when all polygon data (640 KB) is loaded into RAM. After that there is no loading going on. Also, I do not "touch" or "prep" the data before starting the playback (that's in the spirit of the competition).

More info about the original scene files and competition can be found here: http://arsantica-online.com/st-niccc-competition/

It's actually pretty crazy how much work the 6502 can do when you really keep improving your design: my first version was around 4 minutes ;). Now it does it so much faster. I can probably make an (instructive) video about what process I went through. 😉

Specifics:

The auto-incrementer from VERA helps quite a lot to speed up the process: it takes only 2 cycles per pixel (= 1 "STA VERA_data0" per 2 pixels) to blit a horizontal line to the buffer. This is where the X16 is faster than other platforms.
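
To make the "2 cycles per pixel" arithmetic concrete, here is a toy Python model of a data port with auto-increment. The class and function names are hypothetical, it does not reproduce the real VERA register interface, and it only counts the store instructions (4 cycles for a 6502 absolute-addressed STA), ignoring loop and setup overhead.

```python
class VeraPort:
    """Toy model of one data port with auto-increment (not the real registers)."""

    def __init__(self, vram_size=128 * 1024, increment=1):
        self.vram = bytearray(vram_size)
        self.addr = 0
        self.increment = increment

    def store(self, value):
        # Models one "STA VERA_data0": write a byte, then the address advances by itself.
        self.vram[self.addr] = value & 0xFF
        self.addr = (self.addr + self.increment) % len(self.vram)


STA_ABSOLUTE_CYCLES = 4  # 6502 timing for STA with absolute addressing


def fill_span(port, start_addr, byte_count, color):
    """Fill byte_count bytes (two 4-bit pixels each) and return the store cycles spent."""
    port.addr = start_addr
    packed = ((color & 0x0F) << 4) | (color & 0x0F)  # same color in both nibbles
    for _ in range(byte_count):
        port.store(packed)
    return byte_count * STA_ABSOLUTE_CYCLES


port = VeraPort()
cycles = fill_span(port, start_addr=0, byte_count=50, color=0x3)
pixels = 50 * 2
print(cycles / pixels)  # 2.0 cycles per pixel
```

Because the port advances its own address, the CPU never has to update a pointer between stores, which is where the saving over a plain memory-mapped frame buffer comes from.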

Of course, packing 2 pixels into 1 byte (and unsetting/setting the incrementer in the meantime) is slower than on other platforms (the setup for VERA takes quite some time). This is why (on the X16) there isn't that much time difference between 8-bit pixels and 4-bit pixels.

As a side note: I "shrink/crop" the screen to 256x200 pixels. All polygon coordinates (x and y) are between 0 and 255 and fit nicely in a byte. This suits an 8-bit CPU very well. But VERA still uses a 320-pixel-wide screen buffer (even if you only see 256 pixels horizontally), so determining the VRAM address for a given x and y is not very "elegant". Lots of work is done to mitigate the problems that arose from that. In this version I have several lookup tables.
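
A rough Python sketch of the kind of lookup table that helps here, assuming a 4-bits-per-pixel buffer that starts at VRAM offset 0 (so one 320-pixel row is 160 bytes). The table layout and names are illustrative guesses, not the tables used in the demo.

```python
BYTES_PER_ROW = 320 // 2  # 320-pixel-wide buffer at 4 bits per pixel
VISIBLE_ROWS = 200

# Row start offsets split into low/high bytes, the way an 8-bit CPU would store them.
ROW_LO = [(y * BYTES_PER_ROW) & 0xFF for y in range(VISIBLE_ROWS)]
ROW_HI = [(y * BYTES_PER_ROW) >> 8 for y in range(VISIBLE_ROWS)]


def vram_offset(x, y):
    """Return (byte offset, is_second_pixel_in_byte) for 0 <= x <= 255, 0 <= y <= 199."""
    row_start = (ROW_HI[y] << 8) | ROW_LO[y]  # table lookup replaces a multiply by 160
    return row_start + (x >> 1), bool(x & 1)


print(vram_offset(0, 0))      # (0, False)
print(vram_offset(255, 199))  # (31967, True)
```

The multiplication by 160 is the awkward part on a 6502; with a true 256-pixel-wide buffer at 4 bits per pixel, each row would be 128 bytes and the row offset could be built with shifts alone.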

Below is my new design of the core loop, btw. It's really nuts. It requires many variants of (slightly) different code. It has very intricate jump tables (with 64k entries!). It switches banks constantly. It uses two ports of VERA (in two different ways), etc. But it should be quite a lot faster! It uses everything the X16 has got, wherever it helps.

stniccc x16-Inner loop v2 (advanced).png

Edit: the forum degrades the diagram picture for me. I don't understand why it does that.


On my computer (core i7 8700k), on the second execution it does take 1:39 from OXYGENE to the end.

On the first execution there are slowdowns.

Yes it would have been nice to have an actual 256 pixels wide screen mode on VERA.

2 hours ago, mobluse said:

When do you start measuring the time? Is it from the start of the program or, e.g., from when OXYGENE is shown?

 

It is a bit faster on second run: about 68% of full speed on my computer.

I start measuring time when the 3D part starts (the word "Oxygene" in yellow).

Apparently this really pushes the limits of even the emulator: it has to work quite hard ;).


I'm really impressed by this, just wow 😲! A Wipeout style game could be in reach with the cx16. That's so impressive.

36 minutes ago, AndyMt said:

A Wipeout style game could be in reach with the cx16. That's so impressive.

Hmmmm, good point. And it's good the X16 has a sound chip that's not behind the VERA ports, so you can have sound without adding a lot of overhead moving the data ports back to the PSG registers. Plus, Atari arcade games used FM sound anyway, so it'd be pretty period-accurate to boot.

 


I understood that there are no 3D calculations involved. But for Wipeout those might not be needed if we accept some limitations. The course would be pre-calculated (as in the demo) and the perspective would be fixed to a virtual observer following the player in the middle of the course.

Rendering the player and opponent pods would be tricky... maybe sprites (ugly)?

Or pre-calculated polygons for each object for let's say 10 different distances (more cpu cycles)?


I'd say that using sprites isn't as big a problem if they're drawn in such a way that fits stylistically.

7 hours ago, Jeffrey said:

I start measuring time when the 3D part starts (the word "Oxygene" in yellow).

I was looking at audio.asm and noticed that your IRQ handler uses PHA, TXA, PHA, TYA, PHA rather than the direct PHA, PHX, PHY. Not that 4 cycles per frame is going to make a huge difference, but I think the initial portion of the ROM handler has already done the PHA, PHX, PHY before doing the JMP ($0314), so you can get away without even doing that. The Kernal pops them back as well, so you don't need to spend the cycles saving the CPU registers at all, unless I'm missing something. Also, if not all of the Kernal's per-VBLANK routines are needed, you could just JSR the ones you need in your own handler and not JMP to the Kernal's handler at all (for instance, you could skip the keyboard/joystick polling to win back a decent number of clock cycles).

Not sure how these savings would stack up in the grand scheme, but it might end up being several frames of time over the course of the entire demo run.

Edit: Although, I don't know if giving the main program a few hundred extra cycles would make any difference. I just thought I'd mention some potential "free" savings by disabling kernal routines that you don't need in case it helps. Awesome job, man!
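
For reference, the "4 cycles" figure can be checked with a short Python tally of the two register-save sequences, using the standard 65C02 timings for these instructions (PHA/PHX/PHY take 3 cycles each, TXA/TYA take 2). This only counts the push side; the matching restore sequence is not included.

```python
# 65C02 cycle counts for the instructions involved
CYCLES = {"PHA": 3, "PHX": 3, "PHY": 3, "TXA": 2, "TYA": 2}


def total_cycles(sequence):
    """Sum the cycle cost of a list of instruction mnemonics."""
    return sum(CYCLES[op] for op in sequence)


classic_save = ["PHA", "TXA", "PHA", "TYA", "PHA"]  # NMOS 6502 style
direct_save = ["PHA", "PHX", "PHY"]                 # 65C02 style

saving = total_cycles(classic_save) - total_cycles(direct_save)
print(total_cycles(classic_save), total_cycles(direct_save), saving)  # 13 9 4
```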


That is amazing!


On 5/21/2021 at 8:13 AM, DrTypo said:

On my computer (core i7 8700k), on the second execution it does take 1:39 from OXYGENE to the end.

On the first execution there are slowdowns.

Yes it would have been nice to have an actual 256 pixels wide screen mode on VERA.

There can be a sort of 256-pixel-wide mode (255 pixels, actually) if HSCALE is set to $33. Just use 8x8 tiles and set the tile map to 32 tiles wide. Admittedly this isn't a bitmap mode and would take some fiddling, but it can be done.
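
The 255-pixel figure follows from how VERA's horizontal scale works: a value of 128 maps one source pixel to each of the 640 output pixels, and smaller values stretch the picture. A small Python sketch of that arithmetic (the function name is made up for illustration):

```python
def visible_source_pixels(hscale, output_width=640):
    """Number of source pixels shown across the screen for a given HSCALE value."""
    return output_width * hscale // 128


print(visible_source_pixels(128))   # 640 (1:1)
print(visible_source_pixels(64))    # 320 (2x zoom)
print(visible_source_pixels(0x33))  # 255
print(32 * 8)                       # 256: width of a 32-tile map of 8x8 tiles
```

So with HSCALE at $33 a 32-tile-wide map almost exactly fills the screen, with one pixel column falling off the edge.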


Very impressive! Coming from someone who wrote a Game Boy Color port of it. (Spoiler: it took 3:32 to finish due to its horrible VRAM layout and access.)
