Jeffrey

New demo uploaded: Wolfenstein 3D - raycasting demo with textures

Posted (edited)
24 minutes ago, Ed Minchau said:

If you do exactly what Wolfenstein did, you'll have the same limitations on an 8-bit machine, yeah. But there are shortcuts that they didn't try until Doom, like binary space partitioning, that could help.

Yes. Those sorts of shortcuts will help (although for now I am starting with how Wolf 3D did its raycasting).

A more fundamental problem is the cycle budget: blitting 304*152 pixels using 8 cycles per pixel takes around 370k cycles. Per vsync tick we only have 133k cycles available. So 60fps seems out of reach for the X16.

That's without doing any raycasting/space partitioning, sprite scaling, AI etc.

But that's OK. I'm fine with 10-15fps. 🙂
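
A quick way to sanity-check that budget is sketched below in Python (plain arithmetic, not X16 code). It assumes the X16's 8 MHz CPU clock and a 60 Hz vsync, which is where the 133k cycles per tick comes from; the function name `blit_budget` is just an illustrative helper.

```python
CPU_HZ = 8_000_000   # X16 CPU clock (assumed 8 MHz)
VSYNC_HZ = 60        # one "tick" per vsync


def blit_budget(width, height, cycles_per_pixel=8):
    """Return (cycles per frame, vsync ticks per frame, fps) for blitting alone."""
    cycles = width * height * cycles_per_pixel
    cycles_per_tick = CPU_HZ / VSYNC_HZ          # about 133k cycles
    ticks = cycles / cycles_per_tick
    fps = CPU_HZ / cycles
    return cycles, ticks, fps


cycles, ticks, fps = blit_budget(304, 152)
print(cycles, round(ticks, 2), round(fps, 1))
```

With these assumptions the blit alone costs a little under 3 ticks per frame, so roughly 20fps is the ceiling before any raycasting is added.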

Edited by Jeffrey

Posted (edited)

It's starting to work in assembly:

raycasting_with_textures.gif

Right now it's about 5fps. That can be improved quite a bit. I'll have to do some cleaning up and I would like to add some small features (maybe more than one texture). Then I'll probably release a new version.

Have fun! 🙂

Edited by Jeffrey

12 hours ago, Jeffrey said:

Yes. Those sorts of shortcuts will help (although for now I am starting with how Wolf 3D did its raycasting).

A more fundamental problem is the cycle budget: blitting 304*152 pixels using 8 cycles per pixel takes around 370k cycles. Per vsync tick we only have 133k cycles available. So 60fps seems out of reach for the X16.

That's without doing any raycasting/space partitioning, sprite scaling, AI etc.

But that's OK. I'm fine with 10-15fps. 🙂

I did a video demo about 18 months ago that did 10fps, and I thought it looked OK. My upcoming demo is 20 fps but it's 160x68, and it looks great. Anything in that 10-12-15 fps range should be good. 

If you could get away with a vertical stretch, 304x96x8 = 233472, so maybe 20 fps is possible?


If you make the textures mirrored across the Y axis you can cut the required texture samples in half again, and just mirror the bottom half from the upper half of the screen.

Don't know if you wanna do this because it may start looking really ugly if the textures are constrained this way perhaps....

Also for non-wall textures (monster sprites?) this surely is a nasty restriction

4 hours ago, desertfish said:

If you make the textures mirrored across the Y axis you can cut the required texture samples in half again, and just mirror the bottom half from the upper half of the screen.

Don't know if you wanna do this because it may start looking really ugly if the textures are constrained this way perhaps....

Also for non-wall textures (monster sprites?) this surely is a nasty restriction

He's using the Wolfenstein 3D textures; unfortunately they don't mirror either vertically or horizontally.

Posted (edited)

What I am noticing is that the raycasting itself (the DDA algorithm) is right now around twice as expensive as the blitting to the screen. So halving vertically won't give you much speed improvement. Halving horizontally would certainly help, since it halves the number of rays.

What is more interesting right now is how to do the DDA algorithm quickly in assembly. Right now, it does for each ray:

  • a tan() and inverse-tan() lookup resulting in two 16-bit values (x_step and y_step)
  • two multiplications for determining the initial intersection points inside the cell you are standing in (x_intercept and y_intercept)
  • quite a lot of branches to implement the logic used by the DDA algorithm (including copying 16-bit numbers)
  • several decrements, increments, subtractions and additions of 16-bit numbers
  • bit shifts to do a lookup in the world-map table(s)
  • two multiplications of a 16-bit value by an 8-bit value (using the x, y distance and cos/sin to get the distance from the camera plane)
  • a divide of a 16-bit value (the distance to the wall) by a 16-bit constant resulting in a wall height (16-bit) --> expensive! (want to use lookup tables)
  • a capping of the wall height (16-bit) into a render height (byte)
  • lots of small details
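
For readers who have not seen a grid DDA before, the steps above can be sketched in Python roughly as follows. This is a generic floating-point version of the technique, not the fixed-point assembly from the demo: the map `WORLD`, the constant `HEIGHT_SCALE` and both function names are made up for illustration, and the two step values are computed with divisions here instead of the tan() lookups in the list.

```python
import math

# Hypothetical map: 1 = wall, 0 = empty. The solid border guarantees every ray hits something.
WORLD = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
HEIGHT_SCALE = 64  # hypothetical: a wall one cell away is drawn 64 pixels tall


def cast_ray(px, py, angle):
    """Step through the grid until a wall cell is hit; return (ray_length, side)."""
    dx, dy = math.cos(angle), math.sin(angle)
    cell_x, cell_y = int(px), int(py)

    # Ray length needed to cross one whole cell in x or in y.
    delta_x = abs(1.0 / dx) if dx != 0 else math.inf
    delta_y = abs(1.0 / dy) if dy != 0 else math.inf

    # Step direction, plus ray length to the first vertical / horizontal grid line.
    if dx < 0:
        step_x, side_x = -1, (px - cell_x) * delta_x
    else:
        step_x, side_x = 1, (cell_x + 1 - px) * delta_x
    if dy < 0:
        step_y, side_y = -1, (py - cell_y) * delta_y
    else:
        step_y, side_y = 1, (cell_y + 1 - py) * delta_y

    while True:
        # Always advance to whichever grid line is closer along the ray.
        if side_x < side_y:
            length, side = side_x, "x"
            side_x += delta_x
            cell_x += step_x
        else:
            length, side = side_y, "y"
            side_y += delta_y
            cell_y += step_y
        if WORLD[cell_y][cell_x]:
            return length, side


def render_height(ray_length, ray_angle, view_angle):
    """Use the distance to the camera plane (avoids fish-eye), capped to one byte."""
    perp = ray_length * math.cos(ray_angle - view_angle)
    return min(255, int(HEIGHT_SCALE / perp)) if perp > 0 else 255


length, side = cast_ray(2.5, 2.5, 0.0)
print(length, side, render_height(length, 0.0, 0.0))
```

Because the loop stops at the first wall cell it meets, anything behind that wall is never visited, which is why the per-ray cost is dominated by the setup and the final multiply/divide rather than by the map size.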

For setting up the rays I also change the input so that I only have to do the logic for one quadrant.

My gut feeling is that the above should take (maybe) several hundred cycles. Maybe 300-400? So 304 rays * 300-400 cycles = roughly 90,000 to 120,000 cycles. So maybe 1 tick. Yet it is spending about 7-8 ticks now. So there is much room for improvement, I think.
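
One way to attack the most expensive item in the list, the 16-bit divide, is the lookup table mentioned there. Below is a hedged Python sketch of the idea. The 8.8 fixed-point distance format, `HEIGHT_CONST`, and the choice to index the table with the top 10 bits of the distance are all assumptions for illustration, not details taken from the demo.

```python
HEIGHT_CONST = 64 * 256  # hypothetical: distance 1.0 (256 in 8.8 fixed point) gives 64 pixels
SHIFT = 6                # drop the 6 low bits of the 16-bit distance
TABLE_SIZE = 1 << (16 - SHIFT)  # 1024 one-byte entries = 1 KB


def exact_height(dist):
    """The divide the table replaces, with the cap to a byte folded in."""
    return min(255, HEIGHT_CONST // max(1, dist))


# Precomputed once; on the real machine this would be a table in RAM.
HEIGHT_TABLE = [exact_height(i << SHIFT) for i in range(TABLE_SIZE)]


def lookup_height(dist):
    """Replace divide + cap by a shift and one table read."""
    return HEIGHT_TABLE[dist >> SHIFT]


for d in (256, 300, 512, 4096):
    print(d, exact_height(d), lookup_height(d))
```

The table is exact wherever the distance is a multiple of 64 and coarser in between (distance 300 reads the entry for 256), so nearby walls may need a finer table or a second table for small distances.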

Basically I implement the logic described in this video:

It would be cool if we could iterate together by suggesting / showing to each other what example assembly snippets would be faster in order to bring down the cycle count needed for this algorithm. 🙂

First I have to release though. So back to doing some (much needed) cleanup again 😉 

Edited by Jeffrey


You have per height routines, but could you also have a per texture+height routine? You could take advantage of textures created with vertical runs of identical pixels to reduce “reads” from the texture. I’d imagine the answer is that that would use a ton of memory, but I’m wondering how much memory the routines are using up now?


Posted (edited)
48 minutes ago, izb said:

You have per height routines, but could you also have a per texture+height routine? You could take advantage of textures created with vertical runs of identical pixels to reduce “reads” from the texture. I’d imagine the answer is that that would use a ton of memory, but I’m wondering how much memory the routines are using up now?



Thanks. That would be too much RAM usage. Right now the longest routine is 64 reads and 182 writes. So that's (64+182)*3 = 738 bytes. So I reserve 1 KB per routine now (for fast access to the routines). That covers all of banked RAM right now. I can probably pack it better, but 512 routines times the number of textures would be way too much memory.
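
The memory figures in that reply can be checked with a few lines of Python. The 3 bytes per instruction follows from absolute-addressed LDA/STA on the 6502; the texture count of 8 is a made-up number, only there to show how quickly a per-texture variant would grow.

```python
BYTES_PER_INSTRUCTION = 3   # LDA/STA with an absolute address assemble to 3 bytes each
READS, WRITES = 64, 182     # longest routine, from the post
SLOT_BYTES = 1024           # 1 KB reserved per routine
ROUTINES = 512

longest_routine = (READS + WRITES) * BYTES_PER_INSTRUCTION
total_kb = ROUTINES * SLOT_BYTES // 1024

HYPOTHETICAL_TEXTURES = 8   # assumption, not from the thread
per_texture_kb = total_kb * HYPOTHETICAL_TEXTURES

print(longest_routine, total_kb, per_texture_kb)
```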

Another way to reduce (dummy) reads is to have (smaller) textures in normal memory (but "striped" pixel by pixel) and simply hard code the needed read addresses (with an x-index for the texture index) in each routine for all walls smaller than the texture height.

For example:

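LDA $5603, X      ; read one texel from the striped texture data (X = texture index)
STA VERA_DATA0    ; write it to the screen through the VERA data port
LDA $5719, X      ; next hard-coded texel address for this wall height
STA VERA_DATA0
...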

Where X contains the texture index.
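
To make the idea concrete, here is a small Python sketch of a generator that emits such an unrolled routine for one wall height. The memory layout is purely hypothetical (`TEX_BASE`, one 256-byte page per texel row so that X can select the entry); the addresses in the example above suggest the real layout is different.

```python
TEX_HEIGHT = 64      # texels per texture column
TEX_BASE = 0x5000    # hypothetical start of the striped texture data in normal RAM
ROW_STRIDE = 0x100   # hypothetical: each texel row gets its own 256-byte page, indexed by X


def routine_for_height(height):
    """Return the assembly lines for drawing a wall column `height` pixels tall."""
    lines = []
    for i in range(height):
        # Nearest-neighbour scaling: which texel row lands on screen row i.
        row = (i * TEX_HEIGHT) // height
        lines.append(f"LDA ${TEX_BASE + row * ROW_STRIDE:04X}, X")
        lines.append("STA VERA_DATA0")
    lines.append("RTS")
    return lines


print("\n".join(routine_for_height(4)))
```

Since only the texel rows that are actually drawn get a read, a wall shorter than the texture never performs the dummy reads needed to skip texels.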

Edited by Jeffrey


Well, if you're going to squash vertical pixels, you may as well squash the horizontal resolution the same amount, scale it back up with VERA, and use the extra space in the single screen bitmap as double buffering space.

I think even a 286 couldn't play Wolf3d at full resolution / FPS.


Looks great! 

I'm curious if you are doing any culling when you render out the world?


I don't think a raycaster needs special culling, because each vertical pixel column is drawn exactly once, using the first wall (texture) that is hit by that column's view ray. So any walls 'behind' others are never seen by the algorithm.


It's a shame that OPL and OPM are totally different animals, otherwise the AdLib sound is also right there for the playing, too. I DL'd the Wolf3d source to look into that possibility, but haven't come up with any ideas for how to convert between the two on the fly.

On 3/16/2021 at 11:57 AM, ZeroByte said:

Well, if you're going to squash vertical pixels, you may as well squash the horizontal resolution the same amount, scale it back up with VERA, and use the extra space in the single screen bitmap as double buffering space.

I think even a 286 couldn't play Wolf3d at full resolution / FPS.

I was thinking about that too. If the VERA HScale is $33, that would show 256 columns of pixels, and a VScale of $22 would show 128 rows; so that's 32 KB for a screen at 8bpp. A tile map could just be 32x32 of 8x8 tiles (so 1024 tiles), which gives two full screens, so you can wait for VSYNC to flip screens. I'd use $08000 to $17FFF for the tile data and move the layer 1 tile data down to $03800 and the layer 0 tile map to $03000.
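
Those numbers can be checked with a short Python sketch. It assumes VERA's scale registers work as a fraction of 128 (so $80 means 1:1 on the 640x480 output), which is my understanding of the hardware; the helper name `visible_pixels` is made up.

```python
def visible_pixels(native, scale):
    """Source pixels shown across `native` output pixels for a VERA scale value."""
    return native * scale / 128


columns = visible_pixels(640, 0x33)   # close to the 256 columns in the post
rows = visible_pixels(480, 0x22)      # close to the 128 rows in the post

screen_bytes = 256 * 128              # one 8bpp screen
tile_data_bytes = 1024 * 8 * 8        # 32x32 tiles of 8x8 pixels at 8bpp
range_bytes = 0x17FFF - 0x08000 + 1   # the proposed VRAM range

print(columns, rows, screen_bytes, tile_data_bytes, range_bytes)
```

Under that assumption the scale values give 255 by 127.5 visible pixels, one screen is 32 KB, and the 64 KB of tile data fills $08000-$17FFF exactly, which is room for two screens.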

The tiles would be arranged in columns so the first column would be tiles 00 01 02 03 etc; the increment for a column would be 8 instead of 320.  There would just need to be a couple of lookup tables for the low byte/high byte of the first pixel in a column to initialize the VRAM address pointer; maybe two such sets of lookup tables for the two screens.
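
A sketch of those lookup tables in Python, under my reading of the layout: tiles are numbered down each tile column (32 tiles per column), the first screen uses pixel rows 0-127 of the 256-row tile area and the second screen rows 128-255. Because the addresses run past $FFFF, the sketch also builds a third table for address bit 16; that extra table is my addition, not something stated in the post.

```python
TILE_BASE = 0x08000
TILE_BYTES = 8 * 8                 # one 8x8 tile at 8bpp
TILES_PER_COLUMN = 32
COLUMN_BYTES = TILE_BYTES * TILES_PER_COLUMN   # 2048 bytes per tile column
ROW_STEP = 8                       # address increment to move one pixel row down
SCREEN_ROWS = 128
SCREEN_COLUMNS = 256


def column_start(x, screen=0):
    """VRAM address of the top pixel of screen column x on screen 0 or 1."""
    tile_column, pixel_in_tile = divmod(x, 8)
    return (TILE_BASE
            + tile_column * COLUMN_BYTES
            + screen * SCREEN_ROWS * ROW_STEP
            + pixel_in_tile)


def build_tables(screen):
    """Low byte, high byte and bit 16 of the start address for every column."""
    starts = [column_start(x, screen) for x in range(SCREEN_COLUMNS)]
    low = [a & 0xFF for a in starts]
    high = [(a >> 8) & 0xFF for a in starts]
    bank = [a >> 16 for a in starts]
    return low, high, bank


low0, high0, bank0 = build_tables(0)
low1, high1, bank1 = build_tables(1)
print(hex(column_start(0)), hex(column_start(8)), hex(column_start(255, 1)))
```

With the address pointer set from the tables and the increment set to 8, each STA moves straight down the column, which is the whole point of arranging the tiles this way.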

The old Wolfenstein raycasting method could be improved by borrowing the binary space partition idea from Doom.  Anything that can be done to reduce the computational overhead for finding which column of pixels to show at what height value will speed this up a lot.

Posted (edited)

Another thing I was thinking about was the Wolf3D textures. The originals are 64 x 64 bitmaps. I just imported that PNG into Paint, resized it horizontally by 50% and then resized again by 200% and saved as a 256 color bitmap, attached. Now the images are all still 64 x 64 but every second column is identical, so this can be resized horizontally to give 32 x 64 images (the second attached image), doubling the number of textures that can be stored in VERA at one time.

edited to add: tried again squashing the images vertically instead, and it looks better (wolf3dwalls3.bmp)

one more: tried again in IrfanView, which resized and resampled using better methods than Paint.  wolf3dwalls5 is squashed vertically by 50%, and 5a is the same file stretched back out to 64 pixels per image again to give a comparison to the original images.  So, the drawing routines could be smaller (LDA a maximum of 32 times for a wall column rather than 64, plus one LDA for the ceiling and another for the floor, and only 128 STA instructions if Vscale is $22) and hence faster.
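
For comparison, the crudest possible version of that squash (no resampling at all, just dropping every second row and doubling rows to stretch back) looks like this in Python. It is only meant to show the data layout involved; IrfanView's resampling will look better than this, and the test texture is made-up data.

```python
def squash_rows(texture):
    """64 rows -> 32 rows by keeping every second row."""
    return texture[::2]


def double_rows(texture):
    """Stretch back to full height by repeating each row, for visual comparison."""
    return [list(row) for row in texture for _ in range(2)]


# Made-up 64x64 texture of palette indices.
texture = [[(x + y) % 256 for x in range(64)] for y in range(64)]
half = squash_rows(texture)
restored = double_rows(half)
print(len(half), len(restored))
```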

wolf3dwalls.bmp wolf3dwalls2.bmp

wolf3dwalls3.bmp

wolf3dwalls5.bmp wolf3dwalls5a.bmp

Edited by Ed Minchau

Posted (edited)

OK, tried once again with IrfanView, but this time I decreased the color depth to 256 colors first and then did the resizing.  wolf3dwalls6 is squashed vertically to 32 pixels high per texture, and 6a is the same image with every row of pixels doubled.  There's not much difference visually between the original image and 6a.

 

Edit: oops, that didn't work any better. Never mind.

wolf3dwalls6.bmp wolf3dwalls6a.bmp

Edited by Ed Minchau


OK, I think this is the best one yet, the wolf3d textures at half height in 256 colors.  Using these would increase the speed and double the number of textures stored in VERA at once, at the expense of making everything kinda blurry in the y axis.

wolf3dwallsA2.bmp


Hi Jeffrey. This is impressive and congratulations on your achievement. It must have been quite a ride for you to achieve this result. Kudos!


I have nothing to add to this 😀, I am just here to say it is very impressive what you achieved so far and I am looking forward to more progress.

On 3/17/2021 at 3:23 PM, ZeroByte said:

It's a shame that OPL and OPM are totally different animals, otherwise the AdLib sound is also right there for the playing, too. I DL'd the Wolf3d source to look into that possibility, but haven't come up with any ideas for how to convert between the two on the fly.

Where did you find the original sound files? Do you have any tips about these kinds of FM systems/chips? I could probably use some help on the sound/music front 😉

Posted (edited)
11 hours ago, Jeffrey said:

Where did you find the original sound files? Do you have any tips about these kinds of FM systems/chips? I could probably use some help on the sound/music front 😉

I didn't get all that deep into it, but I did get to where I was reading some portion of the source. I can't recall whether it was the high-level calls from the main loop or the low-level stuff that actually wrote data to the chip - lol. I am actually interested in digging deeper on that front. I've done some research on OPL, and the good news is that it's probably easier to translate OPL -> OPM than the other way around, because the OPL is only a 2-operator synth while the OPM has 4. On the other hand, the OPL has more voices and can work in a mode where many of them are drums. I don't know which way Wolf3d worked, but I'm thinking they probably used the 9-voice mode, because there wasn't a lot of percussion in the music as I recall, and they also used FM voices for SFX in addition to music.

I suspect that the SFX data files themselves live in the game's MAP files or other data files - they're obviously not going to be in the source.  So I just downloaded the shareware installer and am about to go take a gander.

Interestingly enough, for the record, the stated system requirements for Wolf3d are a 286/12 with 570K RAM.

Edited by ZeroByte
fixed typo (I had CPU required as 281 - haha)

I don't really know how to use it, but have you thought about using the floating point library in the kernel to do multiply with fraction part? It's possible it might be faster than your implementation.  Although it might involve costly conversions which would balance it out anyway... (I honestly have no idea, I haven't done any research into it, just throwing it out there).

Posted (edited)
1 hour ago, Ender said:

I don't really know how to use it, but have you thought about using the floating point library in the kernel to do multiply with fraction part? It's possible it might be faster than your implementation.  Although it might involve costly conversions which would balance it out anyway... (I honestly have no idea, I haven't done any research into it, just throwing it out there).

Thanks for the idea. 🙂

I'm pretty sure though that doing floating point math on a 6502 CPU (for this purpose) is slower than doing fixed point processing. The 6502 has no hardware support for floating point math, so every operation has to be done in software.

Floating point numbers are (in general) probably easier to deal with, since handling fixed point 8 or 16 bit numbers means a lot of fiddling to make everything fit and not "break". Floating point numbers have a real advantage when it comes to convenience. And if the CPU (or GPU) is hardware-optimized for it, it's most definitely the better solution.
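
As an illustration of the kind of fiddling meant here, below is a minimal Python sketch of an unsigned 8.8 fixed-point multiply. The 8.8 format and the helper names are assumptions chosen for the example; the demo may well use other formats.

```python
FRAC_BITS = 8
ONE = 1 << FRAC_BITS   # 1.0 in 8.8 fixed point


def to_fixed(value):
    """Convert a non-negative float to unsigned 8.8 fixed point."""
    return int(round(value * ONE))


def to_float(fixed):
    return fixed / ONE


def fixed_mul(a, b):
    """Multiply two 8.8 values; the result is truncated to 16 bits like a two-byte store."""
    product = a * b                          # up to 32 bits for 16-bit inputs
    return (product >> FRAC_BITS) & 0xFFFF   # rescale, then keep only 16 bits


ok = fixed_mul(to_fixed(1.5), to_fixed(2.25))
wrapped = fixed_mul(to_fixed(200.0), to_fixed(2.0))
print(to_float(ok), to_float(wrapped))
```

The first product comes out exact, while the second silently wraps around because 400 does not fit in 8 integer bits. Keeping every intermediate value inside its range is exactly the bookkeeping that floating point would do for you, at a much higher cycle cost.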

Edited by Jeffrey

Posted (edited)

Short update

The latest release (version 1.2.0) ran at 5 fps.

My local version now runs at 7.5 fps 😀😀 Around a 50% gain in speed.

I'm currently optimizing the DDA algorithm. So far I did two things:

Still more speed to be gained 🙂

Edited by Jeffrey