Jump to content
Jeffrey

New demo uploaded: Wolfenstein 3D - raycasting demo with textures

Recommended Posts

14 minutes ago, Ed Minchau said:

In the original game the bad guys/ projectiles were all sprites.  That's going to take some trickery to make it work here because we can't resize sprites.

No, but the sprites did not have a ton of animation to them, so you can easily store sprites at different sizes and switch between them when they approach. Individual sprites can be up to 64x64 and you can easily use multiple sprites for the same figure if they need to be bigger.

  • Like 1

Share this post


Link to post
Share on other sites
5 hours ago, desertfish said:

won't this break when you add non-wall elements such as monsters, projectiles

Z-buffer each pixel column, then draw the sprites on the other BG layer (assuming there's enough VRAM)?

Share this post


Link to post
Share on other sites
4 hours ago, SlithyMatt said:

No, but the sprites did not have a ton of animation to them

I read through that part of the source code last weekend - it's interesting how the actors are done - the animations are part of the state machine logic - most states only have 2 or 3 frames of animation.

Share this post


Link to post
Share on other sites
Posted (edited)
14 hours ago, ZeroByte said:

This algorithm sounds pretty good. I've been doing a little thinking about it myself, and perhaps a recursive algorithm is in order. Essentially you call Render(MinX, MaxX, MyRay) and it works as follows:

  • Interpolate the leftmost column (TestCol) of the face the ray hit, clamp TestCol to MinX and cast a ray (testRay) at TestCol
    • If TestCol hits the same face,
      • render from that column to MyRayX.
      • new range is MinX .. TestCol
      • Cast a new TestRay at the center of the new range
      • call Render with the new range and testRay
    • Else call Render with MinX=MinX, MaxX = MyRay.X, TestRay
      • Render returns the rightmost column it rendered - set MinX = returned column+1
      • (it is assumed that everything from old MinX to MyRay.X has now been completely rendered.
  • Do the same thing going rightwards from MyRay.X
    • i.e. if the right end of the face is visible, render from MyRay to that column, and recursive call to render on the remaining rightmost space, else render the rightmost space, then render from MyRay.X to the leftmost column not rendered.

I think the hidden challenge in this method is going to be in providing stack-like behavior for the MinX/MaxX, MyRay values so they don't get lost during the call to yourself. Another optimization would probably be to choose some minimum number of columns where it's faster to just brute force that left to right, and just do it if MinX/MinY are smaller than this threshold.

This algorithm should be very fast when walls are close to the camera, and might be slower when there's a lot of clutter in the wall topology.

I think I laid out something very similar, except instead of calculating the center of the new range every time, I'd have a precalculated list of which pixel columns to do in which order.  You wouldn't be calculating the leftmost column of the face, you'd be interpolating rays between the rays you'd already cast.  The column height or distance (however you want to store it, same thing really) is just interpolated between two rays that happen to be on the same face.  Likewise, choosing which column of the image that corresponds to (or the xintercept/yintercept, same thing really) is an interpolation between the intercepts/columns.  You wouldn't need recursion, just a way of seeing whether the next one in the list has already been mooted by interpolation, and the high byte of the distance variable indicates that.

Basically, if you have two rays that hit the same wall, then all the rays between those two are calculated with one multiplication and one addition for the image column choice, and then one multiplication and one addition for the distance. It's pretty much the same number of cycles as you need for just the final step of a raycast; converting the distance into a distance relative to the direction the player faces uses two multiplications, lookups to sin and cosine tables, and an addition.  It would indeed be fast, if you were very close to a wall you might only cast five or six rays and just interpolate all the rest.  But even if there's lots of walls in view, you'd still probably only have to cast 30 or 40 rays total before the entire screen was filled in.

The fastest you can push the data to VERA for the whole image is about 260000 cycles, so 30fps wouldn't leave any time to calculate anything else.  20 fps is 400000 cycles though, and I think you'll be able to hit that easily with this method, with time left over to calculate bad guys and pew pews and sound effects etc.

Edited by Ed Minchau

Share this post


Link to post
Share on other sites
Posted (edited)

I wanted to illustrate what this shortcut does, so I took the starting screen from Jeffrey's demo and marked it up a bit in Paint.  First, there would be four rays cast a using the normal method, marked here in red in the first file below.  A is column 0, B is column 256, C is column 288, and D is column 304.  Note that the display is only columns 0 to 303 so 304 is not actually displayed, just used for interpolation if possible.

The second image shows where the interpolation happens.  Ray E is column 128.  First, we check the high byte of the distance value array corresponding to this column; these will all be set to FF at the start of a screen and only changed if a ray is calculated.  So in this case column E has not already been calculated.  We look to the left and right, rays A and B, and see if they point to the same wall.  They do not, so we cast the ray for E and move on to the next one, ray F in column 64.

Once again we check if this ray is already calculated, nope, and then check to the left and right, A and E.  These two are not on the same wall, so we cast ray F and move on to ray G.  Ray G, in column 192, has not been calculated yet (its distance high byte is still FF) so it checks to the left and right, rays E and B.  These are not pointing to the same wall, so we cast ray G and move on to ray H.

Ray H is in column 32.  We see that it has not been calculated yet, so we check to the left and right, columns A and F.  These two rays are indeed pointing to the same wall, so we can interpolate all the rays in between A and F.  Now if we go further down the list and find ourselves at a column that has already been interpolated, such as column 16, the high byte of its distance will no longer be FF, and that entry in the list is skipped.

These interpolated pixel columns are marked in yellow in the second image.  Eventually ray Y (column 272) will be encountered, and comparing B and C shows that they are on the same wall so columns 257-287 are all interpolated.  And when Z comes up on the list, all columns from 289-303 would be interpolated.

The third image shows all 39 columns that would have to be raycast as red vertical lines, and the 266 interpolated columns are shown in yellow across the bottom - an 87% reduction in the number of rays cast.

Edited to add: this is probably pretty representative of an average.  305/(log(305)/log(2)) ~ 37

startscreensmall2.bmp startscreensmall3.bmp startscreensmall4.bmp

Edited by Ed Minchau
  • Like 2

Share this post


Link to post
Share on other sites
Posted (edited)
6 hours ago, Ed Minchau said:

Basically, if you have two rays that hit the same wall, then all the rays between those two are calculated with one multiplication and one addition for the image column choice

My method was based on yours - I just wanted to think of a way that would generate less rays to keep track of. Plus calculating the center of the new range is trivial (min + max) >> 1

Your illustrations and estimated code complexity are quite interesting. It makes a lot of sense when you see it illustrated like that. The ultimate goal is to cast as few rays as possible.

Edited by ZeroByte
changed reply - I gather that "same wall" is implied to mean "same face of same tile" and not any tile in the same plane as the first one.

Share this post


Link to post
Share on other sites
6 hours ago, Ed Minchau said:

20 fps is 400000 cycles though, and I think you'll be able to hit that easily with this method

And I think that would still exceed the performance of the original Wolf3D. That was more like 15 fps, at best.

Share this post


Link to post
Share on other sites
6 hours ago, Ed Minchau said:

The fastest you can push the data to VERA for the whole image is about 260000 cycles, so 30fps wouldn't leave any time to calculate anything else.  20 fps is 400000 cycles though, and I think you'll be able to hit that easily with this method, with time left over to calculate bad guys and pew pews and sound effects etc.

Gotta have those pewpews!

And regarding sound - the PCM SFX are going to be "more expensive" for X16 than PC because the Soundblaster used DMA - the Wolf3d source just sets up DMA in the SB and lets it rip - but that does mean there's only ever one digital sound playing at a given time - so when the SS guard laughs "Hoo Haw haw!" - and you start shooting before he finishes laughing, the laugh gets cut off by the gunshot.

Share this post


Link to post
Share on other sites
2 hours ago, ZeroByte said:

I gather that "same wall" is implied to mean "same face of same tile" and not any tile in the same plane as the first one.

Yes.  You'd have to keep track of the map block hit and the side of the block; two rays that hit the same block on different faces can't  be interpolated between. 

Share this post


Link to post
Share on other sites
16 hours ago, ZeroByte said:

Gotta have those pewpews!

And regarding sound - the PCM SFX are going to be "more expensive" for X16 than PC because the Soundblaster used DMA - the Wolf3d source just sets up DMA in the SB and lets it rip - but that does mean there's only ever one digital sound playing at a given time - so when the SS guard laughs "Hoo Haw haw!" - and you start shooting before he finishes laughing, the laugh gets cut off by the gunshot.

YM2151 pew-pews are a lot cheaper. SS guard laughing would still be PCM, though.

Share this post


Link to post
Share on other sites
Posted (edited)
8 hours ago, BruceMcF said:

YM2151 pew-pews are a lot cheaper. SS guard laughing would still be PCM, though.

Wolf3d used a combination of OPL and Digitized SFX. The sound files actually contain all SFX in 3 versions each: PC Speaker / Adlib / Digi and the playback routine chooses from the 3 based on which modes are available. So there's an OPL noise that equates to the SS guard laughing (it's just a kind of alarming ding sound though). Some SFX didn't have digital versions at all, such as the sounds it makes when you collect treasures / ammo / weapons, etc.

Since the OPL has 9 voices, Id reserved voice 0 for Adlib SFX and the music just didn't use that channel. The YM2151 only has 8 voices, which works out great for the music playback, but the Adlib SFX would either have to be multiplexed in by the engine, or else maybe the PC Speaker SFX equivalents could be used exclusively but emulated on the PSG. Gotta admit tho - the digital SFX really make a huge difference in the experience, so I'd still try to include those if I were doing this project. Not sure how much overhead it's going to take just to keep that PCM stream going. Like I said in my earlier post - the SB used DMA to play them back but the X16 doesn't have that luxury so it remains to be seen if enough CPU time is going to be available for PCM SFX at all.

Fortunately, though, which SFX are available in which formats is hard wired in the source code, so a port could pick and choose which SFX to ignore the PCM versions if you wanted to keep that to a minimum - but I think if you include any at all, you may as well include them all because whenever those SFX -do- play, you're going to get slowdown then.... it'd be something to play around with.

Edited by ZeroByte
  • Like 1

Share this post


Link to post
Share on other sites

okay - I finally hit a milestone with my IMF to YM2151 conversion experiment. After much banging around in registers and tracking down mistakes with bit shifting, etc, I have gotten "first light" - the driver produces sound and the voices are actually closer to the original than I would've guessed. I had to bootstrap it by just pre-loading a single pitch on all voices, so it's not actually playing the tune yet because I haven't implemented translating the frequency from OPL-speak to OPM-speak. (they're VERY different). Hopefully I can have that done tonight and post a video of it doing its thing - then it will be time to start improving the fidelity, using the LFO, etc. I plan to use the VERA PSG to do the percussion channels but that's going to come later.

As someone else said in some other thread here recently - debugging audio sucks! It doesn't work until suddenly it does - lol.

  • Like 2
  • Thanks 1

Share this post


Link to post
Share on other sites
On 3/25/2021 at 9:03 PM, Jeffrey said:

FYI: the next couple of days I will be very busy IRL 😉

I've been looking through your code; I think I understand it enough to write some tables and subroutines for you.  One thing I noticed in ray.h:

Quote

extern u8 _banked_ram[8096];

That should be 8192, no?

Anyhow, I'll generate the data tables and subroutines needed for the interpolation and will post them here soon.

Share this post


Link to post
Share on other sites

The OPL->YM2151 playback driver is a success!

It has some tweaking to be done, and I ended up cheating and generating a LUT for the YM equivalents of the OPL frequencies because I just couldn't think of a way to do the math in the X16 when I can't use floats or longs. You have to multiply two 16-bit values together and divide by 2^(20-x) - the OUTPUT of this is 16-bit but the process requires up to 32 bits. And concert pitches are almost all floats (226.84Hz, etc). So PHP to the rescue - loop through the entire possible frequencies the OPL can make and have a LUT that returns the YM equivalents. It could be better, as the YM can pitch shift between notes too, so this program is not a 100% OPL emulation, and actually can never be, as the OPL has some strange waveforms.

The Wolf3d tune comes out sounding pretty damned close though - except that the emu still uses 4Mhz for the YM clock so the pitch is wrong.

I'm going to touch a few things up tomorrow and post the results.

  • Like 4

Share this post


Link to post
Share on other sites

@Jeffrey I'm not just looking at your source code, I'm also looking at the compiled code with the META/L editor.  There is a lot of room for improvement in that code; although macros make things easier to read and C makes it easier to write, the compiler produces much longer code than is necessary.  For instance, there's a lot of places where the code reads

LDX #00

LDA #22

STAA VERA_DAT_0

LDX #00

LDA #22

STAA VERA_DAT_0

over and over again.  The second and subsequent LDX and LDA instructions aren't necessary because the value being sent to VERA isn't changing, and each adds two cycles.  The sequences like that aren't all the same, some of them include unnecessary LDX#00 and LDY#08 and LDA ($22),Y instructions over and over, or some similar things.  A few cycles here, a few cycles there, repeated hundreds of times per column of pixels and it really adds up.  If this was all optimized assembly code, then a target FPS of 15 to match the original is definitely achievable.

 

Share this post


Link to post
Share on other sites
Posted (edited)
1 hour ago, Ed Minchau said:

@Jeffrey I'm not just looking at your source code, I'm also looking at the compiled code with the META/L editor.  There is a lot of room for improvement in that code; although macros make things easier to read and C makes it easier to write, the compiler produces much longer code than is necessary.  For instance, there's a lot of places where the code reads

LDX #00

LDA #22

STAA VERA_DAT_0

LDX #00

LDA #22

STAA VERA_DAT_0

over and over again.  The second and subsequent LDX and LDA instructions aren't necessary because the value being sent to VERA isn't changing, and each adds two cycles.  The sequences like that aren't all the same, some of them include unnecessary LDX#00 and LDY#08 and LDA ($22),Y instructions over and over, or some similar things.  A few cycles here, a few cycles there, repeated hundreds of times per column of pixels and it really adds up.  If this was all optimized assembly code, then a target FPS of 15 to match the original is definitely achievable.

 

I think you are looking at the compiled C code. I have a assembly version in the .asm which contains the asm version. The c version is just for testing purposes.

Edit: in fact: I call the asm version from c in different ways (for easier debugging). And some of the c-code I didn't convert yet to assembly because its not very performance critical atm (like drawing the menu once or clearing the render part once).

Edited by Jeffrey

Share this post


Link to post
Share on other sites
2 hours ago, Jeffrey said:

I think you are looking at the compiled C code. I have a assembly version in the .asm which contains the asm version. The c version is just for testing purposes.

Edit: in fact: I call the asm version from c in different ways (for easier debugging). And some of the c-code I didn't convert yet to assembly because its not very performance critical atm (like drawing the menu once or clearing the render part once).

Yeah that makes sense. I'll keep digging. BTW I generated the lookup tables for interpolation last night, that was the easy part. I should have the code done in a day or so.

Share this post


Link to post
Share on other sites
Posted (edited)
On 3/28/2021 at 12:06 AM, Ed Minchau said:

Anyhow, I'll generate the data tables and subroutines needed for the interpolation and will post them here soon.

OK, so I got the data tables generated; this will all go in ray.h 

Quote

// interpolation tables
//
// first the ray to try; the first 4 are always cast

extern i16 _tryray[] = {
0,256,288,304,128,64,192,32,96,160,224,16,48,80,112,144,
176,208,240,272,8,24,40,56,72,88,104,120,136,152,168,184,
200,216,232,248,264,280,296,4,12,20,28,36,44,52,60,68,
76,84,92,100,108,116,124,132,140,148,156,164,172,180,188,196,
204,212,220,228,236,244,252,260,268,276,284,292,300,2,6,10,
14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,
78,82,86,90,94,98,102,106,110,114,118,122,126,130,134,138,
142,146,150,154,158,162,166,170,174,178,182,186,190,194,198,202,
206,210,214,218,222,226,230,234,238,242,246,250,254,258,262,266,
270,274,278,282,286,290,294,298,302,1,3,5,7,9,11,13,
15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,
47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,
79,81,83,85,87,89,91,93,95,97,99,101,103,105,107,109,
111,113,115,117,119,121,123,125,127,129,131,133,135,137,139,141,
143,145,147,149,151,153,155,157,159,161,163,165,167,169,171,173,
175,177,179,181,183,185,187,189,191,193,195,197,199,201,203,205,
207,209,211,213,215,217,219,221,223,225,227,229,231,233,235,237,
239,241,243,245,247,249,251,253,255,257,259,261,263,265,267,269,
271,273,275,277,279,281,283,285,287,289,291,293,295,297,299,301,303
};


// the ray previously calculated to the left of the ray being tried

extern i16 _leftray[] = {
32767,32767,32767,32767,0,0,128,0,64,128,192,0,32,64,96,128,
160,192,224,256,0,16,32,48,64,80,96,112,128,144,160,176,
192,208,224,240,256,272,288,0,8,16,24,32,40,48,56,64,
72,80,88,96,104,112,120,128,136,144,152,160,168,176,184,192,
200,208,216,224,232,240,248,256,264,272,280,288,296,0,4,8,
12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,
76,80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,
140,144,148,152,156,160,164,168,172,176,180,184,188,192,196,200,
204,208,212,216,220,224,228,232,236,240,244,248,252,256,260,264,
268,272,276,280,284,288,292,296,300,0,2,4,6,8,10,12,
14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,
46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,
78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,
110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140,
142,144,146,148,150,152,154,156,158,160,162,164,166,168,170,172,
174,176,178,180,182,184,186,188,190,192,194,196,198,200,202,204,
206,208,210,212,214,216,218,220,222,224,226,228,230,232,234,236,
238,240,242,244,246,248,250,252,254,256,258,260,262,264,266,268,
270,272,274,276,278,280,282,284,286,288,290,292,294,296,298,300,302
};


// the ray previously calculated to the right of the ray being tried

extern i16 _rightray[] = {
32767,32767,32767,32767,256,128,256,64,128,192,256,32,64,96,128,160,
192,224,256,288,16,32,48,64,80,96,112,128,144,160,176,192,
208,224,240,256,272,288,304,8,16,24,32,40,48,56,64,72,
80,88,96,104,112,120,128,136,144,152,160,168,176,184,192,200,
208,216,224,232,240,248,256,264,272,280,288,296,304,4,8,12,
16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,
80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,
144,148,152,156,160,164,168,172,176,180,184,188,192,196,200,204,
208,212,216,220,224,228,232,236,240,244,248,252,256,260,264,268,
272,276,280,284,288,292,296,300,304,2,4,6,8,10,12,14,
16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,
48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,
80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,
112,114,116,118,120,122,124,126,128,130,132,134,136,138,140,142,
144,146,148,150,152,154,156,158,160,162,164,166,168,170,172,174,
176,178,180,182,184,186,188,190,192,194,196,198,200,202,204,206,
208,210,212,214,216,218,220,222,224,226,228,230,232,234,236,238,
240,242,244,246,248,250,252,254,256,258,260,262,264,266,268,270,
272,274,276,278,280,282,284,286,288,290,292,294,296,298,300,302,304
};


// if the above two rays are on the same map block and face, then this table

// is the number of rays to interpolate +1 ;  also rightray minus leftray

//in this case 1 indicates 255 rays, 0 is no
// interpolation.  This table is also the starting point for the interfrac
// table.  If you are interpolating 127 rays, you start at position 128
// on the interfrac table; if you are interpolating 31 rays you start at
// position 32 on the interfrac table

extern i16 _interpolnum[] = {
0,0,0,0,1,128,128,64,64,64,64,32,32,32,32,32,
32,32,32,32,16,16,16,16,16,16,16,16,16,16,16,16,
16,16,16,16,16,16,16,8,8,8,8,8,8,8,8,8,
8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,
8,8,8,8,8,8,8,8,8,8,8,8,8,4,4,4,
4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
4,4,4,4,4,4,4,4,4,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
};

// the different interpolation routines use these fractions; there are
// actually seven tables of fractions in this one page of RAM
// Note that if you are only interpolating one ray (indicated by a 2 in
// the interpolnum table) then you don't need to use this fraction table,
// as the results will just be the average of the leftray and rightray
// parameters.  If you're interpolating 255 values you also don't need 
// this fraction table, as the column number itself would be the fraction


extern u8 _interfrac[]={
0,0,0,128,0,64,128,192,0,32,64,96,128,160,192,224,
0,16,32,48,64,80,96,112,128,144,160,176,192,208,224,240,
0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,
128,136,144,152,160,168,176,184,192,200,208,216,224,232,240,248,
0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,
64,68,72,76,80,84,88,92,96,100,104,108,112,116,120,124,
128,132,136,140,144,148,152,156,160,164,168,172,176,180,184,188,
192,196,200,204,208,212,216,220,224,228,232,236,240,244,248,252,
0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,
32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,
64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,
96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126,
128,130,132,134,136,138,140,142,144,146,148,150,152,154,156,158,
160,162,164,166,168,170,172,174,176,178,180,182,184,186,188,190,
192,194,196,198,200,202,204,206,208,210,212,214,216,218,220,222,
224,226,228,230,232,234,236,238,240,242,244,246,248,250,252,254
};

Edited by Ed Minchau

Share this post


Link to post
Share on other sites

Just a little personal update: I have been very busy lately IRL. But I will be returning to this project. 😉

Also, the last few days/weeks I have been working on a (completely) new demo. And I am very excited about it :). 😃

Lets just say that the x16 is much more capable than I had thought...

  • Like 5

Share this post


Link to post
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.


×
×
  • Create New...

Important Information

Please review our Terms of Use