Does this seem egregious to anyone besides me?


ZeroByte

Okay - so in relation to a recent question regarding my zsound format and the time resolutions it supports, I stated that I was working out how to divide a 16-bit number by 60 into a 16.8 fixed-point value, so that the files could specify a tick rate in Hz. The typical default should be 60 Hz, but if someone wants to make a high-time-resolution file for some reason, the format supports it..... But here's the deal: my divide-by-60 routine is huge - it added almost 200 bytes to the program, and I presume it takes a while to execute. Fortunately, it only executes once when you tell the library to play a song, so it's not like it slows the player down.... but somehow it makes me cringe to have all of this code in there just for x/60.
Welcome to the world of 6502?

I'm sure Woz could do it in 18 lines of code that execute in -14 clock cycles, but I'm not Woz....

Here's the routine - any thoughts? Is this ridiculous and I should re-think my life, or is this just normal?

p.s. - if anyone can tell me how to tell the forum software to make this use highlighting and fixed-width font, I'd appreciate it (and fix this post)

.proc calculate_tick_rate: near
            ; X/Y = tick rate (Hz) - divide by 60 and store to zsm_steps
            ; use the ZP variable as tmp space
            
            value := r0            ; use kernal ZP r0 tmp space
            frac  := r1            ; but with meaningful names here
            stx value
            sty value+1
            stz frac
            ; value >> 6 (value = Hz/64)
            ldx #6
            jsr rshift3
            ; set step = value.
            lda value
            sta step
            lda value+1
            sta step+1
            lda frac
            sta fracstep
            ; currently, step = value = Hz/64.
            ; To make step = Hz/60, It should now get value/15 added to it.
            ; We do value/15 by successively dividing value by 16 and adding
            ; the result to step, which approaches the correct value with
            ; each pass. After 4 passes, we will have exceeded the precision
            ; of 16.8 fixed point, so that's when to stop.
            ldx #4 ; 4 means >> 4
            jsr rshift3 ; still need to rotate all 3 bytes
            jsr add_step
            ; 1 pass complete.
            jsr rshift2 ; falls through to add_step
            jsr rshift2
            ; 3 passes complete. The last pass is shortest (only frac remains)
            lsr frac
            adc fracstep
            ; when doing the carry bit adds, save the results in zsm_steps.zsm_fracsteps
            ; because we're finally done.
            sta zsm_fracsteps
            lda step
            adc #0
            sta zsm_steps
            lda step+1
            adc #0
            sta zsm_steps+1
            rts
            
rshift3:    ; rshift 3 byte value (X = number of shifts)
            lsr value+1
            ror value
            ror frac
            dex
            bne rshift3
            rts
            
rshift2:
            ldx #4    ; we're always >>4 in this phase
:            lsr value
            ror frac
            dex
            bne :-
            ; fall through to add_step to save a little CPU time
            ; i.e. no RTS followed by JSR add_step.
            
add_step:
            ; note that the carry flag is set by the rshift routine and is
            ; important, so no CLC statement as in the normal ADC usage...
            lda frac
            adc fracstep
            sta fracstep
            lda value
            adc step
            sta step
            lda step+1
            adc #0    ; high byte of value is now guaranteed to be 0.
            sta step+1
            rts
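For what it's worth, the arithmetic the routine is aiming for can be modeled off-target. This Python sketch (mine, not part of the library) mirrors the successive-approximation idea described in the comments above:

```python
def div60_approx(hz):
    # hz/60 ~= hz/64 + hz/64/16 + hz/64/16^2 + ... , in 16.8 fixed point
    v = hz << 8                  # promote to 16.8 (8 fractional bits)
    total, term = 0, v >> 6      # first term: hz/64
    for _ in range(5):           # 5 terms exhaust the 8 fractional bits
        total += term
        term >>= 4               # each pass adds the previous term / 16
    return total                 # 24-bit 16.8 result

exact = 31217 * 256 // 60        # best truncated 16.8 answer
approx = div60_approx(31217)     # lands a hair under exact
```

Each right-shift truncates, so the modeled result sits slightly below the true quotient.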

 

p.s. : amusingly, this is post #487 for me, which would have been the model number of the math coprocessor of the 486 had it not been integrated into the CPU by default. How funny that this post would be about a math routine in assembly!

Edited by ZeroByte

The obvious answer is: change the format such that the result of this computation is what's stored in the header so the player library doesn't have to go through this rigamarole - just copy from header and start playing. I guess that makes sense too - but just in case folks think it makes MORE sense to store the value as Hz, this is what kind of crap it takes (for me anyway) to translate the value into something useful on-system for playback. Looking forward to people's thoughts....

The reason I think raw Hz might make more sense is that it's less complicated to go from that to some other timing source - e.g. the VIA or something. Starting with "ticks-per-frame" means having to multiply by some strange ratio - but then again, if you want to drive the sound at 300 updates per second (still less than native VGM), then chances are good that such computations don't concern you much anyway....

Edited by ZeroByte

My 2 cents, maybe looking at this from a different perspective: I remember seeing a video on YT where someone made a music player for the Commander that had serious timing issues. Now of course I think your player will be much more accurate. But I still think that forcing an arbitrary timing format onto a 60 Hz grid will introduce noticeable jitter. The error can be up to ± 8 milliseconds, so the jitter will range up to 16 milliseconds. Even when working with Concerto, I found the rounding errors of its time step (around 7 milliseconds, less than half the duration of VSync) annoying, but bearable. But I still preferred having the music be in sync with the grid rather than relying on rounding to the grid.

So if you allow for arbitrary step durations but intend to play back at 60 Hz, prepare for timing artifacts ranging from noticeable to annoying. If you want to allow that anyway, you need this conversion; if you don't, the time step conversion is unnecessary.

On the other hand, I like that you want to make it possible to play back at a different step, if I understood your post correctly. In that case, would you not need a more general conversion routine than dividing by 60?


Here's a 24 bit division routine from codebase64 that is an extended version of their 16 bit division routine (which I use in prog8), so I expect you can extend it once again to 32 bits.

It looks smaller than your code, but I didn't measure.

https://codebase64.org/doku.php?id=base:24bit_division_24-bit_result

Also, here's another person with a different-looking 32-bit division routine, arguing that it is useful for small code sizes:   https://atariage.com/forums/topic/237463-looking-for-32-bit-division-routines/?tab=comments#comment-3240032

I haven't tried either of them, but perhaps they're of some use to you.

 

but if it's always division by 60, perhaps you can cheat a bit? Start with division by 64 (which is a simple shift) - maybe this is precise enough already? Otherwise, perhaps there's a way to adjust the result somewhat to make it more precise, I don't know.

Edited by desertfish

On 11/20/2021 at 5:43 AM, desertfish said:

Here's a 24 bit division routine from codebase64 that is an extended version of their 16 bit division routine (which I use in prog8), so I expect you can extend it once again to 32 bits.

It looks smaller than your code, but I didn't measure.

https://codebase64.org/doku.php?id=base:24bit_division_24-bit_result

Also, here's another person with a different-looking 32-bit division routine, arguing that it is useful for small code sizes:   https://atariage.com/forums/topic/237463-looking-for-32-bit-division-routines/?tab=comments#comment-3240032

I haven't tried either of them, but perhaps they're of some use to you.

 

but if it's always division by 60, perhaps you can cheat a bit? Start with division by 64 (which is a simple shift) - maybe this is precise enough already? Otherwise, perhaps there's a way to adjust the result somewhat to make it more precise, I don't know.

Surely there is ... since 60 is just UNDER 64, the true result will be equal to or greater than the result of divide by 64.

So multiply the result of the divide by 64 times 60, subtract that product from the original value. Multiplying times 60 is (X*16-X)*4, so 6 shifts left and a 16bit subtract.

For a 16bit unsigned integer, the residual can't be as big as 4,200, so you can do division by repeated subtraction of 960 (60*16, maximum 4 iterations), 240 (60*4, maximum 3 iterations) and 60 (maximum 3 iterations):

  1. START: TUNE if the residual is less than 960
  2. Subtract 960 from the residual and increment the trial result by 16
  3. Back to START
  4. TUNE: FINETUNE if the residual is less than 240
  5. Subtract 240 from the residual and increment the trial result by 4
  6. Back to TUNE
  7. FINETUNE: FINISHED if the residual is less than 60
  8. Subtract 60 from the residual and increment the trial result
  9. Back to FINETUNE
  10. FINISHED

That has at most 10 iterations. I don't know whether it would be smaller than the above, but it seems like it would test out as faster.
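If I've read the scheme right, it models like this in Python (my sketch; names and structure are mine, not tested 6502 code):

```python
def div60_int(v):
    # first guess: v/64, which never overshoots v/60
    q = v >> 6
    # multiply q by 60 as (q*16 - q)*4, subtract to get the residual
    residual = v - (((q << 4) - q) << 2)     # at most 4155 for 16-bit v
    # repeated subtraction of 960, 240, 60: at most 4 + 3 + 3 = 10 rounds
    for step, inc in ((960, 16), (240, 4), (60, 1)):
        while residual >= step:
            residual -= step
            q += inc
    return q, residual                       # quotient, true remainder
```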

Edited by BruceMcF

I'll have to look at that. My routine does almost exactly that, except repeated addition.

My x/60 routine is essentially: X = X>>6 + X>>10 + X>>14 + X>>18 + X>>22

@kliepatsch The Hz/60 computation is strictly my own implementation of a method to do arbitrary-rate playback in once-per-frame chunks. I was close to having it working last night... just ran out of time (4am). The single-step routine is exposed as a direct call, so you could set up a VIA IRQ and call it at exactly the right rate if that's what you need to do.

My import script already downsamples time from 44100 Hz to 60 Hz, and I've never detected any jitter in the results. My Sonic demo used it. Other audio demos I've shared to YouTube do it. It just doesn't seem to make any perceptible difference. I DO see an issue in some cases tho - for instance, in the audio I'm helping with on City Connection, the tunes have PSG volumes being driven by ADSR routines which update the volume faster than 1/60, so the peak volumes are being overwritten in-frame by the subsequent lower volumes. The net result is that the volume of the PSG is too low, because the "attacks" get overwritten by the first couple of decay adjustments. This is a problem in the "encoder" though - I could add cmdline switches to select different methods (such as "use max volume over the frame" or "use average volume over the frame" etc) to smooth out these kinds of issues.

I'll soon see what it sounds like to play these 60hz ZSM files at varying speeds. That's what all this time stuff opens up as a feature: playing back the same tune faster or slower. So you can speed up the music like in Sonic when you get the speed sneakers powerup.

I've done some calcs to see what kind of error creeps in when the result isn't an integer (i.e. the file's rate isn't an even multiple of 60).

One of my test values is 31,217 Hz (just picked an arbitrary weird number). This would be 520.2833333 steps per frame. My /60 routine ends up with 520.28125 - about a 0.0004% error. This means the song will play back ever so slightly too slow. But since I'm using the timing equivalent of "subpixel scrolling" - I expect the playback to remain smooth in such cases. Essentially, just a little more than 1 in 4 frames will play 521 steps instead of 520.
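The numbers check out in a quick Python cross-check (my check, not the player code):

```python
hz = 31217
ideal = hz / 60            # 520.28333... steps per frame, ideally
fixed = hz * 256 // 60     # truncated 16.8 value: 133192, i.e. 520.28125
# 72 of every 256 frames play one extra step (72/256 = 0.28125),
# leaving the tempo roughly 0.0004% slow overall
```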

We shall see.

Again, I believe this entire thing to be a bit of an edge case - 60Hz resolution is strongly recommended. My "on-the-fly resampling" method is basically a way to have built-in support for tunes with non-60Hz rates which my tools can't even generate at the moment, but as this is a "reference implementation" it would be dumb not to support your own features, right? 🙂

 

Edited by ZeroByte

It seems like, for size rather than speed ... couldn't you just not optimize it: do a plain 24-bit-by-8-bit divide, with the 16-bit dividend sitting above a zeroed fractional byte?

I think it might be something like the following. If "N" is in the zero page, that makes it under 40 bytes. But I stress, this is looking at a 64bit/32bit routine and transposing, it is absolutely untested, so it would be totally unsurprising if there is some real obvious mistake below.

; Value passed in AX, result returned in N...N+2. Remainder is in N+3, but as the result is in 16.8 format, the remainder can be ignored unless it is desired to round up for remainders above 127.
STZ N
STA N+1
STX N+2
STZ N+3
; the first five trial subtracts will always fail ... for space optimization, just let them
LDX #$19 ; 24 = $18, plus 1 to shift the final result bit in.
CLC
; Shift dividend one bit left
LOOP:
ROL N
ROL N+1
ROL N+2
DEX
BEQ END
ROL N+3
LDA N+3
SEC
SBC #60
BCC LOOP ; trial failed
STA N+3
BRA LOOP
END:
; optionally, to round 16.8 to closest based on residual remainder:
;   LDA N+3
;   BPL +
;   INC N
;   BNE +
;   INC N+1
;   BNE +
;   INC N+2
+ RTS
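In higher-level terms, the loop is plain restoring division; here's my Python paraphrase (with the same caveat that it's an untested sketch):

```python
def div60_long(hz):
    # (hz << 8) / 60 by shift-and-trial-subtract, one result bit per pass
    n = hz << 8                    # N..N+2: dividend over a zero fraction byte
    rem = 0                        # N+3: running remainder
    for _ in range(24):
        rem = (rem << 1) | ((n >> 23) & 1)  # top dividend bit into remainder
        n = (n << 1) & 0xFFFFFF
        if rem >= 60:              # trial subtract succeeds: result bit = 1
            rem -= 60
            n |= 1
    return n, rem                  # 16.8 quotient, residual remainder
```

Exact round-to-nearest would bump the quotient whenever the final remainder is at least half the divisor.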

 

Edited by BruceMcF

On 11/20/2021 at 7:02 PM, ZeroByte said:

My import routine already downsamples time from 44100 Hz to 60 Hz, and I've never detected any jitter in the results. My Sonic demo used it. Other audio demos I've shared to YouTube do it. It just doesn't seem to make any perceptible difference.

I just tried a couple of things with Concerto and indeed, the jitter is not as bad as I remembered. When making music with Concerto a couple of weeks ago, the jitter was bothering me a lot. It was barely noticeable, but because I KNEW the jitter was real, it kept distracting me all the time, so I had to adapt the song tempo to some integer tick count. And thinking that at 60 Hz, jitter could be twice as bad, it simply wouldn't be fun to make non-60 Hz music. Just playing it back would probably not be as bad.


I've sat there in Deflemask waffling over a sub-step delay value on a note. You can definitely tell the difference when you're the source and hyper-focused on it. Just listening in full context though, meh. It's like agonizing over a particular pixel in Photoshop if you're clipping along an edge.


On 11/20/2021 at 3:03 PM, ZeroByte said:

@BruceMcFthe subtraction test says lda N3 .. sta N3.  Is that supposed to be N+3 ?

So n1-3 holds a 16.8? The .8 part isn't a modulo remainder is it?

N should be holding the ".8" part (starting as 0, assuming a 16bit dividend), with N+1:N+2 holding the 16bit whole part of the dividend. Rotate it into N+3 and trial subtract: since carry is an inverted borrow in the 6502 family, a trial subtraction that generates a borrow leaves carry clear - a 0 bit in the result - and one that doesn't leaves carry set - a 1 bit in the result. Save the trial subtraction result when no borrow is generated, so N+3 holds the ongoing remainder through to the end of the process; but the remainder doesn't much matter if you have a 16.8 result.

The "CLC" at the start puts a zero bit in the low bit of the dividend | result, which should be shifted into the carry flag by the 25th iteration of the rotate of the dividend|result location.

Edited by BruceMcF

On 11/20/2021 at 3:28 PM, SlithyMatt said:

My 16.8 fixed point library will do division by whole numbers (the fraction is lopped off, as I didn't need that, but it looks like you don't, either, if you are just dividing by 60): https://github.com/SlithyMatt/multi-mandlebrot/blob/main/6502/fixedpt24.asm

Yes, the most space optimal is "use whatever 16.8 divide you have", if you should happen to have one. 16.8 / 16 -> 16.8 would be slightly slower because of the 16bit trial subtraction, but that's only an issue if optimizing for speed.

With a 16.8 / 16 -> 16.8 in a library ... I would consider testing the high byte of the divisor and using a 16.8 / 8 -> 16.8 internally when the high byte is zero.

Edited by BruceMcF

Just rewatched Ben Eater's video on converting binary to decimal using essentially this algorithm, and I totally get it now.

To round at the end: if N+3 >= $80, then increment the value in N...N+2.

I'm going to use this algo. Thanks!

So, a night wasted producing code that I won't use - but valuable in that it produced more knowledge, so worthwhile anyway.

Edited by ZeroByte

I think for fun I'm going to count cycles and compare my algo's speed vs the generic divide-by-anything algorithm. It'll be interesting to see if I'm faster or slower than just looping.

Wait - Box16's debugger will count them for me. Just step over the initial JSR and look at the cycles indicator.


No that's backwards.

VGM's native sample tick rate is 44100 Hz. Hence, if someone wanted to convert to ZSM format 1:1, the ZSM should advance at 44.1 kHz. (That's what I did for my first successful VGM-to-X16 conversion for Kevin back in 2019, when he needed to test the YM2151's ability to read from the system bus at 8 MHz.) That works out to 735 ticks per frame at 60 Hz.

So the sample rate header is 16bit. Another example is id Software's IMF format. Wolf3D's tick rate is like 11 ticks per frame for some reason... that's 660 Hz, which is also a 16bit number.

Obviously it's way too CPU-expensive to actually drive an IRQ at 44.1 kHz for almost anything other than a dedicated player, but since "sane" values also require more than 8 bits, I figure why not reserve 2 header bytes instead of one?

Going the other way, the slowest tick rate would be 1 Hz, which is 1 tick every 60 frames.

My playback routine's storage format of 16.8 covers this range, as this would be on the order of $0000.04.

But such a slow rate is obviously absurd for music.
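For a sense of the range, the header value maps to steps-per-frame like this (quick Python, my illustration):

```python
# steps per frame = rate / 60, stored as truncated 16.8 fixed point
for hz in (44100, 660, 60, 1):
    fixed = hz * 256 // 60
    print(hz, fixed / 256)
# 44100 Hz -> 735.0 steps per frame (VGM's native rate)
# 660 Hz   -> 11.0  (IMF-style)
# 60 Hz    -> 1.0
# 1 Hz     -> 0.015625, i.e. $0000.04
```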


Bear in mind that details like this are exactly what I want to spare developers from having to learn about. The basic goal is to provide a turnkey solution for having music playback in projects. The routine I'm doing this math for is a simple way to avoid worrying about how frequently to call the step_music function. Playmusic should be called once per frame and it just steps the music however many times is required by the header data's specification. It keeps track of partial steps and calls step_music one extra time whenever the residual rolls over.

In a nutshell, it "resamples" to 60Hz. It may not use the BEST method, but you're free to generate whatever Hz is specified in the header and call step_music directly if that's what you require.
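The bookkeeping boils down to a fractional accumulator. A sketch of the idea (variable and function names are mine, not the actual API):

```python
def steps_this_frame(state):
    # state: [whole, frac, residue] - rate is whole + frac/256 steps/frame
    steps = state[0]
    state[2] += state[1]          # accumulate fractional steps
    if state[2] >= 256:           # fractional accumulator rolled over:
        state[2] -= 256
        steps += 1                # play one extra step this frame
    return steps                  # call step_music this many times

state = [520, 72, 0]              # e.g. 31217 Hz at 60 fps
total = sum(steps_this_frame(state) for _ in range(256))
# over 256 frames this plays exactly 520*256 + 72 = 133192 steps
```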


I just did a comparison between my algorithm (bespoke), @BruceMcF's General algorithm, and one that @Scott Robison proposed on Discord (Galaxybrain):

Galaxybrain took my algorithm and enhanced it by using clever byte-order manipulation to cut down the number of shifts required to perform the computation.

For computing 31,217/60 into 16.8 fixed point format: (answer = 48 08 02)

galaxybrain: 46 08 02 , 230 cycles, 159 bytes
bespoke: 46 08 02 , 531 cycles, 122 bytes
general: 48 08 02 , 3813 cycles, 70 bytes

As you can see, bespoke is pretty much what I said - a bit of a balance between speed and size (I could've gotten about 30-60 fewer cycles, but it would've been at least as large as Galaxybrain).

My computation X = X>>6 + X>>10 + X>>14 + X>>18 + X>>22 is slightly off from the answer I computed by hand. Not sure why, as it's the same algorithm, and it didn't have this slight error in my main library. (I assembled all these into a testbed project). Galaxybrain gets the same error though, so not sure what's going on... Anyway, that's the results. As BruceMcF states - the general purpose answer is clearly the winner in size optimization (no shock there). As this calculation is only done once when a song is loaded, that may not be the biggest sacrifice in performance to make. It could probably be optimized a little for my application by skipping the first 5 loops and pre-shifting everything.
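One plausible source of the 2-LSB gap (my reading of the arithmetic, not a verified trace of either routine): each right-shift in the shift-and-add method truncates, and the discarded fractions add up:

```python
v = 31217 << 8                                  # 16.8 dividend
terms = [v >> s for s in (6, 10, 14, 18, 22)]   # truncating shifts
approx = sum(terms)                             # 0x020846 -> "46 08 02"
exact = v // 60                                 # 0x020848 -> "48 08 02"
# roughly 0.25 + 0.76 + 0.48 + 0.90 LSBs (plus the tail of the series)
# are discarded along the way, accounting for the difference
```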

For reference, right after starting a song, it's usually going to load quite a few YM bytes (assuming it's got FM tracks) which is slow, about 150 cycles per write (due to having to wait on the YM to be ready for each successive write). So the first two algos pretty much equate to just a few YM writes' worth of time.

Well, that's the tale of the tape, folks. This has been a fun exercise.

Edited by ZeroByte

Putting the general 16.8 / #60 -> 16.8 algorithm into the format of the original ... but using the new API scratch register space, on the theory that a general set-up process may be using any of r0-r10 for holding values used in the process.

As with the original use of r0-r2, this shouldn't be placed in an interrupt routine.

I also noticed that since the remainder is not the actual remainder, but the residual remainder after calculating the /256th fractional part, moving the loop indexing and exit saves two bytes as the "rol rem" required for the first 24 iterations can be used in the final iteration to get the high bit of the residual remainder into the carry flag for rounding the final result.

A third byte may be saved by omitting the "clc" at the start of the loop; that bit flows through the shift but is not part of the final result. If carry could be set or clear at the beginning of the process, having this garbage bit can confuse debugging, but it doesn't affect the result.

On the other hand, the "*32" pre-shift to speed up the process - possible because, when dividing by 60, the first five trial subtractions are known to fail - adds around 11 bytes. The speed gain comes from shifting one byte fewer, replacing five left shifts with an equivalent three right shifts, unrolling that pre-shift, and being able to use the accumulator for it.

.proc calculate_tick_rate: near
            ; X/Y = tick rate (Hz) - divide by 60 and store to zsm_steps
            ; use the ZP variable as tmp space

            value := r11
            frac := r12
            rem := r13

            stx value
            sty value+1
            stz frac
            stz rem

            ; the first five trial subtracts will always fail ... for space optimization, just let them
            ldx #25        ; 24 trial subtracts, plus 1 partial loop to shift the final result bit in.

            ; may be omitted
            clc                  ; this bit will be shifted into bottom of residual remainder at end
                                   ; avoiding a possible garbage bit floating through makes it easier to debug

loop:
            ; Shift dividend one bit left, then into remainder byte
            ; In iterations 1-24, the remainder is prepared for next trial subtract
            ; In iterations 2-25, the 24 result bits are shifted in to replace the value.
            rol frac
            rol value
            rol value+1
            rol rem ; in last iteration, residual remainder high bit is in carry
            dex
            beq endloop
            lda rem
            sec
            sbc #60
            bcc loop
            sta rem
            bra loop

endloop:
            ; round up if residual remainder is >=$80
            ; high bit of residual remainder is already in carry
            lda frac
            adc #0
            sta zsm_fracsteps
            lda value
            adc #0
            sta zsm_steps
            lda value+1
            adc #0
            sta zsm_steps+1
            rts

~~~~~~~~~~~~~~~~~~~
For the slight speed optimization since the first five iterations are known to fail:
            ; the first five trial subtracts will always fail, so pre-shift by five
            ; start at destination and shifting right by three gives the same effect
            stz value
            stz frac
            stx value+1
            tya
            lsr
            ror value+1
            ror value
            lsr
            ror value+1
            ror value
            lsr
            ror value+1
            ror value
            sta rem

            ; Now do the remaining 19 (of 24) shifts, plus 1 to shift in the final result bit
            ldx #20

Edited by BruceMcF

On 11/20/2021 at 1:20 PM, kliepatsch said:

I just tried a couple of things with Concerto and indeed, the jitter is not as bad as I remembered. When making music with Concerto a couple of weeks ago, the jitter was bothering me a lot. It was barely noticeable, but because I KNEW the jitter was real, it kept distracting me all the time, so I had to adapt the song tempo to some integer tick count. And thinking that at 60 Hz, jitter could be twice as bad, it simply wouldn't be fun to make non-60 Hz music. Just playing it back would probably not be as bad.

Well, since last night was "algorithm drag racing" I didn't actually implement my Hz conversion in the player, but tonight I did, and it works like a champ. I haven't made a "setspeed" function yet, but I tested by using Box16 to poke new speed values directly into RAM. Once I make a function, I guess I should add a speed change feature to my Sonic demo where you can press the up/down arrow keys, and Sonic will speed up and so will the music. That'd be kind of cool.


On 11/20/2021 at 9:35 PM, ZeroByte said:

I just did a comparison between my algorithm (bespoke), @BruceMcF's General algorithm, and one that @Scott Robison proposed on Discord (Galaxybrain):

Galaxybrain took my algorithm and enhanced it by using clever byte-order manipulation to cut down the number of shifts required to perform the computation.

It occurs to me I gave you generic 6502 code. Did you make it 65C02? Could save a few more bytes...


You could approximate. The tick range isn't going to be that large, I wouldn't have thought. 1/60 (0.016667) is pretty close to 1/64 (0.015625), better done here obviously as 4/256. The error is 6.25%.

That's a bit high - high enough to be noticeable, perhaps - so you could have a fudge factor or factors to add, based around a variety of tick rates, given the limited range of ticks in practice. You could knock up some numbers in a spreadsheet and get it near enough for practical purposes.

It's even debatable whether there's any point in representing it as a fixed-point fraction; I don't think many people are going to be able to tell a difference of 1-2% in tempo.
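Concretely (quick Python; the correction term shown is the same first-order one the shift-and-add approaches in this thread use):

```python
approx1 = 1 / 64             # one shift: tempo runs 6.25% slow
approx2 = 1 / 64 + 1 / 1024  # plus one fudge term: about 0.39% slow
errors = [1 - a * 60 for a in (approx1, approx2)]  # relative error vs 1/60
```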


On 11/22/2021 at 3:20 AM, paulscottrobson said:

I don't think many people are going to be able to tell a difference of 1-2% in tempo.

You're definitely correct here. From playing around with the playback speed functionality, even at lowly 60Hz resolution, you don't hear much difference in tempo when modifying the fractional step in amounts less than 0x08.

I think the accuracy might come into play a little more in programs that are trying to keep music and something else synchronized - over time, even very small error rates accumulate into a noticeable quantity. Obviously I'm not trying to address this, as 8 fractional bits is not really "accurate" enough for such tasks either.

