TomXP411

File I/O performance


I was curious how fast the Commander can actually read files from disk, so I fired up the emulator and did some quick tests. 

I wrote a very simple assembly routine that lives in Golden RAM (the section at $400), and timed it to see how fast it ran. 

With a 32K file, it takes 155 ticks, or 2.58 seconds. That makes file read throughput 12.7KB/s. 

What does that mean, in terms of real numbers?

You can fill the BASIC program area ($800-$9EFF) in 3 seconds

An 8K bank will load in 0.64 seconds

You can load all 512KB of banked memory in about 42 seconds.

This is far less than the CPU's theoretical maximum throughput, and it averages out to roughly 125 machine language instructions per byte read. I'm not sure whether the time is spent in the FAT32 driver, or whether there are delays in the emulated hardware. That's something else to look at.
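As a sanity check, the arithmetic above can be reproduced in a few lines (a sketch; the 60 Hz jiffy rate, 8 MHz clock, and ~5 cycles per average instruction are assumptions, not measurements):

```python
# Re-deriving the throughput figures from the 155-jiffy measurement.
JIFFIES_PER_SEC = 60   # assumption: one tick = one 60 Hz jiffy
CPU_HZ = 8_000_000     # X16 CPU clock

seconds = 155 / JIFFIES_PER_SEC            # ~2.58 s
throughput = 32 * 1024 / seconds           # bytes per second, ~12.7 KB/s

basic_area = 0x9EFF - 0x0800 + 1           # BASIC program area size
print(f"{seconds:.2f} s -> {throughput / 1000:.1f} KB/s")
print(f"BASIC area ($800-$9EFF): {basic_area / throughput:.1f} s")
print(f"8K bank: {8 * 1024 / throughput:.2f} s")
print(f"512KB banked RAM: {512 * 1024 / throughput:.0f} s")

# ~630 cycles per byte; at ~5 cycles per instruction, ~125 instructions
cycles_per_byte = CPU_HZ / throughput
print(f"~{cycles_per_byte:.0f} cycles, ~{cycles_per_byte / 5:.0f} instructions per byte")
```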

In the meantime, the next step would be to evaluate the speed of popular decompression routines. Exomizer and PuCrunch seem to be the most common 6502 compression systems right now.

** Edit: this has caused a little confusion, since I revised these numbers. The original result was around 9.8KB/s, using the attached code. I got a modest increase in performance by skipping the READST KERNAL routine and reading $0286 directly. The problem is that $0286 is not frozen and could change.

So I'm going to request (or supply) a small change to the KERNAL to return the current address of the ST variable, so we can query it directly. 

 

 

chkin.asm

 

 

Edited by TomXP411

In the file assembler topic we had some peculiar findings regarding I/O speed as well. To me, it seems that there is something going on in the emulator that skews the results.

 


Huh. I was looking for that conversation and didn't realize it was part of the native assembler thread. 
 

I should probably look at the code in the emulator and see if it's adding a delay... I had that experience with a Tandy Model 100 emulator a few months ago. 

Edited by TomXP411


The emulator shouldn't be adding any delays to file I/O. If you're timing wall-time, you might be seeing efficiency problems in the emulator's implementation, but I'm not sure why they'd go away in the interrupt handler. The interrupt handler looks no different to the emulator code, it's still just running a loop to process CPU instructions and update the state of various emulated subcomponents.

I'm guessing that at least part of the explanation is that doing loads in the interrupt means the loads aren't being interrupted on every vsync when the VERA signals that a new frame has occurred. A 4-second load means 240 interruptions from vsync, during which the kernal does its ordinary per-frame work. If you want to make sure a load outside of the interrupt handler isn't interrupted, insert an SEI instruction before starting it. Just make sure to also have a CLI instruction afterwards.
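The interruption count is easy to sanity-check, and a quick model shows how large the per-frame handler cost would have to be before it dominates a load (the 60 Hz frame rate, 8 MHz clock, and per-frame cycle figures are all assumptions for illustration):

```python
FRAMES_PER_SEC = 60          # assumed VERA vsync rate
CPU_HZ = 8_000_000

load_seconds = 4
interruptions = load_seconds * FRAMES_PER_SEC   # vsyncs during the load
print(f"{interruptions} vsync interruptions")

# Total overhead if the per-frame handler costs C cycles:
for handler_cycles in (500, 5_000, 50_000):
    overhead_s = interruptions * handler_cycles / CPU_HZ
    print(f"{handler_cycles:>6} cycles/frame -> {overhead_s * 1000:.0f} ms added")
```

Unless the per-frame work runs to tens of thousands of cycles, the vsync handler alone can't account for seconds of difference, which is consistent with the SEI/CLI tests barely moving the needle.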


The weird thing is that doing SEI + CLI outside the interrupt handler doesn't save much time. It's still a mystery to me.

3 hours ago, StephenHorn said:

The emulator shouldn't be adding any delays to file I/O. If you're timing wall-time, you might be seeing efficiency problems in the emulator's implementation, but I'm not sure why they'd go away in the interrupt handler. The interrupt handler looks no different to the emulator code, it's still just running a loop to process CPU instructions and update the state of various emulated subcomponents.

I'm guessing that at least part of the explanation is that doing loads in the interrupt means the loads aren't being interrupted on every vsync when the VERA signals that a new frame has occurred. A 4-second load means 240 interruptions from vsync, during which the kernal does its ordinary per-frame work. If you want to make sure a load outside of the interrupt handler isn't interrupted, insert an SEI instruction before starting it. Just make sure to also have a CLI instruction afterwards.

I'll give that a try right now (although it sounds like @stef


Re compression.  Doesn't our nice Kernal have some shiny new decompression routines built-in?

 

13 minutes ago, rje said:

Re compression.  Doesn't our nice Kernal have some shiny new decompression routines built-in?

I saw something about that. It's worth investigating.

So here's the thing: the 9.8KB/s I measured means an average of 160 or so instructions per byte read. If the decompression takes, say, 80 instructions per byte, then it's going to need to compress at 2:1 or better to be worth the effort. I think this will all come down to performance testing (and, of course, a determination of whether the emulator is behaving accurately).
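That break-even argument can be made concrete with a toy cost model (the 160 instructions/byte comes from the measurement above; the 80 instructions/byte decompression cost is a made-up figure for illustration):

```python
def load_cost(file_bytes, ratio, read_ipb=160, decomp_ipb=80):
    """Instructions to read a file compressed at `ratio`:1 and decompress it."""
    return file_bytes / ratio * read_ipb + file_bytes * decomp_ipb

plain = 32 * 1024 * 160   # cost of loading the data uncompressed
for ratio in (1.5, 2.0, 3.0):
    comp = load_cost(32 * 1024, ratio)
    verdict = "faster" if comp < plain else ("break-even" if comp == plain else "slower")
    print(f"{ratio}:1 compression -> {verdict}")
```

With these example numbers the break-even point lands exactly at 2:1, matching the estimate above.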

 

4 hours ago, Stefan said:

The weird thing is that doing SEI + CLI outside the interrupt handler doesn't save much time. It's still a mystery to me.

Same. I got around 2.5-2.7 seconds with interrupts disabled. It was hard to measure accurately, because without interrupts, TI doesn't get updated. 

With interrupts enabled, I get 2.58 seconds. So the overhead of the interrupt handler seems minimal.

Although I'm going to have to revise my earlier numbers. With the simplest possible loop, I'm getting 32K in 2.58 seconds, which adds up to 12.7KB/s. I'll go back and revise my initial post. 

That's still an order of magnitude less than the theoretical maximum, but 12KB/s is better than my original result. I can't explain the difference, though, and that bothers me.

** I figured out the difference. My first test used the READST API. The second time around, I just did an LDA FILE_STATUS, which reads directly from the system RAM location. This is unsafe, as the location may change. 

What surprises me is just how inefficient READST is. It added something like 30% to the overall runtime of the routine. 

Edited by TomXP411


So I stepped through the KERNAL code. There's no mystery here, after all... the FAT driver is just a lot of code. I kind of lost track at 100+ steps in the KERNAL. 

I think performance can be improved by buffering data to RAM, which should bring individual character reads down to about 10 instructions. Obviously there's time involved in buffering each block, but since most of the code is spent setting up a sector read, it's much more efficient to read a whole sector into memory at once. 
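A rough model of what sector buffering might buy (the ~10-instruction/~50-cycle per-byte cost follows the estimate above; the 512-byte sector size and the fixed per-sector setup cost are hypothetical figures):

```python
SECTOR_BYTES = 512
current_cycles_per_byte = 630      # ~8 MHz / 12.7 KB/s, from the measurements
buffered_cycles_per_byte = 50      # ~10 instructions at ~5 cycles each
sector_setup_cycles = 20_000       # hypothetical fixed cost per sector read

# Amortize the sector setup over every byte in the sector:
per_byte = buffered_cycles_per_byte + sector_setup_cycles / SECTOR_BYTES
speedup = current_cycles_per_byte / per_byte
print(f"buffered: ~{per_byte:.0f} cycles/byte (~{speedup:.1f}x faster)")
```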

 

 

Edited by TomXP411

33 minutes ago, TomXP411 said:

** I figured out the difference. My first test used the READST API. The second time around, I just did an LDA FILE_STATUS, which reads directly from the system RAM location. This is unsafe, as the location may change. 

What surprises me is just how inefficient READST is. It added something like 30% to the overall runtime of the routine. 

 

4 minutes ago, TomXP411 said:

So I stepped through the KERNAL code. It's just a lot of code. I kind of lost track at 100+ steps in the KERNAL.

Woah, how'd that happen? I'm digging through disassembled code in the emulator, and if you're starting from ROM bank 0, that should be:


JSR $FFB7 ; 6 cycles, calling READST
JMP $D646 ; 3 cycles, READST jumps into the kernal implementation
LDA $0286 ; 4 cycles
ORA $0286 ; 4 cycles
STA $0286 ; 4 cycles
RTS       ; 6 cycles, returning to the calling code

Edit: If you're not starting from ROM bank 0, say you're in ROM bank 4 (BASIC), there may be hoops it jumps through to trampoline into the ROM implementation. Or there may be another implementation altogether that assumes we're running a BASIC command...
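Summing the cycle counts quoted above gives the total cost of the READST round trip versus reading the status byte directly (LDA absolute is 4 cycles):

```python
# Cycle counts from the disassembly above.
readst_path = {
    "JSR $FFB7": 6,   # call through the vector table
    "JMP $D646": 3,   # jump into the kernal implementation
    "LDA $0286": 4,
    "ORA $0286": 4,
    "STA $0286": 4,
    "RTS": 6,
}
total = sum(readst_path.values())
print(f"READST path: {total} cycles vs 4 cycles for a direct LDA $0286")
```

27 cycles versus 4 is significant per call, though on its own it seems too small to explain a 30% difference in total runtime, so the bank-switching findings later in the thread are relevant here.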

Edited by StephenHorn

2 minutes ago, StephenHorn said:

Woah, how'd that happen? I'm digging through disassembled code in the emulator, and if you're starting from ROM bank 0, that should be

Crud. Now I need to test both methods again. I am thinking I might have recorded the result wrong on my first test. 

I do have to wonder what that code is doing, though... why OR and save the value back? ORing a number with itself is just... the same number. 

 

Edited by TomXP411

4 minutes ago, TomXP411 said:

Crud. Now I need to test both methods again. I am thinking I might have recorded the result wrong on my first test. 

 

 

An easy thing to miss in ASM projects is that "RUN" on a SYS command still leaves you in ROM bank 4 to run the BASIC command. So an ASM-based program probably wants to set its ROM bank to 0 manually, fairly early in execution. This also makes the interrupt handler much, much lighter, since it doesn't rely on BASIC code to catch the interrupt and then trampoline into proper kernal handling.

That said, I have no idea whether it's possible to exit gracefully after that by resetting the ROM bank to 4 before your program's final RTS instruction.

Edited by StephenHorn


Thanks, I'll keep that one in mind. This particular test is bookended in BASIC code to manage the timing.

Anyway, I looked at the original Commodore ROM, and it's the same. However, there is an additional label there, which updates the status byte. So it looks to me like someone decided to save some ROM space by cramming the read and update code together. 

https://www.pagetable.com/c64ref/c64disasm/#FE07

And yes - it really does add that much time: 208 jiffies using JSR READST and 155 jiffies using LDA $0286.

I don't think it's worth worrying about for small files, but when you're loading 100+KB into banked RAM, I think that will make a noticeable difference. So it's something to bear in mind. 

On the bright side, I don't think there's actually anything to lose by reading past EOF... you'll just get back nulls, and if you expect a certain block size in your file, that's not an issue. So calling READST once every block, rather than once every byte, is going to save some CPU time. 
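Re-expressing those two timings (jiffies taken as 1/60 s):

```python
JIFFIES_PER_SEC = 60
size = 32 * 1024

for label, jiffies in (("JSR READST", 208), ("LDA $0286", 155)):
    secs = jiffies / JIFFIES_PER_SEC
    print(f"{label}: {secs:.2f} s, {size / secs / 1000:.1f} KB/s")

# Relative cost of going through the READST vector:
overhead = (208 - 155) / 155
print(f"READST adds {overhead:.0%} to the runtime")
```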

 

Edited by TomXP411

Ah, I was wondering what was up with the ORA and STA before getting back to the RTS. That makes sense, and allows them to save a byte from SETMSG as well, since that too runs straight through READST and UDST. Clever. The academic in me wonders if they secretly depend on that execution path anywhere in the kernal. The horrified engineer in me is wondering if those dependencies are documented. 😛

33 minutes ago, StephenHorn said:

Ah, I was wondering what was up with the ORA and STA before getting back to the RTS. That makes sense, and allows them to save a byte from SETMSG as well, since that too runs straight through READST and UDST. Clever. The academic in me wonders if they secretly depend on that execution path anywhere in the kernal. The horrified engineer in me is wondering if those dependencies are documented. 😛

It's a run-on from SETMSG, which you wouldn't want to change. So we'd need a new API entry. 

However, I'm thinking the smart thing is to avoid the vector table and ROM completely and just read the STATUS byte directly from RAM. Of course, to do that, we need to know the address of STATUS. So I'm thinking the best idea is a new API that gives us a pointer to STATUS. 

Get Status Pointer, or GETSTPTR

This would take X as input and store the address of the STATUS variable at zero page bytes X and X+1, at whatever location the user wants:

GETSTPTR:
    LDA #<STATUS  ; low byte of STATUS's address
    STA 0,X
    LDA #>STATUS  ; high byte
    STA 1,X
    RTS

To call the setup routine in user code:

LDX #MYSTATUS
JSR GETSTPTR

and then programs can simply query STATUS with...

Loop:
    JSR GETIN
    ; do stuff
    LDA (MYSTATUS)  ; 65C02 zero page indirect
    BEQ Loop


 

 

Edited by TomXP411

6 hours ago, StephenHorn said:

I have no idea whether it's possible to exit gracefully after that by resetting the ROM bank to 4 before your program's final RTS instruction.

It is. That's what I do in most of my programs, and it works just fine, as long as I set the bank back to 4 before the RTS that takes us back to BASIC. However, certain RAM and VRAM states may complicate things, so some additional cleanup may be needed.


Interesting stuff. 

I have three test programs to verify this, all reading the 73 kB source code file provided by @desertfish in the native assembly thread.

  1. The original test program, interrupt disabled
  2. The same + changing ROM bank to 0 at start
  3. As no. 2, but also reading the status directly from memory (status address = $0286; hope I didn't get that mixed up).

I clocked these manually (average of three runs per test program):

  1. 7.8 s (7.76 s + 7.76 s + 7.75 s) => 9.3 kB/s
  2. 4.4 s (4.33 s + 4.30 s + 4.42 s) => 16.6 kB/s (78 % faster)
  3. 4.1 s (4.12 s + 4.07 s + 4.12 s) => 17.8 kB/s (91 % faster)

The bank switching really seems to be the culprit. The READST function also hurts performance, but not nearly as much as the bank switching.
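A quick re-derivation of those throughput figures (decimal kB, which the post appears to use; results agree with the quoted numbers to within rounding):

```python
SIZE_KB = 73   # the test file from the native assembly thread
averages = {
    "1. interrupts disabled": 7.8,
    "2. + ROM bank 0": 4.4,
    "3. + direct status read": 4.1,
}
rates = {label: SIZE_KB / secs for label, secs in averages.items()}
for label, rate in rates.items():
    print(f"{label}: {rate:.1f} kB/s")
```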

test_banking_status.asm test_banking.asm test.asm


A side note.

The ROM version of X16 Edit I'm working on cannot avoid bank switching when reading files. The code is in ROM bank 7, and for each byte it reads it must go to the Kernal in ROM bank 0.

However, I'm not using the Kernal JSRFAR function, but my own minimal bank switching code stored in low RAM ($0400-7FFF). Opening the same 73 kB file is done in about 5.3 s => 13.8 kB/s. The editor RAM version doesn't have to do the bank switching, and is a bit faster, loading the file in about 4.5 s => 16.2 kB/s.

This is the bank switching code. Before calling this routine, the address in jsr $ffff needs to be changed to the address you actually want to call. I use a macro to make this self-modification safe.

bridge_kernal:
    stz ROM_SEL   ; Kernal is ROM bank 0
    jsr $ffff     ; $ffff is just a placeholder
    pha
    lda #ROM_BNK  ; Set ROM select to our bank again
    sta ROM_SEL
    pla
    rts           ; 14 bytes
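From the two timings, the per-byte cost of the bridge can be roughly estimated (8 MHz clock assumed; this crude estimate attributes the entire 0.8 s difference to the bank-switch bridge):

```python
CPU_HZ = 8_000_000
size_bytes = 73_000          # the 73 kB test file, decimal kB assumed
rom_seconds, ram_seconds = 5.3, 4.5

# Extra wall time per byte attributable to the bridge:
extra_per_byte = (rom_seconds - ram_seconds) / size_bytes
print(f"~{extra_per_byte * 1e6:.0f} us (~{extra_per_byte * CPU_HZ:.0f} cycles) per byte")
```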

 

On 1/15/2021 at 10:57 PM, Stefan said:

A side note.

The ROM version of X16 Edit I'm working on cannot avoid bank switching when reading files. The code is in ROM bank 7, and for each byte it reads it must go to the Kernal in ROM bank 0.

However, I'm not using the Kernal JSRFAR function, but my own minimal bank switching code stored in low RAM ($0400-7FFF). Opening the same 73 kB file is done in about 5.3 s => 13.8 kB/s. The editor RAM version doesn't have to do the bank switching, and is a bit faster, loading the file in about 4.5 s => 16.2 kB/s.

This is the bank switching code. Before calling this routine, the address in jsr $ffff needs to be changed to the address you actually want to call. I use a macro to make this self-modification safe.

bridge_kernal:
    stz ROM_SEL   ; Kernal is ROM bank 0
    jsr $ffff     ; $ffff is just a placeholder
    pha
    lda #ROM_BNK  ; Set ROM select to our bank again
    sta ROM_SEL
    pla
    rts           ; 14 bytes

 

You might save some time by reading larger blocks, or by copying your I/O code to low RAM when the editor starts. 

18 minutes ago, TomXP411 said:

You might save some time by reading larger blocks, or by copying your I/O code to low RAM when the editor starts. 

Yes, copying the read loop to RAM is a good option if you want to avoid bank switching and improve performance. It may be sufficient to copy just the inner loop that runs for every byte, and accept that the other parts of the read function, which run less often, stay in the other ROM bank.

I would not want to gain performance by skipping the READST call (or a direct read of the status byte, if that can be done safely) on every iteration of the read loop.

 

On 1/15/2021 at 3:55 AM, TomXP411 said:

I was curious how fast the Commander can actually read files from disk, so I fired up the emulator and did some quick tests. 

I wrote a very simple assembly routine that lives in Golden RAM (the section at $400), and timed it to see how fast it ran. 

With a 32K file, it takes 155 ticks, or 2.58 seconds. That makes file read throughput 12.7KB/s. 

Hum... I'm not OK with your figures.
I took your asm with a 32K file and, using the internal ticks counter, I got 27 502 015 ticks for the asm part alone.
That means, for an 8MHz CPU:
27 502 015 / 8 / 1000 = 3437ms ~ 3.44s, which makes ~ 9.52KB/s
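A quick re-derivation of that arithmetic (ticks taken as CPU cycles at 8 MHz, KB decimal as in the post):

```python
ticks = 27_502_015
CPU_HZ = 8_000_000

seconds = ticks / CPU_HZ     # ~3.44 s
rate = 0x8000 / seconds      # $8000 = 32768 bytes read
print(f"{seconds:.2f} s, {rate / 1000:.2f} KB/s")
```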

I did it again with a little change on your code in order to check how many bytes were read
[screenshot: the modified test code]

No surprise, as I got $8000 bytes read.
It took around 3.47s
[screenshot: test run output]

Now if I try to do the same without the old C64 kernal layer, calling the fat32 code directly, I get a slightly better figure 😉

[screenshot: fat32 direct-call timing]

 

Edited by kktos

