Johan Kårlin

Updating screen on vblank takes too long

Question

In my ongoing game project, I am struggling with the scrolling routines. I use the tilemap as a window into a much larger game world. Right now I have a solution that is short and clean but comes at the cost of having to update the whole visible part of the tilemap at certain times. To do this as fast as possible at vertical blank, I prepare a tile buffer, so all I have to do is copy 672 bytes from the buffer to VRAM. But the routine is not fast enough. It takes about 24 scanlines, which of course causes tearing in the upper part of the screen. If the screen is updated 60 times per second, the display is about 250 scanlines (I think there are a few more than those actually seen) and the emulator runs at 8 MHz, there should be 8 000 000 / 250 / 60 = 533 cycles for every scanline. That is not very much (but a lot more than on a C64, I know : )).

I am thinking of different solutions:

1) I can rewrite the scrolling routines but I would be happy not to. They get complicated and messy real quick.

2) I can update the first rows of the tilemap directly when the tile buffer is prepared. At that point the screen updating has passed the upper part of the screen. And then I update the rest at vblank. Not a completely satisfying solution...

3) Or is there some other clever solution? How many cycles do I really have between vblank and the moment when the top-left pixel of the screen is updated?

Below is the code for updating the tilemap:

;update topleftmost 21x16 tiles of 32x32 tilemap
        lda #<_tilebuffer       ;set tilebuffer pointer
        sta ZP0
        lda #>_tilebuffer
        sta ZP1

        lda #<L0_MAP_ADDR       ;set tilemap pointer
        sta VERA_ADDR_L
        lda #>L0_MAP_ADDR
        sta VERA_ADDR_M
        lda #$10
        sta VERA_ADDR_H 

        ldy #16                 ;16 rows
--      ldx #21                 ;21 columns of 2 bytes each

-       lda (ZP0)               ;copy tile from buffer to tilemap
        sta VERA_DATA0          ;write first byte of column
        inc ZP0
        bne +
        inc ZP1
+       lda (ZP0)
        sta VERA_DATA0          ;write second byte of column
        inc ZP0
        bne +
        inc ZP1
+       dex
        bne -

        clc                     ;add (32-21)*2 = 22 bytes to get addr for next tilemap row
        lda #22
        adc VERA_ADDR_L
        sta VERA_ADDR_L
        lda VERA_ADDR_M
        adc #0
        sta VERA_ADDR_M
        dey
        bne --


24 answers to this question


  • 0

r37 has a minor bug that, for games' sake, we really ought to fix. The VERA emulation counts scanlines starting at the beginning of the VGA front porch, then after the front porch it begins drawing the 480 visible lines, then counts scanlines through the end of the VGA back porch... and only then triggers the vblank interrupt. This timing is incorrect: the VERA hardware will trigger the vblank interrupt at the proper start of vblank. (Also: Crud! I keep forgetting about this.)

What this means is that the emulator isn't giving you as much vblank time as you should have. Since line IRQs are also broken on r37, there is presently no workaround.

I should check whether there's an issue about this on GitHub yet... and make one if there isn't. It would be a good first issue for someone interested in contributing.

  • 0

That’s good news. Then maybe there will be time to copy my 672 bytes after all. Do you have any estimate of how many CPU cycles there will be between vblank and the moment when the first visible line is updated, once everything works?

I have noticed that if I do the update directly in the IRQ handler, I have a few lines before entering the visible part of the screen. But if I use WAI and do the update in the main thread, I am already on the second visible line when my code starts to run...

  • 0

You should have roughly 11,428 cycles with a full vblank, but the timing of the VERA emulation is only giving you around 5,587 cycles.

Specifically, the number of cycles per scanline is 8,000,000 / 525 / 60, or about 253.968. You should have 45 scanlines' worth of cycles during vblank (525 scanlines/frame, minus 480 visible lines), but the timing issue means you only get 22 scanlines' worth of cycles.
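
As a quick check of the arithmetic above, here is a minimal Python sketch. The constant and function names are made up for illustration; the 8 MHz clock, 525 lines per frame, 60 frames per second and the 22-line figure for r37 are the values quoted in this thread.

```python
# Cycle budget per scanline and per vblank, using the figures quoted above.
CPU_HZ = 8_000_000        # emulated CPU clock
FRAMES_PER_SECOND = 60
TOTAL_LINES = 525         # full 640x480 VGA frame, including blanking
VISIBLE_LINES = 480

cycles_per_line = CPU_HZ / TOTAL_LINES / FRAMES_PER_SECOND


def vblank_cycles(blank_lines: int) -> float:
    """CPU cycles available during the given number of blanked scanlines."""
    return blank_lines * cycles_per_line


full_vblank = vblank_cycles(TOTAL_LINES - VISIBLE_LINES)  # 45 blanked lines
r37_vblank = vblank_cycles(22)                            # what r37 reportedly leaves

print(f"{cycles_per_line:.3f} cycles per scanline")
print(f"{full_vblank:.0f} cycles in a full vblank")
print(f"{r37_vblank:.0f} cycles in the shortened r37 vblank")
```

For comparison, at roughly 20 cycles per byte (the estimate given further down the thread), copying 672 bytes costs about 13,440 cycles, which is more than even a full vblank.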

  • 0

Something else to bear in mind, as well, is that even if you adjust DC_VSCALE to draw the screen as if it's 240 lines tall, the VERA is still operating in 640x480 mode. It's just scaling the internal graphics, not actually changing the resolution and pixel clock.

  • 0
1 hour ago, StephenHorn said:

r37 has a minor bug that, for games' sake, we really ought to fix. The VERA emulation counts scanlines starting at the beginning of the VGA front porch, then after the front porch it begins drawing the 480 visible lines, then counts scanlines through the end of the VGA back porch... and only then triggers the vblank interrupt. This timing is incorrect: the VERA hardware will trigger the vblank interrupt at the proper start of vblank. (Also: Crud! I keep forgetting about this.)

What this means is that the emulator isn't giving you as much vblank time as you should have. Since line IRQs are also broken on r37, there is presently no workaround.

I should check whether there's an issue about this on GitHub yet... and make one if there isn't. It would be a good first issue for someone interested in contributing.

I'm having a little trouble picturing that in my head. When I look at a diagram of the front/back porch, it seems like your terms are reversed, or I'm just not getting it...

[Image: timing diagram showing the front and back porch]

If I'm reading this correctly, the front porch is at the end of the frame, and the back porch is at the beginning. (This is backward from common sense,  but that's because the porches surround the sync pulse, not the video information.)

So I think I understand what you're saying, but if so, I think the terms you're using are backward. 

  1. Trigger Vertical Sync
  2. Begin Back Porch
  3. End Back Porch
    1. Clear vblank
    2. Reset line counter
  4. Draw 480 lines
  5. End of visible frame
    1. Trigger VBlank Interrupt
    2. Assert vblank
  6. Begin Front Porch

https://forums.blurbusters.com/viewtopic.php?t=3248

This makes me think of another question: will the line counter continue into vblank? Or does it stop when the vblank starts (holding at 480)?

 

 

  • 0

I very likely have front porch/back porch reversed. I'm relatively new to this stuff as well!

@Frank van den Hoef would have to answer your question about whether the line counter continues into vblank. The VERA emulation in r37 won't trigger the line IRQ after drawn lines, but if you build your own from the current github code (which has fixed the line IRQs), you can set a line IRQ on the last visible line and get the approximate timing of when the vsync interrupt should occur.

  • 0

My game world (or whatever it should be called) is 256x256 tiles, but I can’t have a tilemap that big. VERA can handle it, but it would take 128 KB. Instead I use a 32x32 tilemap where the top-left 21x16 tiles (= 336x256 pixels) form a window that shows the part of the 256x256 map that should currently be displayed. This means I can scroll 15 pixels but then have to update the whole 21x16 window. To avoid this, I could instead just add a tile column or row (depending on the direction I am scrolling) to the 32x32 tilemap and let the tilemap wrap around when reaching the end. But that involves considerably more code, which also gets rather complicated. One reason for this is that I scroll the map at any given angle. So I am eager to see if I can stay with the current solution even if it’s time-inefficient.
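
The sizes mentioned above can be reproduced with a few lines of Python. This is only a sketch: the two bytes per map entry come from the copy loop in the question, and the 16-pixel tile size is inferred from 336 pixels / 21 tiles rather than stated outright.

```python
# Memory and pixel sizes for the world map, the VERA tilemap and the visible window.
BYTES_PER_TILE = 2   # each map entry is written as two bytes in the question's code
TILE_PX = 16         # inferred: 336 px across 21 tiles


def map_bytes(cols: int, rows: int) -> int:
    """Size in bytes of a tilemap with the given dimensions."""
    return cols * rows * BYTES_PER_TILE


world_bytes = map_bytes(256, 256)        # the full game world as one tilemap
tilemap_bytes = map_bytes(32, 32)        # the tilemap actually used
window_bytes = map_bytes(21, 16)         # the part copied on every full update
window_px = (21 * TILE_PX, 16 * TILE_PX)

print(world_bytes // 1024, "KB for the whole world")
print(tilemap_bytes, "bytes for the 32x32 tilemap")
print(window_bytes, "bytes for the 21x16 window,", window_px, "pixels")
```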

  • 0

I don't think that updating the entire screen worth of tiles is the right thing to do. Letting the tile map wrap around is. I don't see why it would be so complex, it's all very simple arithmetic, every multiplication and division is with some power of 2. When you scroll past one row, you simply copy new data into the next available row. The same goes for columns. In fact, those two things you can just do one right after the other. If both conditions are true, only one tile (2 bytes) would be written twice. And the best thing is, you can do the copying any time since you're updating tiles that aren't visible on the screen.

  • 0
Posted (edited)
14 minutes ago, Guybrush said:

I don't think that updating the entire screen worth of tiles is the right thing to do. Letting the tile map wrap around is. I don't see why it would be so complex, it's all very simple arithmetic, every multiplication and division is with some power of 2. When you scroll past one row, you simply copy new data into the next available row. The same goes for columns. In fact, those two things you can just do one right after the other. If both conditions are true, only one tile (2 bytes) would be written twice. And the best thing is, you can do the copying any time since you're updating tiles that aren't visible on the screen.

That doesn't work - you can only scroll so far using that method. If you kept scrolling left, for example, you'd eventually walk right out of address space in VRAM. Eventually, all of the data on the screen needs to be moved.

@Johan Kårlin

The problem is that this simply can't all be done in vblank. With a 32x32 map, this is 1K of data. It takes roughly 20 cycles to move one byte of RAM (read, write, count, test), so including setup work for each row, you're talking about 24K cycles.

However, this still takes far less than one 60Hz frame. What I'd suggest is using two pages: instead of copying the scrolled data to a new location in the same buffer, copy it all to a second buffer, offset by 1K from the first one. This not only allows for a clean swap (which CAN be done on vblank), but you have the advantage of scrolling any direction without needing to worry about which direction you're copying data.

So each time you scroll more than 1 tile's worth of pixels, you swap your display and back buffers. 
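
A toy Python model of the two-page idea might look like the sketch below. Everything in it is hypothetical (the class, its method names, and the `bytearray` standing in for VRAM); it is not a VERA API. The page size is computed as 32 x 32 entries of two bytes each, matching the two-byte writes in the question's code, so adjust it if your layout differs.

```python
# Double buffering: draw into the hidden page at any time, swap pages at vblank.
MAP_COLS, MAP_ROWS, BYTES_PER_TILE = 32, 32, 2
PAGE_BYTES = MAP_COLS * MAP_ROWS * BYTES_PER_TILE


class DoubleBufferedMap:
    """Two tilemap pages in a fake VRAM (hypothetical helper for illustration)."""

    def __init__(self, base: int = 0) -> None:
        self.vram = bytearray(base + 2 * PAGE_BYTES)  # stand-in for video RAM
        self.page_addr = (base, base + PAGE_BYTES)
        self.front = 0  # index of the page currently displayed

    @property
    def back(self) -> int:
        """Index of the page that is not displayed."""
        return 1 - self.front

    def write_tile(self, col: int, row: int, lo: int, hi: int) -> None:
        """Write one two-byte entry into the hidden page; safe outside vblank."""
        addr = self.page_addr[self.back] + (row * MAP_COLS + col) * BYTES_PER_TILE
        self.vram[addr] = lo
        self.vram[addr + 1] = hi

    def flip(self) -> int:
        """Swap pages and return the address of the page now shown.

        On hardware this would correspond to a single map-base update
        performed during vblank.
        """
        self.front = self.back
        return self.page_addr[self.front]


screen = DoubleBufferedMap()
screen.write_tile(0, 0, 0x12, 0x00)   # draw into the hidden page
shown = screen.flip()                 # the freshly drawn page becomes visible
```

The point of the design is that the expensive part (filling the hidden page) no longer has to fit into vblank; only the swap does.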

 

 

 

Edited by TomXP411

  • 0
6 hours ago, StephenHorn said:

I very likely have front porch/back porch reversed. I'm relatively new to this stuff as well!

👍 I just wanted to make sure I understood what was happening. 😃

 

  • 0
19 minutes ago, TomXP411 said:

That doesn't work - you can only scroll so far using that method. If you kept scrolling left, for example, you'd eventually walk right out of address space in VRAM. Eventually, all of the data on the screen needs to be moved.

That's most certainly not true.

The tile map wraps around when scrolled. So, when you're nearing the left edge of the tile map, you simply load your data into the next column to the left. Once you're at the left edge, you load your data into the rightmost column. The same goes for any direction. You just need to keep a record of which portion of your big map you're currently showing.

For instance, if you have a big map of, say, 256*64 and a tile map of 32*16, then col 0 of your map would be copied into col 0 of the tile map, col 1 into col 1, etc. Col 32 of your map would be copied into col (32 bitwise-and 31) = col 0 of the tile map. The same goes for rows.
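
The mapping described above can be sketched in Python as follows. The sizes are the ones from the example (a 256-column world and a 32x16 tile map); the function names and the `column_to_refill` helper are hypothetical and only illustrate the bookkeeping.

```python
# Wrap-around mapping from world coordinates to tile map coordinates.
WORLD_COLS = 256     # width of the big map in the example
TILEMAP_COLS = 32    # must be a power of two for the AND trick to work
TILEMAP_ROWS = 16


def tilemap_col(world_col: int) -> int:
    """Tile map column that holds the given world column."""
    return world_col & (TILEMAP_COLS - 1)


def tilemap_row(world_row: int) -> int:
    """Tile map row that holds the given world row."""
    return world_row & (TILEMAP_ROWS - 1)


def column_to_refill(left_world_col: int, window_cols: int, scrolling_right: bool):
    """Return (world column, tile map column) to load after scrolling one tile.

    left_world_col is the leftmost visible world column before the scroll.
    """
    if scrolling_right:
        world_col = left_world_col + window_cols
    else:
        world_col = left_world_col - 1
    world_col &= WORLD_COLS - 1          # the world itself wraps as well
    return world_col, tilemap_col(world_col)


print(tilemap_col(32))                   # world column 32 lands in tile map column 0
print(column_to_refill(12, 21, True))    # scrolling right from world column 12
print(column_to_refill(0, 21, False))    # scrolling left past the world's edge
```

Because the new column is always outside the visible window, it can be written at any time, not just during vblank.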

  • 0

Huh. I'll have to take another look at that when the new version of the emulator hits. The last time I played with VERA registers, it seemed like scrolling too far pushed stuff down a line... but the number of changes and amount of time since the last time I played with it may have colored my memory. 

  • 0

If the emulator wraps around without pushing everything down a line - then we can update just one row off-screen. That would be neat. I still have to implement my first scrolling, looking forward to that.

  • 0
17 hours ago, StephenHorn said:

I very likely have front porch/back porch reversed. I'm relatively new to this stuff as well!

@Frank van den Hoef would have to answer your question about whether the line counter continues into vblank. The VERA emulation in r37 won't trigger the line IRQ after drawn lines, but if you build your own from the current github code (which has fixed the line IRQs), you can set a line IRQ on the last visible line and get the approximate timing of when the vsync interrupt should occur.

The line counter is active at all times. The (9-bit) line counter is incremented by 1 in progressive mode, and by 2 in interlaced mode. In progressive mode a line IRQ is produced when the 9-bit line number matches the 9-bit line IRQ register. In interlaced mode a line IRQ is produced when the upper 8 bits of the 9-bit line number match the upper 8 bits of the line IRQ register.

The video timing has the active portion at lines 0-479, then 10 lines of front porch, 2 lines of v-sync and 33 lines of back porch, for a total of 525 lines. Rendering is started one line earlier, so at line 524. Composing the outputs of the renderers (2 layer renderers and 1 sprite renderer) is performed while outputting the pixel data. E.g. the line buffers rendered at line 524 are displayed (and composed) on line 0.
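
To make the frame layout above concrete, here is a small Python sketch that classifies a scanline number. The line counts are the ones given in this post; the function and constant names are made up for illustration.

```python
# Vertical timing regions of one 525-line frame, as described above.
ACTIVE_LINES = 480
FRONT_PORCH_LINES = 10
VSYNC_LINES = 2
BACK_PORCH_LINES = 33
TOTAL_LINES = ACTIVE_LINES + FRONT_PORCH_LINES + VSYNC_LINES + BACK_PORCH_LINES


def region(line: int) -> str:
    """Name the part of the frame that the given scanline belongs to."""
    if not 0 <= line < TOTAL_LINES:
        raise ValueError(f"line must be in 0..{TOTAL_LINES - 1}")
    if line < ACTIVE_LINES:
        return "active"
    if line < ACTIVE_LINES + FRONT_PORCH_LINES:
        return "front porch"
    if line < ACTIVE_LINES + FRONT_PORCH_LINES + VSYNC_LINES:
        return "vsync"
    return "back porch"


for line in (0, 479, 480, 490, 492, 524):
    print(line, region(line))
```

Everything from line 480 through line 524 is blanking, which is where the 45 lines of vblank time mentioned earlier come from.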

Hope this helps.

  • 0
I don't think that updating the entire screen worth of tiles is the right thing to do. Letting the tile map wrap around is. I don't see why it would be so complex

The full story is that in order to manage the 256x256 game world map, I use what I call blocks, each consisting of 8x8 tiles. The world is first defined by a 32x32 blockmap. Each block has its own 8x8 tilemap. In other words, an arbitrary pixel position has a position in a tile that has a position in a block that has a position in the blockmap. I suppose what’s complicated is a matter of preference, but I think this makes it a bit cumbersome to work out which tiles to actually draw on the tilemap, regardless of whether it’s about updating the whole tile window or just a column or a row. As I mentioned, another complexity is that I scroll at any angle (0-360 degrees). This means that updating involves a top row, a bottom row, a left column, a right column, or both a row and a column, or a combination where one or both of the row and column have to be shifted a tile in either direction. 40 degrees and similar directions are the hardest. Then you have to keep track of when the 32x32 tilemap wraps around and when the 256x256 world map wraps around. Another thing is debugging. If you’re lucky you can figure out what’s wrong by just studying what is displayed on the screen. But often I have had to use the debugger, and when debugging I am really happy if I can get rid of some nested loops and levels of abstraction.
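
The pixel-to-tile-to-block chain described above could be modeled like this in Python. It is a sketch under assumptions: the 16-pixel tile size is inferred from the 336x256 window of 21x16 tiles, and the data layout (`blockmap[row][col]` holding a block id, `blocks[id][row][col]` holding a tile) is hypothetical, not the author's actual format.

```python
# Decompose a world pixel coordinate into block, tile-in-block and pixel-in-tile.
TILE_PX = 16        # assumption: 336 px / 21 tiles
BLOCK_TILES = 8     # a block is 8x8 tiles
MAP_BLOCKS = 32     # the blockmap is 32x32 blocks
WORLD_PX = TILE_PX * BLOCK_TILES * MAP_BLOCKS   # 4096 pixels per axis


def decompose(pos: int) -> tuple:
    """Split one pixel coordinate into (block, tile in block, pixel in tile)."""
    pos %= WORLD_PX                 # the world map wraps around
    tile = pos // TILE_PX           # tile index in the 256-tile-wide world
    return tile // BLOCK_TILES, tile % BLOCK_TILES, pos % TILE_PX


def tile_at(blockmap, blocks, x: int, y: int):
    """Look up the tile under world pixel (x, y) through the blockmap."""
    block_col, tile_col, _ = decompose(x)
    block_row, tile_row, _ = decompose(y)
    block_id = blockmap[block_row][block_col]
    return blocks[block_id][tile_row][tile_col]


print(decompose(471))    # block 3, tile 5 within the block, pixel 7 within the tile
print(decompose(-1))     # one pixel left of the origin wraps to the far edge
```

Since every size is a power of two, the divisions and remainders reduce to shifts and AND masks on the 6502.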

 

BUT I have actually implemented this alternative and made it work, apart from one bug related to when the world map wraps around. It was when I was trying to fix that bug that I began to think that I might be overdoing this. If I have both CPU time and memory left, why not spend it? My current solution is about half the code size and I find it much easier to follow and debug. I (1) update the whole screen, (2) scroll until the end of the current tile, then start over again. There is one subroutine for both initializing the screen and updating it.

Of course, if in the end this is too time-consuming, it is of no use. That’s what I am trying to figure out. Thanks to several contributors to this thread, I have realized that I will never have time to copy enough memory at vblank, but the double buffering that @TomXP411 suggests should work! Thanks! I can easily afford the size of an extra 32x32 tilemap and the code gets even simpler. I can skip the intermediate tilebuffer and write directly to the tilemap that is not currently displayed.

  • 0

Just for kicks, I was reviewing the code snippet that you provided in your question, as a way to brush the cobwebs off of my ancient 6502 thinking. Please correct me if I am wrong, but if the location of the tile buffer is on an even memory address (lower bit is zero), then the first "bne +" instruction can be omitted, since it would always branch. Along with that, you would remove the "inc ZP1", because it would never be executed. Doing this would save 4 clock cycles in every inner loop (multiplied by 16 and then by 21). If I am correct, you can save 1344 clock cycles just by removing that "bne" instruction. Do you agree, or are my cobwebs still too thick?

  • 0
On 8/5/2020 at 2:45 PM, Johan Kårlin said:

Of course, if in the end this is too time-consuming, it is of no use. That’s what I am trying to figure out.

I realize that you determined a better way to do things (i.e., to use double buffering), but I was wondering whether you got a chance to see my comment about your sample code, and the "bne/inc" instruction pair. Do you agree with my evaluation?

  • 0

Oh, sorry, I missed your comment. Thanks for taking the time to read the code! But no, it will not work. I think you interpret BNE as testing whether a number is even or not. It only tests whether it is zero. What I basically do is add 1 to a 16-bit number (in this case an address pointer). The easiest way to do this is to increase the low byte until it wraps from $ff to $00; at that point you increase the high byte and start over again with the low byte. Do I understand you right? I am also trying to get a grasp of 6502 assembler. It’s been a while... and I haven’t really begun to learn until now.

  • 0

pastblast meant that, if the starting address is even, then it always takes an even number of increments to make the low byte overflow. (If the address starts at $FE, then one increment cannot overflow it; only two increments can.) Therefore, the odd-numbered increments don't need to be tested.
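
The parity argument can be checked by brute force. The Python sketch below (with a hypothetical helper name) simulates incrementing the low byte of the pointer and records on which increments it wraps to zero; 672 is the number of bytes copied by the loop in the question.

```python
# On which increments does the low byte of a pointer wrap from $FF to $00?
def wrap_steps(start_low: int, count: int) -> list:
    """Return the 1-based increment numbers at which the low byte becomes 0."""
    low = start_low & 0xFF
    wraps = []
    for step in range(1, count + 1):
        low = (low + 1) & 0xFF
        if low == 0:
            wraps.append(step)
    return wraps


# With an even start address, every wrap falls on an even-numbered increment.
even_starts_ok = all(
    step % 2 == 0
    for start in range(0, 256, 2)
    for step in wrap_steps(start, 672)
)

print(wrap_steps(0xFE, 672))   # wraps only on even-numbered increments
print(even_starts_ok)
```

In the original loop the first BNE follows the odd-numbered increments, so with an even buffer address it would never see a zero result.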

Another -- even bigger -- improvement in speed can be gained by using an indexed addressing mode. Registers are incremented faster than zero-page memory.

;Update top leftmost 21x16 tiles of 32x32 tilemap.
        lda #<_tilebuffer       ;set tilebuffer pointer
        sta ZP0
        lda #>_tilebuffer
        sta ZP0+1

        lda #<L0_MAP_ADDR       ;set tilemap pointer
        sta VERA_ADDR_L
        lda #>L0_MAP_ADDR
        sta VERA_ADDR_M
        lda #$10
        sta VERA_ADDR_H

        ldy #16                 ;16 rows
        sty COUNT

        ldy #$00
--      ldx #21                 ;21 columns of 2 bytes each

-       lda (ZP0),y             ;copy tile from buffer to tilemap
        sta VERA_DATA0          ;write first byte of column
        iny
        lda (ZP0),y
        sta VERA_DATA0          ;write second byte of column
        iny
        bne +
        inc ZP0+1
+       dex
        bne -

        clc                     ;add (32-21)*2 = 22 bytes to get addr. for next tilemap row
        lda #<22
        adc VERA_ADDR_L
        sta VERA_ADDR_L
        lda VERA_ADDR_M
        adc #>22
        sta VERA_ADDR_M
        dec COUNT
        bne --
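
For a rough feel of what the two optimizations buy, the Python sketch below adds up inner-loop cycles for the three variants discussed in this thread. The per-instruction costs are taken from standard 65C02 timing tables and should be treated as assumptions; the model ignores page-crossing penalties, the rare high-byte increment, the not-taken branch at the end of each row, and the per-row setup code.

```python
# Approximate inner-loop cost per tile for the three versions of the copy loop.
CYCLES = {
    "lda (zp)": 5, "lda (zp),y": 5, "sta abs": 4, "inc zp": 5,
    "iny": 2, "dex": 2, "bne taken": 3,
}

ORIGINAL = ["lda (zp)", "sta abs", "inc zp", "bne taken",
            "lda (zp)", "sta abs", "inc zp", "bne taken",
            "dex", "bne taken"]
# Same loop with the first BNE dropped (valid for an even buffer address).
NO_FIRST_BNE = ["lda (zp)", "sta abs", "inc zp",
                "lda (zp)", "sta abs", "inc zp", "bne taken",
                "dex", "bne taken"]
# Indexed version: Y steps through the buffer instead of a zero-page pointer.
INDEXED = ["lda (zp),y", "sta abs", "iny",
           "lda (zp),y", "sta abs", "iny", "bne taken",
           "dex", "bne taken"]

TILES = 21 * 16   # tiles copied per full window update


def loop_cycles(body, iterations: int = TILES) -> int:
    """Total cycles for running the given loop body once per tile."""
    return sum(CYCLES[op] for op in body) * iterations


for name, body in (("original", ORIGINAL),
                   ("first bne removed", NO_FIRST_BNE),
                   ("indexed", INDEXED)):
    print(f"{name}: {loop_cycles(body)} cycles")
```

Under these timings the dropped branch is worth 3 cycles per tile (a taken BNE), and the indexed loop saves a further 6 cycles per tile on top of that.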

  • 0

You're absolutely right! Finally I am getting it! And indexing makes it even better. Many thanks to both of you! This is exactly why I included the code, to see if someone could point out the weak points. I am not using this exact code anymore, but I use the same slow code in other places. Good to know it can be optimized. When updating the whole screen I have about half of the available CPU time left. No problem at this point, but if the X16 ends up with a clock frequency of 4 MHz I might run into trouble...

Edited by Johan Kårlin

  • 0
6 hours ago, Greg King said:

pastblast meant that, if the starting address is even, then it always takes an even number of increments to make the low byte overflow. (If the address starts at $FE, then one increment cannot overflow it; only two increments can.) Therefore, the odd-numbered increments don't need to be tested.

Another -- even bigger -- improvement in speed can be gained by using an indexed addressing mode. Registers are incremented faster than zero-page memory.

 

Greg, thanks for helping to clarify the point that I was making! 👍

Johan, I think you now understand that I know how BNE works (tests for zero, not for odd/even). 😉

  • 0
4 minutes ago, Johan Kårlin said:

I am sorry that I underestimated your knowledge. I am the grateful learner here.

Hey, no problem at all. I'm happy to help. In my 40 year career (thus far), about 25 years is in embedded systems, on many different CPU platforms (Z80, 8088, 6809, 6502, 80186, AVR, ARM, HC11, MicroBlaze, etc.) I have written many PC and cloud apps in various languages (and created several custom languages), but I like small-platform embedded stuff the best, because I like having control of the entire machine, and figuring out how to do things with limited resources. That's why this Commander X16 project interests me. It brings back good memories!
