• 0

# Average amount of 6502 instructions per scan line, does my calculation make sense?

## Question

Hi all,

I want to validate a calculation on how much instructions are processed by the 6502, every time the vera scans a display line.

Let's assume that the average instruction processing time is 5 cycles. (It might be an overestimation, depending on the type of instructions used). But let us imagine 5, as there is a program that uses a lot of indirect indexed addressing, like LDA (\$ZP),Y ...

I've understood that the CX16 works at 8Mhz, so translating this, that would be around 8.000.000 cycles per second, correct?
The vera display is configured to operate at 640x480 resolution, and no scaling. Let's assume that the vera scans the display 60 times per second.

So the average amount of instructions that would be processed by the 6502 would be 8.000.000 cycles per second / 5 cycles per instruction, which would be 1.600.000 instructions / second.
The vera scans the complete display 60 times per second, so that would mean that 1.6000.000 instructions per second / 60 vera scans per second would result in 26.666 instructions per single vera display scanning operation. However, there are 480 vera scan lines, so 26.666 instructions per vera display scan / 480 vera scan line would result in around 55 instructions per scan line per 60 times a second.
So if we need to control about 40 objects, that would mean that for each object, there would be about 1,375 instructions per scan line available.

Is the above calculation correct or is it way off? And what other variants or parameters come into play?

Please see the below, it is the reason of my asking ... The white border measures the processing from the start of scanning till the end of scanning.
Each object on the display is processing movement logic, behaviour logic, draw logic ... I was surprised that the machine takes time to process all this logic.
Which makes it excitiing, because it allows to build a relatively more complex game, but still there are boundaries as a result of the machine configuration.

kind regards,
Sven

Edited by svenvandevelde

## Recommended Posts

• 0

It should be more like 50 instructions per scan line.

Your math is a bit off, because you forgot about the refresh interval, which actually adds up to 525 total lines.

So let's work up to the correct answer.

525 scan lines at 60 frames per second. That's 31500 lines per second.

8MHz at 5 ticks per instruction. That's 1.6MIPS

x = (8M / 5) / (525 * 60)
x = 1,600,000 / 31500
x = 50.79

So if my math is right, you can get roughly 50 instructions per scan line on the Commander X16.

• 1
• 1
##### Share on other sites

• 0

I cannot follow the whole calculation. I agree on the 1600000 instructions per second, or 26667 instructions per frame. If you update 40 objects each frame, that would be 667 instructions per object and frame.

As far as I know, there are hblank and vblank intervals, so the instructions per scan line computation is not as straightforward as it seems. I know this because of the box16 emulator, which has a CPU activity visualization with scan line overlay, and if I remember correctly, there were those hblank and vblank periods.

Anyway, you would get less than 2 instructions per scan line and object. So it is not possible to update every object on every scan line. But then again, I don't think that this is necessary. One update per frame should be enough, shouldn't it?

Edited by kliepatsch
##### Share on other sites

• 0
On 1/10/2022 at 8:48 AM, kliepatsch said:

So it is not possible to update every object on every scan line. But then again, I don't think that this is necessary. One update per frame should be enough, shouldn't it?

Of course it is. My intention is indeed to understand with some science behind, how many objects would be able to be updated per frame, or in my terms a complete horizontal vera display scan.

I'm not intending to update the object per vera scan line operation :-).

##### Share on other sites

• 0

I think you are on the right track. You have got 8 million / 60 cycles per frame, that would be 133.333. When doing such estimates, I usually counted cycles, not instructions.

As soon as you know what kind of game logic you want to do for every object, you can start prototyping the most costly algorithms that are required for these updates, and count the cycles that are required for these algorithms. Add a bit of extra, depending on how much logic you intend to have, and you have got yourself a first rough estimate of how many cycles you need to update an object.

For example, suppose you find that you cannot get around doing two multiplications while updating an object. Start prototyping the multiplication routine. Let's say it uses 200 cycles on average, so that you have 400 cycles for the two multiplications needed for each object. Let's say you need a fair bit of extra logic/code for the object update, so let's assume 250 more cycles. So your total estimate would be 650 cycles per object.

This lets me arrive at 133333/650 = 205 objects.

##### Share on other sites

• 0

BTW the screen capture looks quite nice And it's interesting to see the CPU consumption go up and down with appearing and disappearing objects.

##### Share on other sites

• 0

I think 3.5 or 4 is a better cycle count for the average 65C02 instruction. A large portion of your code are going to be 2- and 3-cycle instructions. So, I'd say it's more like 60-70 instruction per scan line. Of course, mileage varies, and if you need to use a bunch of 6-7-cycle instructions, you'll run out real quick.

• 1
• 1
##### Share on other sites

• 0

It's probably better to eschew the "instructions" count, given that the underlying reality is clock cycles and not instructions. Using Tom's numbers, you come up with ~254 clocks per scanline. I try to avoid using (ZP),Y addressing where feasible, and where not, I also tend to use page-aligned pointers in ZP because of the page-boundary-crossing penalty / slower check for non-zero Y exit value.

I don't have all of the clock cycle counts memorized for the various instructions - I just bear in mind what the CPU does and that helps remember:

implied-mode instructions like inx, rol, sei, etc.... those are all 2 clocks. Essentially one clock to read the instruction, and one clock to do it.
1-byte operand instructions tend to be 3 clock minimum. 2 clocks to read the instruction and operand, and one to do it. 2-byte operand instructions are essentially take 4 clocks: 3 to read the opcode + 2-byte instruction, and 1 to do it.

So then the (ZP),Y mode makes sense why it's 5 or 6 clocks to execute: It has to read 2 bytes to get the instruction, then it has to read 2 bytes to get the address it needs to read from, and if ZP+Y causes an overflow, it takes an extra clock to inc the page portion of the target address. Then it takes 1 clock to actually get the byte from memory into the accumulator. Hence 5 or 6 clocks to execute.

• 1
• 1
##### Share on other sites

• 0
On 1/10/2022 at 7:54 AM, ZeroByte said:

It's probably better to eschew the "instructions" count, given that the underlying reality is clock cycles and not instructions. Using Tom's numbers, you come up with ~254 clocks per scanline. I try to avoid using (ZP),Y addressing where feasible, and where not, I also tend to use page-aligned pointers in ZP because of the page-boundary-crossing penalty / slower check for non-zero Y exit value.

Yeah, this is really the smart thing to do. Count your cycles and make special note of the "+1" instructions that take an extra cycle when the operation crosses a page boundary. Those get more expensive when the upper byte of the address changes after adding in X or Y.

For example, LDA addr,X is a +1 instruction.

If you execute LDA \$1020,0, this will take 4 cycles, but if you  execute \$1020,\$F0, you're adding the extra cycle. This happens when the upper byte of the address changes. In this case, \$1020 + \$F0 = \$1110. The upper byte has changed from \$10 to \$11, forcing the CPU to spend an extra t-state to update the internal data address register.

##### Share on other sites

• 0

The spec for 640x480 60Hz VGA is exactly 800 clock cycles at 25.175 MHz per scanline.

The math from there is pretty much as @TomXP411 says: there are about 39.72 nanosecond per clock, ~31.78 microseconds per line, ~254.2 cycles of an 8 MHz clock per line, which works out to ~50.84 "average instructions".

• 1
• 1
##### Share on other sites

• 0

Thank you all for the massive feedback. Some real experienced people here

I will need to do some more code crunching to optimize my logic to get more objects processed.

Thank you for your suggestion to count in cycles. Indeed, the instructions used to build a logic varies game per game.

One big lesson that I learn out of all of this is that although the x16 has 8mhz it is still a machine with limitations. The prospect of managing 128 sprites in a fluid process really requires a thoughtful design and a coding approach that requires a lot of experience. On top of that the background scrolling logic, collision detection, AI behaviour, player control, scoring, event management adds complexity.

Beside the logic to control the sprites,  I'm the most afraid of the complexity of the collision detection. The more objects you need to control in the flow, the longer it will take to identify which objects are colliding. Even taking into account that the vera can help with the collision detection events for the specified sprites displayed on the screen.

I really would like to design this game using c language. My biggest challenge is how I can use absolute indexed addressing in the 6502. I mean, in a complex game flow the objects and memory locations move around and invite for the usage of pointers. Which results in indirect indexed addressing.

I could write a game engine using absolute indexed addressing, but then I would need to copy memory areas around and avoid pointers where I can, which also adds CPU time.

Or I would need to follow the route of self modifying code, but this one I really would like to avoid.

Not so sure what is the right way to go. I guess the solution will be in the middle. And that makes it fun, to design a software that is portable, fast and designed for speed where needed. Another boundary I'm about to hit is the code size...

Edited by svenvandevelde
##### Share on other sites

• 0
On 1/10/2022 at 10:32 PM, svenvandevelde said:

One big lesson that I learn out of all of this is that although the x16 has 8mhz it is still a machine with limitations. The prospect of managing 128 sprites in a fluid process really requires a thoughtful design and a coding approach that requires a lot of experience. On top of that the background scrolling logic, collision detection, AI behaviour, player control, scoring, event management adds complexity.

Number of sprites definitely puts load on the program even when the video card can easily accomplish it. Consider that the sprites' X/Y/"frame" information is 5 bytes. For 128 sprites, this alone is 2 1/2 pages of memory that you need to update into VRAM on every frame (potentially) - If you only update the ones with changes, that reduces the amount of data to be moved, yes, but it increases the complexity of the algorithm, which now needs to evaluate data on each pass to determine which registers need to be updated and which don't - this is usually more expensive than just blitting the sprite register updates into the registers whether they've changed or not.

• 1
• 1
##### Share on other sites

• 0
On 1/11/2022 at 5:31 PM, ZeroByte said:

For 128 sprites, this alone is 2 1/2 pages of memory that you need to update into VRAM on every frame (potentially)

Indeed! Spot on! It is a real challenge for the CX16 to do that! I've seen a couple of demos where people float some sprites around but without a lot of change or processing behind.
When you build a logic that changes the sprites per item individually with their own logic, that is a completely different story!

I need to work on my scrolling engine. By the way, the tiles are placed with a randomization algorithm, no game has the same tile sequence :-).

• 1
• 1
##### Share on other sites

• 0

From the looks of your rasterbar at the sides of the screen, you're doing fairly well CPU-wise except for a few frames where it takes the whole screen's worth of raster time... is that an artifact of the GIF creation, or is it really hitting some super-expensive snag every once in a while?

Also - is this done in C or ASM? Given the thread topic, I'm guessing ASM...

Edited by ZeroByte
##### Share on other sites

• 0
On 1/11/2022 at 9:11 PM, ZeroByte said:

From the looks of your rasterbar at the sides of the screen, you're doing fairly well CPU-wise.

Well, it can be improved. The tile algorithm is generating for a whole tile row in a new random sequence (building further upon the previous drawn tile sequence), every 32 scan lines
Walls need to fit with walls and open space can be closed with walls or can stay open, etc, etc ... Facinating stuff to learn and to code...
The algorithm takes quite some CPU actually at the moment, because a "row" is actually a combination of "logical tiles", which are a combination of 4 * 4 * 4 vera tiles.
The layer on which the background is drawn, is actually in 16x16 tile mode in 4bpp.
So per tile, there is a tile segment, which is 4 * 4 tiles, which makes a 32x32 tile. A whole "tile" is actually then a combination of 4 tile segments.
And it may not show, but the algorithm works with about 60 background tile "bitmaps", and the combinations of those create about 56 tile variations of 4x32x32 width and height.

You notice that there is sometimes a white "flash" in the border, which is the drawing of the row. It draws a complete row of 2x32x32 tiles (so half of a complete tile), and on top it draws the row the full length, which is 64 vera characters.
This will alllow me to design also a panning to the left and right, depending on the player steering. So the enemies can get out of screen focus for a while if the player flies left or right.

I really need to do some code crunching now =-). If I can tweak out a couple of cycles, that gives room for more effects or other weapons etc.
Also plan to have "towers" or ground installations in the background, that would fire also or would defend together with the player the incoming enemies.
Those would be sprites that would be layered on top of the floor plan, but under the enemies and the sprites. And those would fire weapons, or have some sort of magnetic fields or something like that.
The imagination only limits the ideas. It is a lot of fun to make this.

Sven

Edited by svenvandevelde
##### Share on other sites

• 0
On 1/11/2022 at 2:31 PM, svenvandevelde said:

Also plan to have "towers" or ground installations in the background, that would fire also or would defend together with the player the incoming enemies.
Those would be sprites that would be layered on top of the floor plan, but under the enemies and the sprites. And those would fire weapons, or have some sort of magnetic fields or something like that.
The imagination only limits the ideas. It is a lot of fun to make this.

Sven

Most classic games seem to do this kind of enemy as tiles in the BG layer. That would certainly cut down the CPU time required to keep track of them and move them along with the scroll....

##### Share on other sites

• 0
On 1/11/2022 at 9:35 PM, ZeroByte said:

Most classic games seem to do this kind of enemy as tiles in the BG layer. That would certainly cut down the CPU time required to keep track of them and move them along with the scroll....

True. I might do a combination of things. A larger tile in the background, and then an animation of the gun on the top of it ... But it will take time for me to program it, and also draw the tiles and the sprites. But I guess this is part of the fun, isn't it?

##### Share on other sites

• 0
On 1/11/2022 at 9:11 PM, ZeroByte said:

Also - is this done in C or ASM? Given the thread topic, I'm guessing ASM...

Nope. Pure C code ...

##### Share on other sites

• 0
On 1/10/2022 at 11:32 PM, svenvandevelde said:

... One big lesson that I learn out of all of this is that although the x16 has 8mhz it is still a machine with limitations. ...

Oh my goodness yes ... remember, there's a reason why, when IBM got the clock count up to 8MHz with a 16bit 286 processor in back in 1984 ... the computer industry didn't say, "well, 8MHz will do 'er" and stop there.

##### Share on other sites

• 0

In my 4096 colour demo I ran into exactly the same question.

To achieve this the demo has to update 64 colours of the 256-colour palette using the raster interrupt every 4 scanlines. This should be finished by the time the next scanline is shown on screen. But of course there just isn't enough instructions available. I solved this by adding black spaces in between the lines, so the demo actually has 3 full scanlines available. And that's just about enough time to change 64 colors in assembler using a predefined lookup table and the VERA auto-increment feature.

But hey - it shows all 4096 colours on one screen .

##### Share on other sites

• 0

Been working to optimize the code. It is very funny.

This thing does dynamic tiling of the background based on tile sets in 256 colors, it controls the sprites, it controls the player plane and the bullets, and it controls the smooth scrolling.

Especially for the smooth scrolling I've made an algorithm that copies every "row" of 16 pixels accross the invisible part on the map, so that the base of the visible part of the map can be flipped from row 0 to row 31 without the player perceiving this flip. A lot of stuff optimized in the background. The white bars show the CPU usage per frame!

I am now looking for a very optimal way to perform collision detection.

##### Share on other sites

• 0
On 1/23/2022 at 11:44 AM, svenvandevelde said:

This thing does dynamic tiling of the background based on tile sets in 256 colors, it controls the sprites, it controls the player plane and the bullets, and it controls the smooth scrolling.

I'm very impressed at what you've been able to accomplish. I look forward to playing the game someday, perhaps even examining the code to see how you've created such a result.

• 1
• 1
##### Share on other sites

• 0
On 1/23/2022 at 8:44 PM, svenvandevelde said:

I am now looking for a very optimal way to perform collision detection.

For this I don't want to build a loop that checks on collision let's say ever bullet in the air with every sprite in the air. So if there are 15 bullets in the air with 20 enemies, that would mean 300 iterations checking the bounding rectangles. Even when using hardware collision detection this many iterations can occur.

I'm thinking of approaching it using "quadrants".

When the objects are moved over the map, a quadrant matrix can help where each quadrant can be filled out. This would limit the search to those objects in the quadrant where the collision happened. Something like 16 quadrants.

Of course, this makes the cpu load of moving the objects more heavy but it will make collision detection faster.

An other option is to implement an R tree  however this is complex and will increase the code base. Or linked lists. Not sure what approach to take...

##### Share on other sites

• 0
On 1/23/2022 at 9:07 PM, Edmond D said:

I'm very impressed at what you've been able to accomplish.

Very kind of you, thank you. It is only in progress and am facing many walls of limitations. But that makes it fun. Limitations like CPU, memory. 256 and/or 16 bit graphics eat memory. Especially when animations of sprites or tiles are involved. Moving many sprites is very CPU intensive, but also changing sprite offsets is CPU intensive. The VERA is limited in memory but also in speed of its memory access. But the VERA also has strengths. The panning or scrolling of the screen over a defined map helps. It has a very nice design. Pity that the VERA does not support alpha channels in the palette. It does have 4 bits of space for it in the palette. I'm sure @Frank van den Hoef has considered it, but it might have been too CPU intensive to paint the screens accurately with transparency levels.

Anyway ...

If you want I can package it for a try out in the download section with source code. No issue. Only not this evening since I'm already in bed with mobile on. Somewhere next week I'll do a post in this threat with the material.

Edited by svenvandevelde
##### Share on other sites

• 0
On 1/23/2022 at 8:44 PM, svenvandevelde said:

I am now looking for a very optimal way to perform collision detection.

I've used hardware collision detection it in "Invaderz" and it works perfectly well. I have a similar amount of objects on screen. Now I only have to do the check once an object actually was hit, which works even when doing it in C. You can assign each sprite to to groups and then get notified at collisions between groups (simplified). I also ordered the sprites in Y-Axis so the check then started at the optimal position. I've also divided the screen into bigger areas like you suggested to rule out sprites right away.

Be aware that collision detection might behave slightly different in interlaced modes. But as my games are in 320x240 resolution that is not really an issue.

Edited by AndyMt
##### Share on other sites

• 0
On 1/24/2022 at 8:27 AM, AndyMt said:

You can assign each sprite to to groups and then get notified at collisions between groups (simplified). I also ordered the sprites in Y-Axis so the check then started at the optimal position.

How did you order the sprites at the Y position, using some sort of list?

I've been reading some facinating material on spacial grids, which kind of allows to subdivide the screen in grids. They propose to use a binary spacial mask system for an X and Y axis grid, which are properties of entity data. They claim that while the sprites move over the screen, it is then very easy to filter which sprites are in which grids by performing AND operations on the X and Y grid location bits of each entity. However, to maintain and especially to build such a grid system on the X16 might be challenging, because of the CPU overhead while moving the objects.

For example, if in my game I would create a grid of 16 cells on X and 16 cells on Y axis, then with 2 word size spacial masks I could document for each entity the grid it is in. And while moving the object over the screen, I could build a small system to update the X and Y axis grid positions. Then when a hardware collision would happen, I could then do a double loop over the entities, but I would only need to test the AABBs where the objects would be together in the same grid spacial mask. This might be performant, because i don't need to sort data or dynamically allocate data. I just update these space maks. There won't be more than let's say 40 to 50 objects active at the same time. And the hardware accelleration would be able to specify which groups are relevant in the collision, so with some smart planning, the objects to be tested could depend on these 4 groups ...

Here is some reading material with some code examples ... I really got inspired by all this...

Some good material by a guy called 0FPS  -- Collision detection (part 1): Overview – 0 FPSCollision detection (part 2): Box intersection – 0 FPSJanuary 2015 – 0 FPS

A post i found with the idea as described above -- 2d - Using uniform grids for collision detection - Efficient way to keep track of what a cell contains - Game Development Stack Exchange

A very good article with a live demo in java script -- Broad Phase Collision Detection Using Spatial Partitioning - Build New Games

An article on stack exchange  -- java - Making an efficient collision detection system - Game Development Stack Exchange

There are much more of such articles but the more I read about these things, the more facinating I start to find it all ...

Edited by svenvandevelde

## Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

×   Pasted as rich text.   Paste as plain text instead

Only 75 emoji are allowed.