AstronautSurfer

6502 RISC instruction set running at 3.4 GHz..


Posted (edited)

Could you imagine a processor running the 6502 (or similar) instruction set at 3.4 GHz or any other modern CPU speed?

I can't help but wonder if that wouldn't just run circles around Intel and AMD processors.

I could be wrong. But it sure is fun to think about (especially multi-core) 🙂

 

Edited by AstronautSurfer

The biggest problem, I think, would be memory access time. Modern CPUs have complex fetching and caching schemes to pull lots of memory into the CPU at one time, and lots of units running in parallel to keep the CPU busy at all times. Whenever an x86 CPU has to access RAM, it potentially has to slow down and wait for the memory request to be fulfilled.

This is the primary reason why a 1 MHz 6502 was comparable to a 4.77 MHz 8088 for certain types of processing. If you could keep an 8088 busy with data already loaded into registers, it would potentially be a lot faster than a 6502 (depending on the instructions), but any time it had to go to memory it took 4 cycles, and since it was a 16-bit CPU with an 8-bit bus, it took 8 cycles to load a word.
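As a rough sketch of that memory-bandwidth point (the cycle counts are the simplified ones from the paragraph above, not exact bus timings):

```python
# Illustrative peak-throughput model, using the assumed figures above:
# the 6502 can touch memory nearly every cycle; the 8088 needs roughly
# 4 cycles per 8-bit bus access, so a 16-bit word costs ~8 cycles.

def bytes_per_second(clock_hz, cycles_per_byte):
    """Peak bytes the CPU can move per second through its bus."""
    return clock_hz / cycles_per_byte

mos_6502 = bytes_per_second(1_000_000, 1)    # 1 MHz, ~1 cycle per byte
intel_8088 = bytes_per_second(4_770_000, 4)  # 4.77 MHz, ~4 cycles per byte

print(f"6502 peak: {mos_6502 / 1e6:.2f} MB/s")   # 1.00 MB/s
print(f"8088 peak: {intel_8088 / 1e6:.2f} MB/s")  # 1.19 MB/s
```

Despite the nearly 5x clock difference, the peak memory throughput comes out almost even, which is the point being made.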

I would love to see a super fast 6502, but since the model requires a memory access (or more) with most instructions, it will always be limited to the speed of RAM access. Now, modern RAM can be accessed very quickly compared to the Good Old Days, but each access is (as I understand it) accessing 64 bits at one time. So the interface between the CPU and RAM has to be able to deal with the bits per access. A 6502 is built around 8-bit bytes, so either the CPU has to be rearchitected to deal with more bits per memory access, or a shim interface would have to be inserted between the CPU and RAM to mask out just the bits of interest, which would slow things down.

http://forum.6502.org/viewtopic.php?f=1&t=6049 is a forum post that talks about the theory and practice of what 6502 architecture speeds have achieved. Most notable is the quote (if accurate):

Quote

Bill Mensch (WDC's owner) said in an interview [...] that he estimated that with the newest technology of the day [circa 2017], [6502] could probably hit 10GHz.

So it isn't thought impossible by experts, but someone has to want to do it to make it happen. Clearly the market for it isn't there or else it would have been done already (most likely).


Surely you could put the entire zero page in the processor as a register file, have a rotating 64-bit instruction cache, a 64-bit read cache, and a rotating 64-bit write cache ... though for non-sequential writes you will often have to do a read then a write ... and crank up the CPU until it is running so fast it is always waiting for memory ...

... and that speed where it is always waiting for memory will be somewhere short of the maximum feasible speed.

The 6502 was designed as a memory-bound processor with an instruction set that required very few transistors but still lets you get something useful done. It was also designed to avoid violating Motorola IP the way the 6501 did. It had its role in home computing because it was the cheapest way to do things at the time.

Now, the version where there is 128 KB of on-CPU cache, with the Low RAM entirely in that cache and the High RAM and ROM segments starting to be cached as soon as the banks are selected ... that version would let you crank the CPU speed a bit higher before it gets memory bound.

3 hours ago, AstronautSurfer said:

Could you imagine a processor running the 6502 (or similar) instruction set at 3.4 GHz or any other modern CPU speed?

I can't help but wonder if that wouldn't just run circles around Intel and AMD processors.

I could be wrong. But it sure is fun to think about (especially multi-core) 🙂

Not really. Modern AMD64 processors are much more clock-efficient than the 6502 was, and Intel CPUs have been superscalar since the 90s. A superscalar CPU can execute more than one instruction per clock cycle, which is faster than any single-issue design can manage at the same clock speed.

If one were to assume MOS had continued development of the 6502 and built a 32-bit and 64-bit chip, it would have ended up taking basically the same development path that either Intel or ARM took: ARM trended toward RISC designs and low-power CPUs (which is why virtually all cell phone CPUs are ARM designs), and Intel trended toward larger dies and more parallelism, which is why we have 20 execution units on a Core i9.

In fact, going back to 1985 or so... the 6502 only appeared to run faster than its competition because it cheats: the 6502 splits the clock internally into two phases (Phi A and Phi B), and it further splits each of those phases in half and does certain things on the front half and back half of the phase. While that worked at 1 MHz and 2 MHz clock speeds, I suspect it is unsustainable at higher speeds. You simply can't jam 3 or 4 T-states into a single clock tick and expect all of the disparate parts of a system to stay in sync at speeds of hundreds or thousands of megahertz.

 

 


Well, with DDR5 promising 4.8 GHz and up to 8.4 GHz clock speeds, I guess I could start to believe in the possibility of a high-quality manufactured ASIC hitting the GHz range while running the 6502 set of opcodes, without having to be throttled by memory fetches. Would it run circles around modern Intel and AMD processors, though? Lulz no; we're talking about an 8-bit CPU that can't even do its own integer multiplication, much less floating point, and let's not even start on division. Even if you could clock it up to modern CPU speeds, the programs you compare against would be spending thousands of cycles performing operations that a modern Intel, AMD, or ARM CPU can do in as few as 30 cycles. And the memory efficiency of the program code would suffer greatly as well, since the 6502 has no opcodes for floating point math, different type sizes, or vector operations.

Sorry, we're all fans of the 6502 here, but there's just no reality where a 6502-circa-2022 is going to remotely compete on performance.
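The multiplication gap described above can be sketched in Python; the cycle estimate in the comment is an assumed ballpark, not a measurement:

```python
def mul8_shift_add(a, b):
    """8-bit x 8-bit -> 16-bit multiply the way a 6502 routine does it:
    test each multiplier bit, conditionally add, then shift."""
    product = 0
    for _ in range(8):
        if b & 1:
            product += a
        a <<= 1
        b >>= 1
    return product

# Ballpark cost (assumed): around 15-20 cycles per loop iteration on a
# real 6502 means well over a hundred cycles for one 8x8 multiply,
# versus a single MUL instruction on any modern core.
print(mul8_shift_add(200, 150))  # 30000
```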


Also: https://en.wikipedia.org/wiki/Instructions_per_second

So if we extrapolate from that list, bumping a 1 MHz 6502 to 3.4 GHz, we would get a 6502 running at about 1,462 MIPS. Compare that to an Intel Core i7-2600K from 2011 (the most recent CPU in the list running at 3.4 GHz), which achieves 176,170 MIPS. So roughly two orders of magnitude faster.

Of course, that is a 4-core CPU, so divide the MIPS by 4 to get 44,042 MIPS for a single core. That means the Core i7 is about 30 times faster, core for core, than a 6502.

An interesting note is that the Raspberry Pi 2 achieves about the same MIPS per core at only 1 GHz, roughly 1/3 the clock speed of the theoretical 6502 imagined here.
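The extrapolation above can be reproduced in a few lines (0.43 MIPS for a 1 MHz 6502 is the figure from the linked list; the linear scaling is of course the generous assumption):

```python
# Linear extrapolation of 6502 MIPS from 1 MHz to 3.4 GHz.
base_mips_at_1mhz = 0.43           # 6502 @ 1 MHz, per the instructions-per-second list
scaled = base_mips_at_1mhz * 3400  # scale factor: 3.4 GHz / 1 MHz

i7_2600k_mips = 176_170            # Core i7-2600K, 4 cores @ 3.4 GHz
per_core = i7_2600k_mips / 4

print(f"3.4 GHz 6502 (extrapolated): {scaled:.0f} MIPS")
print(f"i7-2600K, per core:          {per_core:.0f} MIPS")
print(f"core-for-core ratio:         {per_core / scaled:.0f}x")
```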


The 6502 running at 3.4 GHz would be significantly slower than modern CPUs. You can check out this article where someone manages to emulate the 6502 at speeds equivalent to 15 GHz on a modern Intel CPU running at a clock speed less than a third of that. So modern CPUs do a lot more in a single clock cycle than the 6502.

https://scarybeastsecurity.blogspot.com/2020/04/clocking-6502-to-15ghz.html

The 6502 is also not very RISC-like. Its whole design assumes that accessing memory is fast. Something like the zero page, if not cached, would be horribly slow in the world of modern memory, which is not competitive with CPU registers for speed at all.


So, then, let's drive this discussion in another direction. Assuming that, clock-for-clock, the 65C02 wouldn't be 'competitive,' would a max-speed chip nonetheless still be... 'acceptable' for modern-day tasks? Could you watch a 144p YouTube video? Play a 128kbps MP3? Render a modern web page with scripts and CSS?

The question becomes: with what RAM? These things depend on huge datasets being able to be accessed and manipulated, often with SIMD instructions, and there's still only the 64K address space, even with the kind of things that the X16 does via paging.

With this sort of primitive, MMU-style paging, though, a 65C816 suddenly becomes a much more attractive option. Windows 95 was able to run in the 16 MB memory space that an '816 can access, and if the paging system used 4 MB pages (scaled proportionally to the size of the X16's ROM pages [or 2 MB if you'd prefer it scaled to the RAM page size]), then you've got a sort of supercharged version of the old DOS EMS. You can do some fairly complicated web pages with those kinds of resources.

But without a 32-bit data bus and a hardware multiplier at the bare minimum, you're just not going to be able to move enough data fast enough and get enough arithmetic done to play those compressed media files. Uncompressed (or 'trivially' compressed, like RLE or TMV) is no problem. The X16 can do that if you have a fast enough storage interface that you can keep the buffer full. A 3.4GHz '816-based system could push data fast enough to play 4K HD video at 60p if that's all it has to be doing.

10 hours ago, Serentty said:

The 6502 running at 3.4 GHz would be significantly slower than modern CPUs. You can check out this article where someone manages to emulate the 6502 at speeds equivalent to 15 GHz on a modern Intel CPU running at a clock speed less than a third of that. So modern CPUs do a lot more in a single clock cycle than the 6502.

https://scarybeastsecurity.blogspot.com/2020/04/clocking-6502-to-15ghz.html

The 6502 is also not very RISC-like. Its whole design assumes that accessing memory is fast. Something like the zero page, if not cached, would be horribly slow in the world of modern memory, which is not competitive with CPU registers for speed at all.

Yes, as I said, the first step after speeding up the CPU is to have the zero page on the CPU as a register file. Indeed, you'd have the stack page as a register file as well. But LDA (zp),Y is still three memory clocks unless you have an instruction cache and/or data cache.
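The "three memory clocks" tally can be checked with a quick accounting sketch; the access breakdown here is an assumption about the hypothetical design being discussed, not a datasheet figure:

```python
# Bus accesses for LDA (zp),Y when the zero page lives on-chip as a
# register file: the two pointer-byte reads become free, and only the
# opcode fetch, operand fetch, and data read still hit external memory.
ACCESSES = {
    "opcode fetch": "external",
    "zp operand fetch": "external",
    "pointer low byte": "zero page",   # free with zp as a register file
    "pointer high byte": "zero page",  # free with zp as a register file
    "data read": "external",
}

external = sum(1 for kind in ACCESSES.values() if kind == "external")
print(external)  # 3 external accesses remain without instruction/data caches
```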

Posted (edited)
On 7/15/2021 at 3:36 AM, kelli217 said:

So, then, let's drive this discussion in another direction. Assuming that, clock-for-clock, the 65C02 wouldn't be 'competitive,' would a max-speed chip nonetheless still be... 'acceptable' for modern-day tasks? Could you watch a 144p YouTube video? Play a 128kbps MP3? Render a modern web page with scripts and CSS?

The question becomes: with what RAM? These things depend on huge datasets being able to be accessed and manipulated, often with SIMD instructions, and there's still only the 64K address space, even with the kind of things that the X16 does via paging.

With this sort of primitive, MMU-style paging, though, a 65C816 suddenly becomes a much more attractive option. Windows 95 was able to run in the 16 MB memory space that an '816 can access, and if the paging system used 4 MB pages (scaled proportionally to the size of the X16's ROM pages [or 2 MB if you'd prefer it scaled to the RAM page size]), then you've got a sort of supercharged version of the old DOS EMS. You can do some fairly complicated web pages with those kinds of resources.

But without a 32-bit data bus and a hardware multiplier at the bare minimum, you're just not going to be able to move enough data fast enough and get enough arithmetic done to play those compressed media files. Uncompressed (or 'trivially' compressed, like RLE or TMV) is no problem. The X16 can do that if you have a fast enough storage interface that you can keep the buffer full. A 3.4GHz '816-based system could push data fast enough to play 4K HD video at 60p if that's all it has to be doing.

No, because the instruction set simply doesn't have the needed operations. 

As was mentioned above, there's no integer divide or multiply, let alone doing floating point math or the SSE instructions that operate on 4 integers at a time. 

Scripts are a maybe, but again, note the performance numbers Scott pulled out above. A 3 GHz 6502 would be running at the equivalent speed of a 100 MHz x86, but without the math coprocessor, or even multiplication or division.

If you think back to the 100 MHz days: yes, you can absolutely run a web browser on a 100 MHz computer, but it's going to be much slower than a modern PC, and anything involving fancy math (such as decompressing JPG and PNG graphics) is going to be slooooooooooooooooooow.

 

Edited by TomXP411

The long but interesting interview with Jim Keller on Anandtech comes to mind, specifically this quote:

"JK: [Arguing about instruction sets] is a very sad story. It's not even a couple of dozen [op-codes] - 80% of core execution is only six instructions - you know, load, store, add, subtract, compare and branch. With those you have pretty much covered it. If you're writing in Perl or something, maybe call and return are more important than compare and branch. But instruction sets only matter a little bit - you can lose 10%, or 20%, [of performance] because you're missing instructions."

 

 

13 hours ago, TomXP411 said:

No, because the instruction set simply doesn't have the needed operations. 

I'm going to proceed under the impression that your first sentence responds directly to my last sentence. You are probably still correct, but I may not have made it clear that I was talking about completely uncompressed video, or something trivially compressed like Run-Length Encoded or Text Mode Video, where the CPU does not have to do any math to decompress lossy data, but is just pushing pixels or character data. And doing it on this theoretical 3.4 GHz 65C816, with no OS overhead.

5 hours ago, kelli217 said:

I'm going to proceed under the impression that your first sentence responds directly to my last sentence. You are probably still correct, but I may not have made it clear that I was talking about completely uncompressed video, or something trivially compressed like Run-Length Encoded or Text Mode Video, where the CPU does not have to do any math to decompress lossy data, but is just pushing pixels or character data. And doing it on this theoretical 3.4 GHz 65C816, with no OS overhead.

Actually this gets to the only point of taking a processor with such a small transistor footprint and speeding it up like that, which is that the bulk of the mask is taken up by some other specialized circuitry.

One problem with that approach is less technical than economies of scale ... making the specialized circuitry so that it sits on a bus, driven by an external CPU, means it can be used with more than one CPU, just as the CPU is produced to use more than one specialized circuit.

If you were going to do it anyway, a problem with using the 6502 for it is economies of scope ... since the 6502 instruction set is not well suited to compiled C software, a lot of things would have to be built from scratch that wouldn't have to be if you used a low-power ARM architecture and accepted the larger mask footprint as the tradeoff for the much greater toolchain support base.

And if you were going to do it ANYWAY, and build your own toolchain, then a stack machine would give you an even smaller transistor footprint with more MIPS than a 6502 of the same speed.

23 hours ago, kelli217 said:

I'm going to proceed under the impression that your first sentence responds directly to my last sentence. You are probably still correct, but I may not have made it clear that I was talking about completely uncompressed video, or something trivially compressed like Run-Length Encoded or Text Mode Video, where the CPU does not have to do any math to decompress lossy data, but is just pushing pixels or character data. And doing it on this theoretical 3.4 GHz 65C816, with no OS overhead.

You already have all the information you need to do the math.

1080i60 is the current broadcast standard (although I suspect studios are producing video internally at 1080p60 or 3840p60).

1920 x 1080 is 2,073,600 pixels, or 6,220,800 bytes per frame at 3 bytes per pixel. That is 186,624,000 bytes per second at 30 full frames per second. It was already stated above that a 3.4 GHz 6502 would run at roughly 1,462 MIPS.

How are you going to move 186 megabytes per second when every byte moved costs several instructions out of that budget?

At this point, the questions you’re asking are making less and less sense, since it’s already been explained that 6502 architecture is decades out of date, and no amount of clock cycles will make it a practical processor for modern desktop computing demands. 
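The frame-size arithmetic above can be double-checked with a quick sketch (3 bytes per pixel and 30 full frames per second for 1080i60 are the assumptions in play):

```python
# Uncompressed 1080i60 bandwidth, per the figures in the post above.
width, height = 1920, 1080
bytes_per_pixel = 3       # 24-bit colour assumed
frames_per_second = 30    # 1080i60 = 60 fields = 30 full frames

frame_bytes = width * height * bytes_per_pixel
stream_bytes = frame_bytes * frames_per_second

print(frame_bytes)   # bytes per frame
print(stream_bytes)  # bytes per second
```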


 

On 7/10/2021 at 1:28 AM, BruceMcF said:

Now, the version where there is 128 KB of on-CPU cache, with the Low RAM entirely in that cache and the High RAM and ROM segments starting to be cached as soon as the banks are selected ... that version would let you crank the CPU speed a bit higher before it gets memory bound.

This has been in my mind for a few days now and is something I find super interesting. Essentially, placing the "main" memory onto the die as SRAM would let it run at the CPU clock speed. This would include the ZP (which I think removes the need to treat the ZP as part of a register file; we can treat it the same as on, say, the X16) as well as the stack. We're well beyond the point of having external SRAM here in order to run at anything remotely close to 3.4 GHz or whatever, but given Ryzen has 384 KB of L1 cache, this seems in the realm of possibility.

A 6502 running that fast is still going to run into the processing limits of a 64 KB space. That is, it'd churn through data so fast that having such a small amount of memory would be pretty limiting (it already is on the X16, hence the himem stuff). I'm less sure about how to access external data; surely some level of caching, or extra hardware to manage, say, modern DDR RAM, would be needed. Fetching data from DDR RAM would take longer than internal memory so, as was alluded to in other comments, there's going to be waiting going on and a need for some additional caching, I suspect. I/O would also have to be solved: on the X16 everything runs at bus speed and it's all good, but there's far more complexity on modern systems to manage, say, the PCI Express bus. Even interfacing with our lovely YM2151 or the VERA would have to be very, very different here.

Not to mention the design requirements from the physical hardware standpoint (trace lengths for the mainboard, etc. etc.) all become very important at these speeds.

At this point, it kinda breaks down on what the next step would be here. But it is kind of a fun exercise to think about, I thought anyway. I mean there is surely a reason why the T-800 uses a 6502 after all 😉

 

 

 

1 hour ago, m00dawg said:

At this point, it kinda breaks down on what the next step would be here. But it is kind of a fun exercise to think about, I thought anyway. I mean there is surely a reason why the T-800 uses a 6502 after all 😉

And Bender! Head canon: With all the issues found with advanced CPU techniques, bugs in CPUs, security issues due to inter-core spying, etc etc etc, they decided to go to the best CPU that would be known secure where any defects were well documented. Yeah, that's it.

7 hours ago, m00dawg said:

A 6502 running that fast is still going to run into the processing limits of a 64 KB space. That is, it'd churn through data so fast that having such a small amount of memory would be pretty limiting (it already is on the X16, hence the himem stuff). I'm less sure about how to access external data; surely some level of caching, or extra hardware to manage, say, modern DDR RAM, would be needed. ...

Access to the 2 MB of L2 RAM cache and 1 MB of L2 instruction cache is straightforward ... you put a byte into $0000 in the address space to select an 8 MB segment of L2 cache, or a 16 MB segment of L2 instruction cache, and reading the associated bank window copies the L2 contents into the L1 cache, 64 bits at a time. Writing is more involved (which is why only one window is R/W, and it's the 8 KB one), but each write sets a bit in a 1 KB written-value register, which in turn sets a bit in a 128-byte written-word register (one bit per 64-bit word). When not reading the L2 R/W cache, the controller scans through the written-word register and writes back the new contents, 64 bits at a time, guided by the written-value register.

Each core has its own 64 KB L1 cache, but the L2 cache is common to all cores, so only core 1 has the memory controller registers in its memory-mapped I/O space.
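A rough software sketch of that dirty-bit write-back idea; the class and sizes here are hypothetical, just to make the mechanism concrete:

```python
WINDOW = 8 * 1024  # the 8 KB R/W bank window described above

class WriteBackWindow:
    """Model of the scheme: per-byte dirty bits (the "written value
    register") plus per-64-bit-word dirty bits (the "written word
    register") that the controller scans to write back lazily."""

    def __init__(self):
        self.data = bytearray(WINDOW)
        self.dirty_byte = [False] * WINDOW         # 8192 bits ~ a 1 KB register
        self.dirty_word = [False] * (WINDOW // 8)  # 1024 bits ~ a 128-byte register

    def write(self, addr, value):
        self.data[addr] = value
        self.dirty_byte[addr] = True
        self.dirty_word[addr // 8] = True

    def flush(self, backing):
        # Scan the coarse dirty map and copy whole 64-bit words back,
        # as described happening while the window is otherwise idle.
        for word, dirty in enumerate(self.dirty_word):
            if dirty:
                backing[word * 8:(word + 1) * 8] = self.data[word * 8:(word + 1) * 8]
                self.dirty_word[word] = False
        self.dirty_byte = [False] * WINDOW

# Usage: two scattered writes dirty two words; flush moves only those.
win = WriteBackWindow()
ram = bytearray(WINDOW)
win.write(10, 0xAB)
win.write(4096, 0xCD)
win.flush(ram)
print(hex(ram[10]), hex(ram[4096]))
```

The appeal of the two-level map is that the write-back scan touches 1024 word flags rather than 8192 byte flags, at the cost of occasionally writing back clean bytes that share a word with a dirty one.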


Well, if my Apple //c had a 3.4 GHz 65C02, maybe it could actually run Flight Simulator 2 at a decent speed. That said, in retrospect the 6502 does things... strangely, compared to some of the more modern designs I studied in college (the 68000 comes to mind), but it was "good enough" to power the most popular computers of all time. My goal for this year was to learn 6502 assembly, but it's slow going because I keep thinking in a modern context; the entire concept of the zero page was alien to me before I realized the intent. And really, that sort of stuff is what engineers like to call "getting it done." A theoretical processor that does everything in one clock cycle with a billion megs of registers blah blah blah is fun to think about, but the 6502 was made by engineers with a problem to solve, and they did it in a way that was not only effective, but super cheap to mass produce. I digress, but thinking about this sort of stuff really makes you appreciate the old-school design philosophy. Shoestring budget, looming deadlines, no such thing as soft patches or firmware updates... it's gotta work or you don't eat.

The paradigm of programming has shifted so much over 40 years, anyway. Assembly programmers are a rare breed these days, which is honestly a shame, since assembly teaches you to understand what you're actually doing on a fundamental level and lets you fully exploit a system, instead of just crossing your fingers that the compiler takes your crappy code and turns it into something quick enough for the job. It's almost like code optimization has become a bad word. That said, I don't think a modern compiler would be able to take advantage of the 6502 in a way that would scale linearly with more speed. Obviously I could be wrong, but honestly, running GEOS on my Ultimate64 at 20 MHz or so seems faster than using my MacBook, so maybe something like 100 MHz would be a reasonable limit of usefulness?

26 minutes ago, Brad said:

the 6502 was made by engineers with a problem to solve and they did it in a way that was not only effective, but super cheap to mass produce. 

This right here is why the 6502 was so popular. It was cheap, compared to the 8080, its best competition when it was first created. The 6502 cost $25 in 1975, compared to $360 for the 8080 at launch. 

Even if the 8080 went down in price over the two years between its release and the 6502 launch, I still doubt it went down to $25. Not even close. So when Apple and Commodore both set out to release an inexpensive home computer, it's no surprise they went with the MOS 6502, rather than the Intel 8080. 

Obviously, Intel's strategy won out, but that's mostly due to the success of PC clones, rather than the intrinsic merit of the processor. I actually do think the 8080 was a better CPU, but was it 7 times better? With the price of the two processors, I'd have made the same decision as Tramiel, Woz, and Jobs back in the 70s.

 

Edited by TomXP411

An example of that is how the subtract function simply reuses the add circuit, inverting the bits of the byte being subtracted, but without inverting the carry flag.

"But that is not the two's complement negative, to get the two's complement negative you need to add one!"

Oh, yeah, so for subtract, a set carry means borrow clear. If there is a borrow, the carry is left clear, meaning borrow set.

"But that makes no sense, that makes the borrow flag the inverse of the carry flag!"

Yes, exactly. And saves the transistors needed to invert the carry flag.

"But the CMP instruction needs to subtract too?"

Yes, and it needs to subtract with borrow clear. It doesn't care whether that sets the carry input to one or to zero.
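The carry-as-inverted-borrow behaviour described in this exchange can be modelled in a few lines of Python:

```python
def sbc(a, m, carry):
    """6502-style SBC: A = A + ~M + C, eight bits wide.
    Carry out == 1 means "no borrow", so you SEC before a subtract,
    and a clear carry afterwards signals that a borrow happened."""
    result = a + (m ^ 0xFF) + carry
    return result & 0xFF, 1 if result > 0xFF else 0

print(sbc(5, 3, 1))  # (2, 1): 5 - 3, no borrow, carry stays set
print(sbc(3, 5, 1))  # (254, 0): 3 - 5 wraps, carry cleared = borrow
```

Note how the "+1" needed for the two's complement comes for free from the carry input when it is set, which is exactly the transistor saving being described.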

 

 

Edited by BruceMcF

10 minutes ago, BruceMcF said:

An example of that is how the subtract function simply inverts the bits of the byte being subtracted.

"But that is not the two's complement negative, to get the two's complement negative you need to add one!"

Oh, yeah, so for subtract, you set carry for borrow clear. If there is a borrow, leave carry clear for borrow set.

"But that makes no sense, that makes the borrow flag the inverse of the carry flag!"

Yes, exactly. And saves a lot of transistors.

... and requires you to do a CLC or SEC every time you do simple math, unless you know the state of the carry flag already. 

This is exactly the kind of tradeoff you get when you simplify a system to save money... but obviously the result was worth it.

Edited by TomXP411


Quite. Using ADC for both ADC and ADD, and SBC for both SBC and SUB, means there are two operations that simply don't have to be provided, which is even more transistors saved.

Similarly, arithmetic shift right is less commonly needed, and when it is needed it can be done by a routine, so leave it out.
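As a sketch of the kind of routine that stands in for the missing arithmetic shift right, built from the logical shift the chip does provide (on real hardware the usual 6502 idiom is `CMP #$80` then `ROR A`; this Python version just models the result):

```python
def asr8(value):
    """Arithmetic shift right for an 8-bit value: a logical shift
    (what LSR gives you) plus a fix-up that re-inserts the sign bit."""
    result = value >> 1    # LSR: logical shift right, bit 7 becomes 0
    if value & 0x80:       # was the sign bit set?
        result |= 0x80     # re-insert it, preserving the sign
    return result

print(bin(asr8(0b00000100)))  # 0b10: positive value halves normally
print(bin(asr8(0b10000100)))  # 0b11000010: sign bit preserved
```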


I can't imagine a descendant of the 6502 produced for the desktop today, still with 8-bit registers and data bus, a 16-bit address bus, and no extra ports and buses, working out for a general audience outside of library terminals or something like the IT arrangement of the Continental Hotel, especially if this beast had a superscalar, multicore architecture. Way too many computing architectural compromises would be required to make it work.

On the other hand, if either Commodore or Western Design Center had, circa the late 70s, committed to something like Andre Fachat's 65K project...

On 7/24/2021 at 12:24 AM, Kalvan said:

I can't imagine a descendant of the 6502 produced for the desktop today, still with 8-bit registers and data bus, a 16-bit address bus, and no extra ports and buses, working out for a general audience outside of library terminals or something like the IT arrangement of the Continental Hotel, especially if this beast had a superscalar, multicore architecture. Way too many computing architectural compromises would be required to make it work.

On the other hand, if either Commodore or Western Design Center had, circa the late 70s, committed to something like Andre Fachat's 65K project...

Now, I should stress that I was not engaged in a serious design proposal up there, but rather playing a less than totally serious what if game that "just coincidentally" ended up with something like a CX16 memory map.

But, yeah, it's text manipulation or some other 8-bit-oriented data in any event -- it's not oriented to video -- given that the notion of single-word Unicode has pretty much universally given way to UTF-8 ... and it would have to be something that leverages the fact that there are VERY FEW TRANSISTORS in the 6502 hardware design ... so it's got to be a whole lot of 6502 cores on a chip ... in that respect more like an array of GPU cores in some GPU designs, but targeting a text-processing rather than video-processing dedicated application.

So it really is a 6502 core with 512 bytes of on-chip cache as a dedicated zero page and stack page, taking advantage of the very small hardware footprint of the 6502 to give it much more persistent local storage than is typically feasible in a processor-array chip design ... and then the L2 cache is shared out between the cores somehow ... say, locations $0000-$0007 are byte-wide L2 segment registers for the 64K address space in 8K segments, giving up to 2 MB of L2 cache. $0008/$0009 control whether they are R/W, read-only, or write-only, so that an individual bank can be used as a pipeline between two cores.

Actual RAM is accessed using a DMA controller, mapped into the zero page address space of "one in N" of the 6502 cores, which controls reading and writing between the L2 cache and RAM.

All quite implausible, but fun to think about.

Edited by BruceMcF

