RISC vs. CISC in the mobile era

Meanwhile, back in the day…

Back in 1998, when I first began covering hardware at the newly launched Ars Technica, much of my writing focused on issues raised by the raging Mac vs. PC flame wars that took place in computing forums across the Internet. These flame wars were often centered around the esoteric issue of instruction set architecture (ISA), as partisans on each side argued over which type of ISA was superior—those ISAs based on the RISC philosophy expressed in the PowerPC processors that powered the Mac, or the CISC camp that most think is exemplified by Intel’s x86 processors.

Incidentally, in those last years of the 20th century, dial-up was officially on its way out, and broadband Internet was completing its transformation from The Next Big Thing into a fact of urban life. The rise of broadband Internet brought about a shift in the way that people used and related to their computers. The “home PC” moved from being a monolithic, self-contained work and entertainment device with its own piece of furniture, to being a vehicle for social and intellectual discovery.

Fast-forward to today, and the future of the pocket looks eerily similar to what the future of the desktop looked like from the standpoint of a decade ago. WiMAX and 3G networks are poised to make mobile broadband ubiquitous in large urban centers, and two ISAs, one RISC and one CISC, are poised to do battle yet again on a new terrain that’s defined by a shift in how people use their computers.

Unlike in 1998, though, RISC vs. CISC actually matters now. A close look at the design of Intel’s newest mobile architecture, officially named Atom, will show why the decades-old “RISC vs. CISC” debate is suddenly interesting again, and in some entirely new ways. In this article, I’ll talk about the penalty that Intel’s new Atom ultramobile processor pays for its CISC legacy, and how Intel plans to reduce the impact of that penalty with simultaneous multithreading.

Hostile territory

Before I get directly into the topic of Atom’s microarchitecture, I want to take a moment to talk about the market at which future derivatives of Atom will eventually be aimed. Right now, with its 2W TDP, Atom isn’t much of an embedded processor in the classical sense, but some future version of it will be. Atom is more like the point of a wedge that Intel intends to drive down into the embedded market, with the x86 legacy code base acting as the mass behind that point and the company’s relentless march of process shrinks acting as the applied force.

“Embedded” processors are commonly so called because they are seamlessly integrated into a range of different appliance-like devices, from microwaves to cars to routers. Because they’re often crammed into all sorts of odd places where computers don’t normally go, such processors are typically very low-power and rugged. They also don’t need a lot of horsepower to carry out the kinds of tasks that they’re normally assigned, so they eschew the complexity and bulk of desktop processors in favor of simplicity, leanness, and the ability to use very little energy.

More recently, “embedded” has also come to refer to processors that are in mobile phones and handheld computers, and in that respect it now refers generically to processors that operate in the milliwatt range.

RISC architectures currently have the embedded space just as tightly sewed up as RISC architectures had the workstation space in the years before the Pentium Pro. The big three for embedded are ARM, PowerPC, and MIPS. Of these three, ARM is far and away the most popular for gadgets and consumer electronics. Empty your pockets and purse or briefcase of mobile gadgetry, and chances are there’s at least one ARM-based chip under the skin of each device with a battery and a screen. Whether you’re packing a Blackberry or a Nintendo DS Lite, you’re toting ARM hardware.

To ARM and its vast army of licensees, Intel’s mockup-heavy bluster about x86-powered UMPCs, MIDs, and smartphones might look just as laughably silly as the Pentium Pro and its successors looked to the likes of SGI and DEC, if it weren’t for the fact that x86 is now wearing the scalps of both now-defunct RISC powerhouses as trophies. No, in today’s world, when Intel aims both barrels of its formidable fab capacity directly at a new market segment, it’s not a joke to anyone who ends up in the crosshairs.

The Atom line of processors, formerly known by the codename of “Silverthorne,” is a shot across the bow of the embedded RISC players. It’s not a direct hit, and isn’t intended to be; but Intel is gunning for ARM and its cohorts, so let’s take a look at its opening salvo.

Intel’s Atom

When Intel unveiled the microarchitecture for Atom at ISSCC this year, they also served up some crow for me to eat along with it. Some readers may remember that I greeted Intel’s claims that “Silverthorne” (Atom’s codename) was a “new architecture from the ground up” with a bit of skepticism, and suggested that it was probably a derivative of an existing microarchitecture.

Intel Atom. Source: Intel

Clearly, I should’ve taken Intel at face value, because Atom is an entirely new design. It’s relatively simple and in-order, much like the original Pentium, but it’s also fairly deeply pipelined like more modern processors. Overall, it’s definitely clear that Atom’s designers tried to walk a fine line between modern features and performance on the one hand, and simplicity and power efficiency on the other. Only real-world benchmarks will indicate how well their tradeoffs worked, but a close look at Atom’s pipeline will show what x86 compatibility actually costs the processor.

Note: This article looks in detail only at the fetch and decode phases of Atom’s front end. For more on Atom as a whole, and on its execution engine, see this previous article.

Instruction fetch and decode

If you’re wondering why Atom’s power requirements are so much higher than those of a comparably complex ARM processor like the Cortex A-8, then part of the answer is here in the front end of the processor, specifically in the decode phase of the processor’s pipeline. In a nutshell, the decades-old x86 instruction set is filled with layer after layer of clutter, sort of like the garage of someone who never throws anything away. It takes a lot of (power-hungry) hardware to accommodate that clutter, hardware that more modern, leaner instruction sets like ARM don’t need.

Atom’s pipeline (16 stages, grouped by phase):

Instruction fetch: IF1, IF2, IF3
Decode: ID1, ID2, ID3
Dispatch: SC, IS
Reg. file: IRF
Data cache read: AG, DC1, DC2
Execute: EX1
Exceptions & MT: FT1, FT2
Write-back: IWB/DC

Atom’s three-cycle fetch phase fetches instructions from the 32K instruction cache into a set of prefetch buffers. These prefetch buffers then feed instructions into a predecode queue, where their lengths, which vary from instruction to instruction, are detected and marked at a rate of 2 instructions/cycle. (I’d also speculate that branches are detected and marked in this phase, as well, because that’s where it’s typically done).

Atom’s decoder

From this predecode queue, the instructions flow into the decoding hardware proper. Atom’s decoding block consists of two fast hardware decoders and one slow microcode ROM for longer, more complex instructions. Intel has not revealed what percentage of instructions go through the pair of fast decoders and what percentage go through the one slow decoder, but these percentages are fairly important. The fewer instructions use the microcode engine, the better.
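To get a feel for why that percentage matters, here is a minimal back-of-the-envelope sketch in Python. It is not based on any Intel data: the per-instruction cost of the microcode path (`microcode_cycles`) is a number I made up purely for illustration, and the model ignores everything except decode throughput.

```python
def decode_cycles(n_instructions, microcode_fraction,
                  fast_per_cycle=2, microcode_cycles=4):
    """Estimate decode cycles for a toy "two fast decoders + microcode" model.

    ASSUMPTIONS (not Intel figures): fast-path instructions decode at
    `fast_per_cycle` per cycle, and each microcoded instruction occupies
    the decode block for `microcode_cycles` cycles on its own.
    """
    slow = round(n_instructions * microcode_fraction)
    fast = n_instructions - slow
    fast_cycles = -(-fast // fast_per_cycle)  # ceiling division
    return fast_cycles + slow * microcode_cycles


def decode_rate(n_instructions, microcode_fraction, **kwargs):
    """Average instructions decoded per cycle under the same toy model."""
    return n_instructions / decode_cycles(n_instructions, microcode_fraction, **kwargs)


for fraction in (0.0, 0.05, 0.10, 0.25):
    rate = decode_rate(1000, fraction)
    print(f"{fraction:.0%} microcoded -> {rate:.2f} instructions/cycle")
```

With these hypothetical costs, sending just 10 percent of instructions through the microcode path pulls the average decode rate from 2.0 down to roughly 1.18 instructions per cycle. The real numbers could look quite different, but the shape of the curve is the point.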

It’s also not clear how many cycles instructions spend in the fast decoder, but the predecode and decode phases together take a total of 3 cycles.

Ultimately, the decode block as a whole (the two hardware decoders plus the microcode ROM) can send up to two instructions per cycle (from one thread or two) into a 16-entry instruction queue, before issuing one instruction per cycle to the execution units.

The costs of x86 compatibility

Even though Atom lacks an instruction window for dynamically reordering the instruction stream (more on this below), its decode block still appears to decompose the variable-length x86 instructions into fixed-length micro-ops (or “uops”) like the Pentium Pro and its successors. This means that the decode hardware described above is relatively more bulky than it might otherwise be. Specifically, RISC instructions don’t need to be predecoded, marked for length, and aligned for entry into the decoders because they’re all the same length to start with. So while RISC instructions on a comparable design (ARM or PowerPC) do sit in an instruction queue prior to decode, that queue is a much simpler structure.
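The difference between marking variable-length and fixed-length instructions can be sketched in a few lines of Python. The encoding below is a toy of my own invention, not real x86: it assumes the first byte of each instruction simply states the instruction's total length, whereas real x86 length detection has to reason about prefixes, opcodes, and addressing bytes. Even so, it shows the key dependency: you can't know where instruction N+1 starts until you've examined instruction N.

```python
def mark_variable_length(code: bytes) -> list[int]:
    """Return the start offset of each instruction in a toy variable-length stream.

    Toy encoding (NOT real x86): the first byte of every instruction is its
    total length in bytes. Each boundary depends on the previous
    instruction's length, so the scan is inherently sequential.
    """
    starts = []
    offset = 0
    while offset < len(code):
        length = code[offset]
        if length == 0 or offset + length > len(code):
            raise ValueError(f"bad length byte at offset {offset}")
        starts.append(offset)
        offset += length
    return starts


def mark_fixed_length(code: bytes, width: int = 4) -> list[int]:
    """Fixed-width instructions: boundaries are known without reading any bytes."""
    if len(code) % width:
        raise ValueError("stream is not a whole number of instructions")
    return list(range(0, len(code), width))


# Four toy instructions of lengths 1, 3, 2, and 5 bytes.
toy_stream = bytes([1, 3, 0xAA, 0xBB, 2, 0xCC, 5, 1, 2, 3, 4])
print(mark_variable_length(toy_stream))  # [0, 1, 4, 6]
print(mark_fixed_length(bytes(16)))      # [0, 4, 8, 12]
```

In hardware, the fixed-length case needs no marking logic at all, while the variable-length case needs dedicated circuitry to break that serial dependency fast enough to feed two decoders per cycle.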

Not only does the predecode hardware bulk up Atom’s front end, but the microcode ROM that’s needed for decoding all of those legacy, multicycle x86 instructions also has a relatively large footprint. This is yet another item that a comparable ARM processor can do without, thereby saving on hardware and size. Finally, the two hardware decoders have substantially more work to do in translating x86 instructions into fixed-length uops than they would if Atom’s execution engine were simpler and could accept instructions in a format closer to that of the original Pentium. All told, the extra hardware associated with the predecode queue and the decode block adds up to give Atom a pretty hefty power penalty over a comparable RISC design.

As a point of reference, Atom’s front end looks very much like that of the original Pentium Pro; it has the same structures (predecode queue, two hardware decoders, a microcode ROM, and a uop queue) as the older chip. The Pentium Pro paid a huge price in transistors relative to its RISC competitors for its x86-to-uops decode hardware. Intel was ultimately able to reduce this relative penalty to something very marginal by means of sheer fabrication muscle. The chipmaker aggressively shrank transistor sizes and added cache and other performance-enhancing hardware (execution units, deeper out-of-order buffers, robust branch prediction, etc.), with the end result that the size of the decode block didn’t grow that much between CPU generations while the rest of the chip ballooned around it, making the decode hardware a relatively smaller part of the chip as time went on.

In the milliwatt TDP regime, the actual size of the core counts, and as transistors get smaller they leak more current. So the old approach of using process shrinks to cram more non-decode-related hardware onto each new processor, so that the decode block becomes an ever smaller slice of a constantly expanding pie, doesn’t really work anymore. As a result, Intel has to find new ways to pay down the x86 decode penalty.

The new strategy that the company seems to have settled on for both Atom and for the small cores used in the forthcoming Larrabee graphics part is to use simultaneous multithreading to amortize the cost of the x86 decode block over multiple threads. Here’s how it works.

Simultaneous multithreading and performance/watt

While some folks see simultaneous multithreading (SMT) as a performance-enhancing feature, it’s really aimed at one thing: SMT increases execution unit utilization by funneling more instructions to the processor, so that the front end and back end both spend less time waiting on instructions and data to load from memory and more time doing useful work. More work to do means fewer wasted processor cycles, and fewer wasted cycles mean more power efficiency overall.
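The utilization argument can be illustrated with a small Python simulation. This is a deliberately crude sketch under my own assumptions: a single-issue, in-order toy core (so it is closer to interleaved multithreading than to Atom’s actual two-issue SMT), with a made-up workload in which every fourth instruction stalls its thread for six cycles as a stand-in for a cache miss.

```python
def run(threads):
    """Simulate a toy single-issue, in-order core.

    Each thread is a list of stall latencies: entry k is the number of
    extra cycles the thread must wait after issuing instruction k
    (0 means no stall). Returns (cycles until the last instruction
    issues, fraction of those cycles in which something issued).
    """
    pc = [0] * len(threads)        # next instruction index per thread
    ready_at = [0] * len(threads)  # first cycle each thread may issue again
    total = sum(len(t) for t in threads)
    issued = 0
    cycle = 0
    last = 0                       # thread that issued most recently
    while issued < total:
        # Round-robin: start looking at the thread after the last issuer.
        for i in range(len(threads)):
            t = (last + 1 + i) % len(threads)
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + 1 + threads[t][pc[t]]
                pc[t] += 1
                issued += 1
                last = t
                break
        cycle += 1                 # an idle cycle if no thread was ready
    return cycle, issued / cycle


# Hypothetical workload: every fourth instruction stalls for 6 cycles.
workload = [0, 0, 0, 6] * 5

one_cycles, one_util = run([workload])
two_cycles, two_util = run([workload, workload])
print(f"1 thread:  {one_cycles} cycles, utilization {one_util:.0%}")
print(f"2 threads: {two_cycles} cycles, utilization {two_util:.0%}")
```

In this toy setup the second thread raises utilization from roughly 45 percent to roughly 67 percent, because one thread’s instructions fill some of the cycles that the other thread would have spent stalled. Note that twice the work finishes in well under twice the time, which is where the efficiency gain comes from.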

So for workloads that are multithreaded, SMT can provide a real performance per watt boost, but it does so at a cost.

Simultaneous multithreading adds some extra bulk to Atom’s front end and execution core, because a number of key structures must be duplicated. Specifically, Atom contains two of each of the following structures, one for each thread:

Predecode queue

16-entry Instruction queue

64-bit Integer register file

128-bit Floating-point/SIMD register file

Atom’s designers clearly thought that the power efficiency advantage afforded by SMT was enough to offset the duplication of key structures required. I’m not a mind reader, but I suspect that there are two factors that probably motivated them to make this trade-off.

First, Atom has no instruction window that would have to be enlarged to accommodate SMT. An instruction window (a reorder buffer and a set of reservation stations) is an unusually costly feature, not just in terms of transistors but in terms of raw power consumption. Unlike many of the parts of the processor that go idle and can be shut down, the reorder buffer is active as long as the processor is executing code. So this fairly large structure, which sees constant use, has to grow even larger for SMT processors like Intel’s forthcoming Nehalem. Atom’s lack of an instruction window means that the addition of SMT is less costly than it would be otherwise.

The second factor is one that I alluded to in the previous section: SMT lets an x86 core amortize the high cost of the decode block over multiple threads. To understand what I’m talking about, it helps to consider briefly how Atom’s competition does things.

ARM’s flagship embedded part is the Cortex A-9, a modular, out-of-order core designed for multicore implementations. The A-9 was officially unveiled in late 2007, and I came away from a briefing on it fairly impressed. I haven’t seen any real A-9 benchmarks, but given how well the in-order Cortex A-8 does, I think it’s likely that an A-9 core will beat the pants off of an Atom core in performance/watt on single-threaded workloads.

Individual A-9 cores have a very, very small footprint, so that they can be easily ganged together in two- and four-core embedded systems-on-a-chip. ARM doesn’t support SMT with A-9 because they’d clearly rather just put down more, smaller cores instead. But an x86 design probably can never afford to take this approach, because of the high cost of the decode block and of the rest of the hardware needed to support the full range of x86 ISA extensions. I would hypothesize that it’s more power-efficient for an x86 design to attach two of each shared structure listed above to a single decode block and a single set of 64-bit and 128-bit datapaths, than to split those structures into two separate single-threaded cores. In contrast, an ARM design is better off with a “one thread, one core” model because it doesn’t need to amortize the cost of all that hardware over multiple threads.
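Here is one way to make that hypothesis concrete, as a rough Python sketch. Every number in it is invented for illustration: costs are in arbitrary units standing in for area or power, the split between `shared` and `per_thread` hardware is my own guess at the shape of the problem, and the assumed SMT throughput of 1.3x a single thread is a placeholder rather than a measured figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreCosts:
    """Hypothetical per-core costs in arbitrary units (area or power)."""
    shared: float      # decode block, datapaths, execution units
    per_thread: float  # predecode queue, instruction queue, register files


def two_cores(costs):
    """(throughput, cost) for two separate single-threaded cores."""
    return 2.0, 2 * (costs.shared + costs.per_thread)


def one_smt_core(costs, smt_speedup=1.3):
    """(throughput, cost) for one two-thread SMT core.

    `smt_speedup` is an ASSUMED throughput relative to one thread.
    """
    return smt_speedup, costs.shared + 2 * costs.per_thread


def efficiency(throughput_and_cost):
    throughput, cost = throughput_and_cost
    return throughput / cost


# Made-up profiles: the x86-like core carries a much heavier shared front end.
x86_like = CoreCosts(shared=10.0, per_thread=2.0)
arm_like = CoreCosts(shared=3.0, per_thread=2.0)

for name, costs in (("x86-like", x86_like), ("ARM-like", arm_like)):
    print(f"{name}: two cores {efficiency(two_cores(costs)):.3f}, "
          f"one SMT core {efficiency(one_smt_core(costs)):.3f} throughput/cost")
```

With these particular numbers the SMT core comes out ahead for the x86-like profile and behind for the ARM-like one. In this model the crossover depends only on the ratio of shared to per-thread cost and on the assumed SMT speedup, so different assumptions could easily flip the result.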

Conclusions

In the near to medium term, Atom and its successors will pay a relatively hefty price for their CISC legacy. More benchmarks will give a clearer picture of whether tricks like SMT, when combined with Intel’s ongoing process engineering leadership, will lower that price enough to enable x86 to squeeze ARM out of some future version of an iPhone-like device. In addition to SMT, the other microarchitectural trick that I see on the long-term horizon for Atom is the return of an instruction window; this will provide a real boost in performance/watt, and Intel may be able to introduce it at 32nm or later.

The question of how well Intel will fare in the renewed RISC vs. CISC battle that’s taking shape in the mobile space hinges largely on the answer to two questions:

How much is the legacy x86 code base really worth for mobile and ultramobile devices? The consensus seems to be “not much,” and I vacillate on this question quite a bit. This question merits an entire article of its own, though.

Will Intel retain its process leadership vs. foundries like TSMC, which are rapidly catching up to it in their timetables for process transitions? ARM, MIPS, and other players in the mobile device space that I haven’t mentioned, like NVIDIA, AMD/ATI, VIA, and PowerVR, all depend on these foundries to get their chips to market, so being one process node behind hurts them. But if these RISC and mobile graphics products can compete with Intel’s offerings on feature size, then that will neutralize Intel’s considerable process advantage.

This article is just my first attempt to take stock of one corner of the radically altered terrain on which this new battle plays out, so I’ll withhold any more detailed thoughts on the answers to these questions for a later date. Besides, I haven’t even talked about what VIA and AMD/ATI are up to. It’s also the case that GPUs will play a considerable part in the embedded processor wars, so any discussion that focuses solely on the processor microarchitecture will miss the mark. And because these GPUs will be on the same die as the processor core, their relationship to the processor core (both in terms of their physical integration and in terms of how the different vendors align with one another) is much different than it is on the desktop.

In all, it’s once again a great time to be a processor geek, and I look forward to another decade of providing fodder for a whole new round of platform wars.
