Thursday, 27 October 2011

Analyzing Bulldozer: Why AMD’s chip is so disappointing

AMD FX swoosh and Bulldozer die


AMD’s Bulldozer is finally here after years of development, and its performance is significantly worse than anyone expected. The situation is ugly enough that it may explain why so many executives left AMD over the past twelve months, and why the company was so tight-lipped about their departures. Bulldozer’s general performance has been widely covered; our goal here is to drill into why the CPU performs the way it does rather than to survey it across a wide range of real-world scenarios.
Note: AMD’s Turbo Core and Intel’s Turbo Mode were disabled on all chips, in order to prevent them from adjusting the CPU’s clock speed and throwing off results. As a consequence, the results here will be lower than in a standard review, particularly for single-thread performance. 
The first thing to understand about Bulldozer is that it leverages aspects of simultaneous multi-threading to combine the functions of what would normally be two discrete cores into a single package (AMD refers to this combination as a “module”). Each module contains what Windows identifies as two cores, but combining instruction scheduling and CPU resources has an impact on CPU scaling in multi-threaded tests when compared to the same programs running on “traditional” multi-core processors.
AMD Bulldozer
When AMD designed Bulldozer, it was aiming for a CPU that would be easier to ramp to higher frequencies while maintaining the same IPC (instructions per clock cycle) as its six-core predecessor. In order to hit higher clockspeeds, AMD lengthened the CPU’s pipeline and increased latencies throughout the architecture. The concept of building chips for higher frequency has had a bad rap since the disastrous Prescott Pentium 4; after seeing Bulldozer’s overall performance, AMD’s decision to take this route may not have been a very good one. As things stand, the FX-8150 struggles to surpass Thuban in a number of tests while its IPC definitely took a hit.
AMD Bulldozer
Before we dig into the CPU’s architecture, however, there’s an OS factor to discuss. According to AMD, Windows 7 doesn’t understand Bulldozer’s resource allocation very well. Windows 7 “sees” eight independent CPU cores, despite the fact that each module shares scheduling and execution resources. Sometimes it makes the most sense to spin threads off to idle modules before scheduling them on modules already busy with something else. Other times, it’s best to spin two related threads off to the same module. Windows 8 will apparently be much more proficient at scheduling workloads where it makes the most sense to execute them.
This issue has a practical impact on the CPU’s performance because of the way AMD’s Turbo Core is implemented. The new flavor of Turbo Core is meant to increase maximum clock speed by up to two speed grades if only four cores are enabled. Since Windows 7 doesn’t understand which cores to turn off, however, the CPU is less likely to increase its clock speed as high as it otherwise would. “Turbo” speeds were originally introduced by Intel as a way to squeeze more performance out of lightly-threaded or single-threaded workloads, but Bulldozer’s architecture makes those extra megahertz particularly important.
AMD Bulldozer performance
We checked the impact of Windows 7’s scheduler by measuring CPU performance in Maxwell Render 1.7 and Cinebench 11.5. Both programs allow the user to define a specific number of threads (four, in our case). The 4M/8C label means that all eight cores are active, 4M/4C means that all four modules are active, with one core operating per module, and 2M/4C denotes a dual-module/quad-core configuration. Both of these tests show a 4M/4C arrangement outperforming a 4M/8C system by roughly eight percent when four threads are used. This suggests that scheduler inefficiencies could indeed be hurting Bulldozer’s general performance in workloads that can’t take advantage of all eight cores.
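Configurations like 4M/4C and 2M/4C can be approximated in software by restricting a program to specific cores with an affinity mask. The following is a minimal sketch, not a description of how the tests above were run. It assumes the OS numbers the two cores of each module consecutively (cores 0 and 1 in module 0, cores 2 and 3 in module 1, and so on); that numbering and the helper names are illustrative assumptions.

```python
CORES_PER_MODULE = 2  # each Bulldozer module appears to the OS as two cores


def affinity_mask(core_ids):
    """Return a bitmask with one bit set for each allowed core."""
    mask = 0
    for core in core_ids:
        mask |= 1 << core
    return mask


def one_core_per_module(modules):
    """4M/4C style: only the first core of each module (assumed numbering)."""
    return [m * CORES_PER_MODULE for m in range(modules)]


def all_cores_in_modules(modules):
    """2M/4C style: both cores of the first `modules` modules."""
    return list(range(modules * CORES_PER_MODULE))


mask_4m4c = affinity_mask(one_core_per_module(4))   # cores 0, 2, 4, 6
mask_2m4c = affinity_mask(all_cores_in_modules(2))  # cores 0, 1, 2, 3
print(hex(mask_4m4c), hex(mask_2m4c))
```

On Windows, a hexadecimal mask of this kind can be passed to the `start /affinity` command. Pinning threads this way leaves the unused cores powered on, so it is only an approximation of disabling them outright.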
This performance discrepancy isn’t unique to AMD. Over on the Intel side of things, it can be faster to turn Hyper-Threading off if you’ve got an application that doesn’t scale above four cores. This issue hurts AMD more than Intel for two reasons. First, Bulldozer’s single-threaded performance lags Sandy Bridge by a substantial margin; the chip needs every bit of additional leverage it can get. Second, and more importantly, there’s the fact that Bulldozer’s performance almost always takes a heavy hit when running in 2M/4C mode. Let’s take a closer look, starting with what should be a near best-case scenario: DIEP.
DIEP is a chess simulator that calculates the potential position of every piece on the board through a sequence of moves. A ply depth of one means the program has calculated every potential move a single turn into the game; a ply depth of 15 means every potential move 15 turns deep. The program spins off a pre-defined number of independent threads and uses no floating point code, which makes it useful for examining Bulldozer’s integer performance in different configurations.
AMD Bulldozer diep
There’s a 22% gap between running DIEP on four separate modules and running it on two modules. Refer back to our Cinebench and Maxwell Render tests, and you’ll see similar gaps; it takes the 2M/4C configuration 20% longer to render our benchmark scene in Maxwell, and it’s 16% slower than the 4M/4C configuration in Cinebench. This is precisely where AMD’s more aggressive Turbo Core is meant to kick in, but given Windows 7’s imperfect scheduling, performance still takes a hit in a 2M/4C configuration, even if Turbo Core is on (we checked).
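The gaps quoted above are simple ratios between a 4M/4C run and a 2M/4C run. The sketch below shows how such figures would be derived from raw times or scores, and averages the three percentages from the text. The function names and the sample timing values are hypothetical and only illustrate the arithmetic; they are not measured results.

```python
def slowdown_from_times(t_separate, t_shared):
    """Percent longer the shared-module run takes (lower time is better)."""
    return (t_shared - t_separate) / t_separate * 100.0


def slowdown_from_scores(s_separate, s_shared):
    """Percent lower the shared-module score is (higher score is better)."""
    return (s_separate - s_shared) / s_separate * 100.0


# Gaps quoted in the text, in percent (2M/4C versus 4M/4C).
reported_gaps = {"DIEP": 22.0, "Maxwell Render": 20.0, "Cinebench": 16.0}
mean_gap = sum(reported_gaps.values()) / len(reported_gaps)
print(f"mean reported gap: {mean_gap:.1f}%")

# Hypothetical render times in seconds, chosen only to illustrate the formula.
print(f"example time-based slowdown: {slowdown_from_times(100.0, 120.0):.1f}%")
```

Time-based and score-based benchmarks need different formulas, which is why the two helpers are kept separate: a render that takes 20% longer is not the same as a score that is 20% lower.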
One of the major questions surrounding Bulldozer has been how much of a penalty the chip’s SMT-style arrangement creates compared to a typical multi-core processor. Our results suggest that Bulldozer takes a 15-20 percent hit compared to a standard multi-core configuration. That’s actually a pretty decent trade-off, particularly considering that this is the first Bulldozer-style CPU AMD has built. Unfortunately, the performance hit is significant enough to undermine AMD’s strategy of outflanking Intel by offering more CPU cores. Eight Bulldozer cores end up looking a lot like six Thuban cores, which is part of why AMD’s new chip struggles to pull away from its older cousin.
Cache latencies are likely another reason.

Cache size (and latency)

Bulldozer’s cache latencies are significantly higher than Thuban’s or Sandy Bridge’s, and the caches themselves are proportioned differently. Previous AMD processors had 64K instruction and 64K data caches for a total of 128K of L1 per core. Bulldozer, in contrast, has just 16K of L1 data cache per core and shares a 64K instruction cache per module. In theory, a small L1 can be enough (Sandy Bridge gets by with a 32K L1 data cache), but then, Sandy Bridge’s L2 and L3 caches are much faster than their AMD counterparts.
AMD Bulldozer
Note: All latencies measured by SiSoft Sandra 2011 SP5 using a random access pattern.
Bulldozer’s caches are significantly larger than Thuban’s, and a great deal slower as well. In theory, higher-latency caches allow the CPU to reach higher frequencies, but that doesn’t explain the entirety of the situation. Even at 4.6GHz, 40% faster than Thuban, Bulldozer’s L2 cache takes 40% longer to access. It’s impossible to estimate how much Bulldozer’s cache latencies are hurting the chip’s performance, but it’s likely significant. Slapping so much cache on the chip (16MB of it) has also made it much more expensive to manufacture. Bulldozer is smaller than Thuban (315mm² vs. 346mm²), but practically colossal compared to Sandy Bridge’s svelte 216mm². Bulldozer’s caches may prove to be an advantage in some server workloads, but we suspect the desktop version of the chip could afford to go on a diet.
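The L2 comparison above combines two quantities, clock frequency and access latency, and the text does not say whether the 40% latency figure is counted in clock cycles or in nanoseconds. The sketch below works through both readings. The 3.3GHz Thuban clock is the X6 1100T’s stock speed and is an assumption here, as are the function names.

```python
def time_ratio_from_cycle_ratio(cycle_ratio, clock_ratio):
    """Relative wall-clock latency when the latency gap is given in cycles."""
    return cycle_ratio / clock_ratio


def cycle_ratio_from_time_ratio(time_ratio, clock_ratio):
    """Relative cycle count when the latency gap is given in nanoseconds."""
    return time_ratio * clock_ratio


clock_ratio = 4.6 / 3.3   # overclocked Bulldozer vs. assumed 3.3GHz Thuban
latency_ratio = 1.4       # "40% longer", as quoted in the text

# Reading 1: 40% more cycles, so the higher clock roughly cancels it out.
as_time = time_ratio_from_cycle_ratio(latency_ratio, clock_ratio)
# Reading 2: 40% more nanoseconds, so the cycle count nearly doubles.
as_cycles = cycle_ratio_from_time_ratio(latency_ratio, clock_ratio)

print(f"if measured in cycles: {as_time:.2f}x the wall-clock latency")
print(f"if measured in time:   {as_cycles:.2f}x the cycle count")
```

Under the first reading the overclock only brings wall-clock L2 latency back to about Thuban’s level; under the second, each access costs nearly twice as many cycles. Neither reading is flattering for a chip running 40% faster.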
Comparative performance between the 2600K, the AMD X6 1100T, and Bulldozer (at stock and overclocked speeds) is below:
AMD Bulldozer
Disabling Turbo mode lets us accurately measure the IPC hit Bulldozer takes compared to Thuban. An 11% decrease is downright ugly considering how much Thuban already lagged Sandy Bridge. Extrapolating linearly, we’d need to push the FX-8150 to around 5.5GHz to match Sandy Bridge’s 3.4GHz performance in this test.
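The clock-speed estimate above is a straight-line extrapolation, and the arithmetic is easy to reproduce. The sketch below assumes performance scales perfectly with frequency, which is optimistic for any real chip. The 3.3GHz Thuban clock is the X6 1100T’s stock speed and does not come from the text; the function names are illustrative.

```python
def implied_per_clock_ratio(own_clock_ghz, rival_clock_ghz):
    """Rival's per-clock advantage implied by equal performance at these clocks."""
    return own_clock_ghz / rival_clock_ghz


def clock_to_match(rival_clock_ghz, relative_ipc):
    """Clock needed to match a rival, given own IPC as a fraction of the rival's."""
    return rival_clock_ghz / relative_ipc


# 5.5GHz of Bulldozer to equal 3.4GHz of Sandy Bridge, as quoted in the text.
ratio = implied_per_clock_ratio(5.5, 3.4)
print(f"implied Sandy Bridge per-clock advantage: {ratio:.2f}x")

# An 11% IPC deficit against Thuban means 0.89 of Thuban's per-clock work.
needed = clock_to_match(3.3, 0.89)
print(f"clock needed to match a 3.3GHz Thuban: {needed:.2f}GHz")
```

Read this way, the quoted figures imply that Sandy Bridge does roughly 1.6 times as much work per clock in this test, while matching Thuban requires only a modest frequency bump.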
DIEP Performance of Bulldozer
A 20% performance penalty for an SMT design is good—but Bulldozer’s IPC reduction leaves the eight-core chip unable to match Thuban in certain workloads.
AMD Bulldozer
Maxwell Render is one of the only tests where Bulldozer demonstrates a native performance advantage over Thuban. There are applications where Bulldozer shines — just not many of them.

Overclocking presents a dubious solution

AMD has gone out of its way to mention how well Bulldozer can overclock. While it’s true that BD hits much higher frequencies than Thuban (we had no trouble with 4.6GHz), the performance benefit from doing so isn’t large enough to hammer through the chip’s shortcomings on its own. This is particularly true considering the paucity of eight-threaded programs on the desktop. Single-threaded performance (and multi-threaded performance at the 2-4 core level) is more important than the total number of cores, and BD doesn’t do well here.
Push Bulldozer up to 4.6GHz and it definitively outstrips the Phenom II X6 1100T and gains ground against the Core i7-2600K. The problem with factoring overclocking into any conversation on CPU value, however, is that the other chips under consideration can be overclocked as well. Our Maxwell Render results suggest that there’s a range of software where an overclocked Bulldozer can deliver better performance than either Thuban or Intel’s quad-core Sandy Bridge — but it isn’t a very large space.

Sandra’s multimedia tests can be configured to showcase Bulldozer’s improvements. FPU performance is remarkably similar to Thuban’s, despite the fact that BD has just four FPUs compared to Thuban’s six, while x16 integer performance is higher, even, than Sandy Bridge’s.
Bulldozer’s improved SSE performance (above) and AVX support (below) may help the chip in some corner cases, but at least some AVX-enabled benchmarks, like the Kribi 3D tests available at inartis.com, are actually slower on Bulldozer when AVX is used than they are otherwise. It’s not clear if this is because Bulldozer’s AVX implementation is narrower than Intel’s, or because the chip’s SSE capabilities make that instruction set a better fit. Similarly, Bulldozer includes support for multiple new CPU instructions, but AMD’s ability to convince developers to adopt them and recompile code for optimum performance is limited.
SiSoft Sandra 2011, AMD Bulldozer
Bulldozer may technically feature AVX, but its comparative performance isn’t very good, even in integer code. The i7-2600K has four FPUs — just like AMD’s chip — but it can handle 256-bit AVX instructions without splitting them into 2×128-bit chunks.

Unpleasant reality

Bulldozer, and AMD with it, is stuck in an unenviable position. It doesn’t decisively outperform its predecessor, and AMD’s decision to trade IPC for clock speed didn’t pay off. As a result, Bulldozer’s single-threaded performance is worse than that of the processor it replaces. Higher clock speeds would help Bulldozer pull past Thuban’s single-thread performance, but the gap between BD and Sandy Bridge is much too large to be bridged by operating frequencies.
Bulldozer has been compared to the Pentium 4 on multiple occasions — including by us — but their similarities only go so far. When the P4 debuted, it struggled to surpass its predecessor despite a 50 percent frequency advantage over the 1GHz Pentium III. Bulldozer is nowhere near as drastic. A lower latency L2 cache, possibly combined with a larger L1 data cache, would likely result in a significant speed increase, while cutting down on the total amount of L2 would save die space and reduce cost.

It may seem counter-intuitive to stress the importance of improving single-threaded performance after all the emphasis AMD has put on multi-threading, but it’s probably the company’s best bet. There’s no point in stuffing more cores into desktops, no changing Windows 7’s sub-optimal scheduling, and Bulldozer is unlikely to push above 4.2GHz before its successor, Piledriver, is available (if it even gets that high).
If AMD wants to compete effectively with Intel, it has to push Bulldozer’s IPC in a positive direction. This is doubly important for mobile parts, where TDP limits won’t allow for the same degree of overclocking and Llano already operates at a significant performance disadvantage compared to Sandy Bridge. It may not be possible to improve much on the 20 percent scaling hit Bulldozer takes compared to Thuban, which makes single-thread IPC all the more vital.
Overall, the current generation of Bulldozer parts isn’t going to do much for AMD’s desktop position. They’re slightly smaller and a bit faster in some workloads. Enthusiasts who love AMD and overclocking will benefit the most while everyone else waits to see what the first mobile Bulldozer parts are like. AMD’s next-generation mobile CPU, codenamed Trinity, is effectively a second-generation Bulldozer product and incorporates the advances that’ll eventually debut on the desktop as the Piledriver core. AMD frankly has its work cut out for it: Trinity will need to deliver at least a 10-15% performance improvement over Bulldozer if it’s to serve as an effective replacement for Llano.



