Ever since 3dfx debuted the original Voodoo accelerator, no single piece of equipment in a PC has had as much of an impact on whether your machine could game as the humble graphics card. While other components absolutely matter, a top-end PC with 32GB of RAM, a $4,000 CPU, and PCIe-based storage will choke and die if asked to run modern AAA titles on a ten-year-old card at modern resolutions and detail levels. Graphics cards, aka GPUs (Graphics Processing Units), are so critical to game performance that we cover them extensively. But we don’t often dive into what makes a GPU tick and how the cards function.

By necessity, this will be a high-level summary of GPU functionality that covers information common to AMD, Nvidia, and Intel’s integrated GPUs, as well as any discrete cards Intel might build in the future based on the Xe architecture. It should also be broadly applicable to the mobile GPUs built by Apple, Imagination Technologies, Qualcomm, ARM, and other vendors.

Why Don’t We Run Rendering With CPUs?

The first point I want to address is why we don’t use CPUs for rendering workloads in gaming in the first place. The honest answer to this question is that you can run rendering workloads on a CPU. Early 3D games that predate the widespread availability of graphics cards, like Ultima Underworld, ran entirely on the CPU. UU is a useful reference case for multiple reasons: it had a more advanced rendering engine than games like Doom, with full support for looking up and down, as well as then-advanced features like texture mapping. But this kind of support came at a heavy price: a lot of people lacked a PC that could actually run the game.

Ultima Underworld. Image by GOG

In the early days of 3D gaming, many titles like Half-Life and Quake II featured a software renderer to let players without 3D accelerators play the game. The reason we dropped this option from modern titles is simple: CPUs are designed to be general-purpose microprocessors, which is another way of saying they lack the specialized hardware and capabilities that GPUs offer. A modern CPU could easily handle titles that tended to stutter when running in software 18 years ago, but no CPU on Earth could easily handle a modern AAA game run in that mode. Not, at least, without some drastic changes to the scene, resolution, and other visual effects.

As a fun example of this: The Threadripper 3990X is capable of running Crysis in software mode, albeit not all that well.

What’s a GPU?

A GPU is a device with a set of specific hardware capabilities intended to map well to the way various 3D engines execute their code, including geometry setup and execution, texture mapping, memory access, and shaders. There’s a relationship between the way 3D engines function and the way GPU designers build hardware. Some of you may remember that AMD’s HD 5000 family used a VLIW5 architecture, while certain high-end GPUs in the HD 6000 family used a VLIW4 architecture. With GCN, AMD changed its approach to parallelism in the name of extracting more useful performance per clock cycle.

AMD’s follow-up architecture to GCN, RDNA, doubled down on the idea of boosting IPC, with instructions dispatched every clock cycle. This improved IPC by 25 percent. RDNA2 built on these gains and added features like a huge L3 cache to increase performance further.

Nvidia first coined the term “GPU” with the launch of the original GeForce 256 and its support for performing hardware transform and lighting calculations on the GPU (this corresponded roughly to the launch of Microsoft’s DirectX 7). Integrating specialized capabilities directly into hardware was a hallmark of early GPU technology. Many of those specialized technologies are still employed, in very different forms, because it’s more power-efficient and faster to have dedicated resources on-chip for handling specific types of workloads than to attempt to handle all of the work in a single array of programmable cores.

There are a number of differences between GPU and CPU cores, but at a high level, you can think of them like this. CPUs are typically designed to execute single-threaded code as quickly and efficiently as possible. Features like SMT / Hyper-Threading improve on this, but we scale multi-threaded performance by stacking more high-efficiency single-threaded cores side by side. AMD’s 64-core / 128-thread Epyc CPUs are the largest you can buy today. To put that in perspective, the lowest-end Pascal GPU from Nvidia has 384 cores, while the highest core-count x86 CPU on the market tops out at 64. A “core” in GPU parlance is a much smaller processor.

Note: You can’t compare or estimate relative gaming performance between AMD, Nvidia, and Intel simply by comparing the number of GPU cores. Within the same GPU family (for example, Nvidia’s GeForce GTX 10 series, or AMD’s RX 4xx or 5xx families), a higher GPU core count means that GPU is more powerful than a lower-end card. Comparisons based on FLOPS are suspect for reasons discussed here.

The reason you can’t draw immediate conclusions about GPU performance between manufacturers or core families based solely on core counts is that different architectures are more or less efficient. Unlike CPUs, GPUs are designed to work in parallel. Both AMD and Nvidia structure their cards into blocks of computing resources. Nvidia calls these blocks an SM (Streaming Multiprocessor), while AMD refers to them as a Compute Unit.

A Pascal Streaming Multiprocessor (SM).


Each block contains a group of cores, a scheduler, a register file, instruction cache, texture and L1 cache, and texture mapping units. The SM / CU can be thought of as the smallest functional block of the GPU. It doesn’t contain literally everything: video decode engines, the render outputs required for actually drawing an image on-screen, and the memory interfaces used to communicate with onboard VRAM are all outside its purview. But when AMD describes an APU as having 8 or 11 Vega Compute Units, this is the (equivalent) block of silicon it’s talking about. And if you look at a block diagram of a GPU, any GPU, you’ll notice that it’s the SM/CU that’s duplicated a dozen or more times in the image.

And here’s Pascal, full-fat edition.

The higher the number of SM/CU units in a GPU, the more work it can perform in parallel per clock cycle. Rendering is a type of problem that’s sometimes referred to as “embarrassingly parallel,” meaning it has the potential to scale up extremely well as core counts increase.
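To make “embarrassingly parallel” concrete, here’s a minimal sketch in Python. The shade() function is a made-up placeholder rather than anything a real engine runs, but it shows why the work divides so cleanly: each pixel’s color can be computed without knowing anything about its neighbors, so a GPU can hand every pixel to a different core.

```python
# Minimal sketch: per-pixel work is independent, which is what makes
# rendering "embarrassingly parallel." shade() is a placeholder, not a
# real shader.

WIDTH, HEIGHT = 1920, 1080

def shade(x, y):
    # Compute a color purely from this pixel's own coordinates.
    return (x / WIDTH, y / HEIGHT, 0.5)

# A CPU walks this loop a handful of pixels at a time; a GPU hands each
# (x, y) pair to one of its thousands of cores and runs them together.
framebuffer = [[shade(x, y) for x in range(WIDTH)] for y in range(HEIGHT)]
```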

When we discuss GPU designs, we often use a format that looks something like this: 4096:160:64. The GPU core count is the first number. The larger it is, the faster the GPU, provided we’re comparing within the same family (GTX 970 versus GTX 980 versus GTX 980 Ti, RX 560 versus RX 580, and so on).
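For a rough sense of what that first number implies on paper, here’s a quick back-of-the-envelope sketch. The 1.5GHz clock is an assumption we picked for illustration, and counting two FLOPs per core per clock reflects the usual fused multiply-add convention for quoting peak FP32 throughput. As noted above, these paper figures don’t translate cleanly into gaming performance across architectures.

```python
# Back-of-the-envelope peak FP32 throughput for a hypothetical
# 4096:160:64 GPU. The clock speed is an assumed example value.

cores, tmus, rops = (int(n) for n in "4096:160:64".split(":"))
clock_ghz = 1.5  # assumed for illustration

# Two FLOPs per core per clock (one fused multiply-add).
peak_tflops = cores * 2 * clock_ghz / 1000
print(f"{cores} cores at {clock_ghz} GHz ~ {peak_tflops:.1f} TFLOPS FP32 peak")
# Prints: 4096 cores at 1.5 GHz ~ 12.3 TFLOPS FP32 peak
```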

Texture Mapping and Render Outputs

There are a couple of other major components of a GPU: texture mapping units and render outputs. The number of texture mapping units in a design dictates its maximum texel output and how quickly it can address and map textures onto objects. Early 3D games used very little texturing because the job of drawing 3D polygonal shapes was difficult enough. Textures aren’t actually required for 3D gaming, though the list of games that don’t use them in the modern era is vanishingly small.

The number of texture mapping units in a GPU is given by the second figure in the 4096:160:64 metric. AMD, Nvidia, and Intel typically shift these numbers equivalently as they scale a GPU family up and down. In other words, you won’t really find a scenario where one GPU has a 4096:160:64 configuration while a GPU above or below it in the stack has a 4096:320:64 configuration. Texture mapping can absolutely be a bottleneck in games, but the next-highest GPU in the product stack will typically offer at least more GPU cores and texture mapping units (whether higher-end cards have more ROPs depends on the GPU family and the card configuration).

Render outputs (also sometimes called raster operations pipelines) are where the GPU’s output is assembled into an image for display on a monitor or television. The number of render outputs multiplied by the clock speed of the GPU controls the pixel fill rate. A higher number of ROPs means that more pixels can be output simultaneously. ROPs also handle antialiasing, and enabling AA, especially supersampled AA, can leave a game fill-rate limited.
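Sticking with the same hypothetical 4096:160:64 card and an assumed 1.5GHz clock, here’s how the second and third numbers turn into the theoretical texture and pixel fill rates described above. These are peak figures, not measured performance.

```python
# Theoretical fill rates for a hypothetical 4096:160:64 GPU at an
# assumed 1.5 GHz clock.

tmus, rops = 160, 64
clock_ghz = 1.5  # assumed for illustration

texel_rate = tmus * clock_ghz  # gigatexels per second
pixel_rate = rops * clock_ghz  # gigapixels per second
print(f"Texture fill rate: {texel_rate:.0f} GT/s")  # 240 GT/s
print(f"Pixel fill rate:   {pixel_rate:.0f} GP/s")  # 96 GP/s

# 4x supersampled AA renders roughly four times as many pixels per frame,
# which is why heavy AA pushes a game toward being fill-rate limited.
```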

Memory Bandwidth, Memory Capacity

The last components we’ll discuss are memory bandwidth and memory capacity. Memory bandwidth refers to how much data can be copied to and from the GPU’s dedicated VRAM buffer per second. Many advanced visual effects (and higher resolutions more generally) require more memory bandwidth to run at reasonable frame rates because they increase the amount of data being copied into and out of the GPU core.

In some cases, a lack of memory bandwidth can be a substantial bottleneck for a GPU. AMD’s APUs like the Ryzen 5 3400G are heavily bandwidth-limited, which means increasing your DDR4 clock rate can have a substantial impact on overall performance. The choice of game engine can also have a substantial effect on how much memory bandwidth a GPU needs to avoid this issue, as can a game’s target resolution.
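To see why DDR4 speed matters so much to an APU, here’s a quick sketch comparing spec-sheet bandwidth figures. The memory configurations are generic examples rather than measurements from the 3400G, and keep in mind that an integrated GPU also shares that bandwidth with the CPU.

```python
# Peak memory bandwidth = bus width (in bytes) x effective transfer rate.
# All figures are spec-sheet examples, not measured results.

def bandwidth_gbs(bus_width_bits, transfers_per_sec):
    return bus_width_bits / 8 * transfers_per_sec / 1e9

# Dual-channel DDR4 (2 x 64-bit) feeding an APU's integrated graphics:
print(f"DDR4-2400 dual channel: {bandwidth_gbs(128, 2.4e9):.1f} GB/s")  # 38.4
print(f"DDR4-3200 dual channel: {bandwidth_gbs(128, 3.2e9):.1f} GB/s")  # 51.2

# A midrange discrete card with 14 Gbps GDDR6 on a 256-bit bus:
print(f"GDDR6 14 Gbps, 256-bit: {bandwidth_gbs(256, 14e9):.1f} GB/s")   # 448.0
```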

The amount of onboard memory is another critical factor in GPUs. If the amount of VRAM needed to run at a given detail level or resolution exceeds available resources, the game will often still run, but it’ll have to use the CPU’s main memory for storing additional texture data, and it takes the GPU vastly longer to pull data from DRAM than from its onboard pool of dedicated VRAM. This leads to massive stuttering as the game staggers between pulling data from the quick pool of local memory and general system RAM.
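For a sense of scale, here’s a rough sketch of how much VRAM the render targets alone can consume at 4K. The buffer list is illustrative rather than pulled from any particular engine, and real games stack textures, geometry, and shader data on top of it.

```python
# Rough VRAM footprint of 4K render targets. The set of buffers below is
# an illustrative example, not any specific engine's configuration.

WIDTH, HEIGHT = 3840, 2160
BYTES_PER_PIXEL = 4  # 8-bit RGBA

def buffer_mb(count=1, bpp=BYTES_PER_PIXEL):
    return WIDTH * HEIGHT * bpp * count / (1024 ** 2)

color_mb = buffer_mb(2)    # double-buffered swap chain, ~63 MB
depth_mb = buffer_mb(1)    # depth/stencil buffer, ~32 MB
gbuffer_mb = buffer_mb(4)  # deferred-rendering G-buffer targets, ~127 MB

total = color_mb + depth_mb + gbuffer_mb
print(f"Render targets alone: ~{total:.0f} MB")  # ~221 MB before any textures
```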

One aspect to be aware of is that GPU manufacturers will sometimes equip a low-end or midrange card with more VRAM than is otherwise standard as a way to charge a bit more for the product. We can’t make a blanket prediction as to whether this makes the GPU more attractive because, honestly, the results vary depending on the GPU in question. What we can tell you is that in many cases, it isn’t worth paying more for a card if the only difference is a larger RAM buffer. As a rule of thumb, lower-end GPUs tend to run into other bottlenecks before they’re choked by limited available memory. When in doubt, check reviews of the card and look for comparisons of whether the 2GB version is outperformed by the 4GB flavor, or whatever the relevant amount of RAM might be. More often than not, assuming all else is equal between the two solutions, you’ll find the larger RAM loadout not worth paying for.

Check out our ExtremeTech Explains series for more in-depth coverage of today's hottest tech topics.