👤 Ankit Singh ( Senior Engineer - GPU Research ) 📅 Published: 10th Dec 2025 🔄 Updated: 13th Dec 2025

⏱️ 9-10 min read

#GPU #MobileGPU #GraphicsPipeline #TileBasedRendering #Adreno #Mali #Vulkan #RenderingOptimisation

A walkthrough of the graphics pipeline and the unique way mobile GPUs render scenes.

Topic too clichéd? Maybe. The idea was to write a refresher for myself so that I can come back to it (at least somebody should use it ;) ). But if you’ve spent any time in the GPU world (as a graphics programmer, hardware architect, performance engineer, yada, yada, yada), you’ll know the feeling: sooner or later, you always return to the basics of the pipeline. There’s something magical about how a humble vertex transforms into the breathtaking pixels lighting up your screen. Even after years of working on GPUs, that journey still feels like sorcery.

Triangle to pixels illustration
A triangle’s journey to pixels – the tiny magic trick every frame performs.

What began for me as simply memorising pipeline stages has become the foundation for every debug session, every bottleneck hunt, every architectural choice, and every rendering optimisation. And as I grew in my career, one question kept nagging me: if all mobile GPUs follow the same pipeline, why does one architecture outperform another?

So… drumroll please… this article won’t tell you which GPU is “better” — sorry for the clickbait 😅. Instead, I’ll walk you through the basic framework of the mobile GPU pipeline, and along the way, we’ll see how two giants — ARM Mali and Qualcomm Adreno — approach the same problems through very different lenses.

If you’re still with me (thank you), let’s dive into the good stuff.


1. How a GPU Understands Your Scene: The Core Inputs

The output of any great image depends on how thoughtfully the scene is constructed. A scene is made up of backgrounds, objects, lights, materials, and surface details. Each of these needs information at different levels of detail — geometry, colour, textures, illumination, normal maps, and many other attributes.

But don’t get overwhelmed at the start.

To create even a single image using a GPU, we first need to define the objects in the scene. At the most fundamental level, this comes down to three key ingredients:

a. Define where things are — Vertex Buffers
These store the positions of points (vertices) in 3D space. Together they form the skeleton or structure of every object in the scene.

b. Define how things look — Textures
Textures provide colours, patterns, roughness, normals, and other material properties that make objects appear believable without increasing geometric complexity.

c. Define how things are drawn efficiently — Index Buffers
Instead of duplicating vertices, index buffers let the GPU reuse shared vertices to form triangles. This saves memory and boosts rendering performance.

Core Data Structures at a Glance

| Component     | Definition | Example | Benefits | Contents |
| ------------- | ---------- | ------- | -------- | -------- |
| Vertex Buffer | Stores vertex data (points in 3D). Provides the raw geometry that defines shapes. | Cube corners | Enables geometry that can be culled, clipped, and transformed with minimal CPU→GPU transfers. | Position, Colour, Normals, UV coordinates |
| Index Buffer  | A list of integer references (indices) pointing into the vertex buffer to reuse vertices when forming triangles. | Square with shared corners | Reduces duplication, improves cache locality, and lowers memory usage. | List of integers |
| Textures      | Images (2D, 3D, or procedural) mapped onto geometry. Add realism and detail without increasing geometry complexity. | Brick wall texture | Optimise rendering via compression, filtering, and mipmapping. | Diffuse/Albedo, Normal, Roughness, Displacement maps |
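To make the "square with shared corners" example concrete, here is a minimal Python sketch. The vertex format (three 32-bit floats) and the 16-bit index size are my assumptions for illustration, not a requirement of any API.

```python
import struct

# Assumed layout: each vertex is just (x, y, z) as three 32-bit floats.
VERTEX_SIZE = struct.calcsize("<3f")  # 12 bytes
INDEX_SIZE = struct.calcsize("<H")    # 2 bytes for a 16-bit index

# Without an index buffer: 6 vertices, two corners are stored twice.
non_indexed = [
    (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0),  # triangle 1
    (0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0),  # triangle 2
]

# With an index buffer: 4 unique vertices plus 6 small integers.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
indices = [0, 1, 2, 0, 2, 3]


def expand(vertices, indices):
    """Rebuild the flat triangle list by looking each index up in the vertex buffer."""
    return [vertices[i] for i in indices]


non_indexed_bytes = len(non_indexed) * VERTEX_SIZE
indexed_bytes = len(vertices) * VERTEX_SIZE + len(indices) * INDEX_SIZE
```

With position-only vertices the saving is modest (72 bytes versus 60). It grows quickly once each vertex also carries normals, colours, and UVs, because the indices stay the same size.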

Once these are in place, the GPU has enough information to start its favourite job: turning triangles into pixels.

2. A Quick Tour of the Graphics Pipeline (GL / DX / Vulkan)

The GPU pipeline is the sequence of steps that transforms 3D data (vertices, buffers, textures) into the final 2D image displayed on your screen.

3D to 2D stages
Transforming 3D into 2D data

If you’re new to this whole “3D becoming 2D” idea, do yourself a favour — don’t scratch your head, scratch a pixel.
(Yes, I am absolutely pointing you to the legendary learning site Scratchapixel.com.)

The pipeline isn’t just a straight, simple conveyor belt. Each stage is deeply optimised, massively parallel, and often powered by specialised fixed-function hardware designed to do one thing insanely fast.

In the diagram:

Graphics pipeline stages diagram
A high-level view of the graphics pipeline from 3D data to 2D pixels.

Input Assembler

Gathers vertex and index buffers and forms primitives like triangles.

Fun detail:
This stage does zero math. It simply streams data from memory using dedicated fetch hardware, respecting cache lines and vertex reuse. Efficient layouts (AoS vs SoA, interleaved attributes, alignment) can noticeably affect bandwidth and cache behaviour.
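As a rough illustration of what "interleaved attributes" means in bytes, the sketch below packs a hypothetical vertex (position, normal, UV) into one buffer and addresses it by stride and offset. The attribute set and helper names are assumptions of mine. Real fetch hardware does this addressing in silicon, not in Python.

```python
import struct

# Hypothetical vertex: position (3 floats), normal (3 floats), UV (2 floats).
ATTRIBUTES = [("position", "3f"), ("normal", "3f"), ("uv", "2f")]


def interleaved_layout(attributes):
    """Return (stride, {name: byte offset}) for an interleaved (AoS) vertex."""
    offsets, cursor = {}, 0
    for name, fmt in attributes:
        offsets[name] = cursor
        cursor += struct.calcsize("<" + fmt)
    return cursor, offsets


def pack_interleaved(vertices):
    """Pack a list of (position, normal, uv) tuples into one byte buffer."""
    buf = bytearray()
    for position, normal, uv in vertices:
        buf += struct.pack("<3f3f2f", *position, *normal, *uv)
    return bytes(buf)


def fetch_attribute(buf, stride, offset, fmt, index):
    """Read one attribute of vertex `index` using stride/offset addressing."""
    return struct.unpack_from("<" + fmt, buf, index * stride + offset)


stride, offsets = interleaved_layout(ATTRIBUTES)  # 32-byte stride
```

In an SoA layout each attribute would live in its own tightly packed buffer instead. Which layout is friendlier to the cache depends on which attributes a given pass actually reads.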

Vertex Shader

Transforms vertices (scale, rotation, translation, and so on) into clip space and applies per-vertex operations.

Fun detail:
Vertex shaders run in SIMD lanes with typically low divergence. Modern GPUs aggressively batch vertices and cull invisible geometry early to avoid wasted work downstream.
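For a feel of the math a basic vertex shader performs, here is a small CPU-side sketch in Python. It assumes an OpenGL-style projection (camera looking down -Z, NDC depth in [-1, 1]) and row-major 4x4 matrices. The function names are mine, and a real shader would be written in GLSL or HLSL.

```python
import math


def mat_mul(a, b):
    """Multiply two 4x4 row-major matrices."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def mat_vec(m, v):
    """Multiply a 4x4 row-major matrix by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def translation(tx, ty, tz):
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def perspective(fov_y_deg, aspect, near, far):
    """OpenGL-style perspective projection (assumed convention)."""
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    return [[f / aspect, 0.0, 0.0, 0.0],
            [0.0, f, 0.0, 0.0],
            [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
            [0.0, 0.0, -1.0, 0.0]]


def vertex_shader(position, mvp):
    """Per-vertex work: object-space position to clip-space position."""
    x, y, z = position
    return mat_vec(mvp, [x, y, z, 1.0])


def perspective_divide(clip):
    """Fixed-function step after the vertex shader: clip space to NDC."""
    x, y, z, w = clip
    return (x / w, y / w, z / w)
```

For example, with `mvp = mat_mul(perspective(90.0, 1.0, 1.0, 3.0), translation(0.0, 0.0, -2.0))`, the vertex `(1.0, 1.0, 0.0)` ends up at roughly `(0.5, 0.5, 0.5)` in NDC after the divide.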

Vertex Shader diagram
Vertex Shader

Tessellation (Hull Shader → Tessellator → Domain Shader)

Refines or subdivides geometry surfaces on the fly (based on patch data). It consists of two programmable stages (Hull, Domain) and one fixed-function stage (the Tessellator).

Fun detail:
Tessellation is powerful on desktop/console (for displacement mapping, smooth characters, terrain) but sees far less use on mobile. Even on devices that expose it (it is part of OpenGL ES 3.2 and an optional feature in Vulkan), popular mobile games rarely rely on it.

Tessellation Shader diagram
Tessellation Shader

Geometry Shader

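Runs per primitive and can emit zero or more output primitives, which lets it discard, amplify, or change the type of the incoming geometry.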

Fun detail:
Despite its flexibility, the Geometry Shader is generally slow because it breaks the GPU’s optimal batching model and increases memory traffic. Many modern engines avoid it or replace it with compute-based mesh generation or Mesh Shaders (DX12/Vulkan).

Geometry Shader diagram
Geometry Shader

Rasterizer

Converts triangles into pixel-sized fragments. Geometry finally stops being math and starts becoming pixels.

Fun detail:
The rasterizer uses edge equations and barycentric interpolation in fixed-function hardware. It can perform early depth testing and kill large blocks of fragments before they reach the pixel shader — saving huge amounts of work.
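The edge-equation idea is easier to see in code. Below is a deliberately naive Python sketch that tests every pixel centre against three edge functions and derives barycentric weights from them. It skips things real rasterizers do (top-left fill rule, hierarchical traversal, fixed-point precision), and the function names are my own.

```python
def edge(a, b, p):
    """Edge function: positive when p lies to the left of the directed edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(v0, v1, v2, width, height):
    """Yield (x, y, (w0, w1, w2)) for each pixel centre inside a counter-clockwise triangle."""
    area = edge(v0, v1, v2)
    if area <= 0:
        return  # degenerate or clockwise: treated as culled in this sketch
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel centre
            w0 = edge(v1, v2, p)
            w1 = edge(v2, v0, p)
            w2 = edge(v0, v1, p)
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                # Normalised edge values are the barycentric weights.
                yield x, y, (w0 / area, w1 / area, w2 / area)


def interpolate(weights, a0, a1, a2):
    """Blend a per-vertex attribute (depth, colour channel, UV...) across the triangle."""
    return weights[0] * a0 + weights[1] * a1 + weights[2] * a2
```

The three weights always sum to one, which is why the same numbers can interpolate depth, colour, or texture coordinates. Perspective-correct interpolation needs an extra divide by `w` that is left out here.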

Pixel (Fragment) Shader

Runs per fragment, doing shading, texture sampling, lighting, BRDF evaluation, etc.

Fun detail:
This is the most expensive stage on most GPUs. Heavy texture access, complex material graphs, divergent branches, and overdraw can tank performance. Fragments are usually processed in 2×2 quads so derivatives (for mipmaps, etc.) can be computed cheaply.
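Here is a small sketch of why the 2×2 quad matters: derivatives come from differencing neighbouring fragments, and the mip level follows from how many texels one pixel step covers. The level-of-detail formula is a simplified version of the scale-factor approach described in graphics API specs, and hardware is free to approximate it. The function names are my own.

```python
import math


def quad_derivatives(uv_quad):
    """uv_quad[row][col] holds the (u, v) of each fragment in a 2x2 quad.

    Returns (ddx, ddy): the UV change for one pixel step in x and in y.
    """
    (u00, v00), (u10, v10) = uv_quad[0]  # top row: pixel x, pixel x+1
    (u01, v01), _ = uv_quad[1]           # bottom row: pixel x on the next line
    ddx = (u10 - u00, v10 - v00)
    ddy = (u01 - u00, v01 - v00)
    return ddx, ddy


def mip_level(ddx, ddy, tex_width, tex_height):
    """Pick a mip level from the texel footprint of a one-pixel step."""
    fx = math.hypot(ddx[0] * tex_width, ddx[1] * tex_height)
    fy = math.hypot(ddy[0] * tex_width, ddy[1] * tex_height)
    rho = max(fx, fy)
    if rho <= 0:
        return 0.0
    return max(0.0, math.log2(rho))
```

If one pixel step covers four texels of a 256×256 texture, this picks level 2, the 64×64 mip. This is also why helper fragments exist: even a triangle covering one pixel has to shade a whole quad so the differences can be taken.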

Pixel Shader diagram
Pixel/Fragment Shader

Output Merger

Performs depth/stencil tests, blending, MSAA resolves, and writes final colour values to the render target.

Fun detail:
Blending is still fixed-function and highly optimised for contiguous memory writes. Modern hardware can combine multiple fragment results before hitting memory, but unordered or random writes can still thrash the ROPs (Raster Operations Pipeline).
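The depth test and blend step can be sketched in a few lines of Python. I am assuming a less-than depth comparison and classic "source over" alpha blending. Both are configurable state on real hardware, and the function name is hypothetical.

```python
def output_merger(framebuffer, depthbuffer, x, y, src_rgba, src_depth, blend=True):
    """Depth-test one fragment, then blend it into the render target.

    Returns True if the fragment was written, False if it was rejected.
    Note: this sketch always writes depth. Engines often disable depth
    writes for translucent draws.
    """
    if src_depth >= depthbuffer[y][x]:
        return False  # hidden behind what is already there
    r, g, b, a = src_rgba
    if blend:
        dr, dg, db = framebuffer[y][x]
        # Source over: src * alpha + dst * (1 - alpha)
        framebuffer[y][x] = (r * a + dr * (1 - a),
                             g * a + dg * (1 - a),
                             b * a + db * (1 - a))
    else:
        framebuffer[y][x] = (r, g, b)
    depthbuffer[y][x] = src_depth
    return True
```

Notice that blending is a read-modify-write on the render target. Where that read and write land (DRAM or on-chip memory) is exactly what the tiling discussion later in this post is about.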

The final pixel values are written to the framebuffer — and a few milliseconds later, the display scans them out to your screen.

OpenGL vs DirectX Names (Cheat Sheet)

| Pipeline Stage                  | OpenGL Name                                           | DirectX (DX11/DX12) Name          |
| --------------------------------| ------------------------------------------------------ | --------------------------------- |
| Vertex Input                    | Vertex Specification (VAO / VBO / glVertexAttribPointer) | Input Assembler (IA)           |
| Vertex Shader                   | Vertex Shader (GLSL VS)                               | Vertex Shader (VS)                |
| Tessellation Control            | Tessellation Control Shader (TCS)                     | Hull Shader (HS)                  |
| Tessellator (Fixed Function)    | Tessellator                                           | Tessellator                       |
| Tessellation Evaluation         | Tessellation Evaluation Shader (TES)                  | Domain Shader (DS)                |
| Geometry Shader                 | Geometry Shader (GS)                                  | Geometry Shader (GS)              |
| Clipping & Projection           | Clipping + Perspective Divide (Fixed)                 | Clipping + Viewport Transform     |
| Rasterisation                   | Rasterizer                                            | Rasterizer Stage                  |
| Fragment / Pixel Shader         | Fragment Shader (FS)                                  | Pixel Shader (PS)                 |
| Depth / Stencil / Blend         | Per-Fragment Operations                               | Output Merger (OM)                |
| Render Output                   | Framebuffer                                           | Render Targets (RTV/DSV)          |

3. Rendering in Mobile GPUs

Before comparing how mobile GPU hardware works, we need to reveal the not-so-secret sauce behind almost every modern mobile GPU:

They almost never use classic Immediate Mode Rendering (IMR) — the pipeline mental model many of us learnt from desktop GPUs.

Instead, mobile GPUs lean heavily on Tile-Based Rendering (TBR).

Why? Because the hardware budgets are radically different.

With that kind of bandwidth and power gap, taking a naïve desktop-style IMR pipeline and dropping it into a phone would be catastrophic.

In a traditional IMR pipeline, every triangle and fragment typically involves:

On a 400 W desktop card, this is manageable.
On a 3 W mobile GPU, it kills performance and thermals.

Why Tile-Based Rendering Wins on Mobile

Tile-Based Rendering reorganises the pipeline around locality and on-chip memory:

Fewer DRAM accesses → less bandwidth → less heat → more sustained FPS.

This strategy makes sense for mobile because it:

Variants of Tile-Based Rendering

Different vendors implement tiling differently:

Why Tiling Changes Everything

This is why I’m stressing Tile-Based Rendering before jumping into the detailed pipeline. It’s not just a nice optimisation; it fundamentally reshapes mobile GPU architecture.

Because we must save bandwidth and keep as much work on-chip as possible, the GPU needs to avoid processing geometry or fragments that won’t contribute to the final image. This leads to a two-pass strategy:

1. Binning Pass

The screen is divided into small tiles. The GPU determines which triangles touch which tile, performing a coarse visibility classification. Only the geometry relevant to each tile is queued for further processing.
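A toy version of the binning pass fits in a few lines. The Python sketch below bins triangles by their screen-space bounding box, which is conservative: a triangle can land in a tile it does not actually cover. How real hardware stores and refines its bins is vendor-specific, so treat the data structure here as my assumption.

```python
import math
from collections import defaultdict


def bin_triangles(triangles, screen_w, screen_h, tile_size):
    """Map (tile_x, tile_y) to the indices of triangles whose bounding box touches that tile."""
    tiles_x = math.ceil(screen_w / tile_size)
    tiles_y = math.ceil(screen_h / tile_size)
    bins = defaultdict(list)
    for tri_index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Skip triangles whose bounding box is entirely off-screen.
        if max(xs) < 0 or max(ys) < 0 or min(xs) >= screen_w or min(ys) >= screen_h:
            continue
        tx0 = max(0, math.floor(min(xs) / tile_size))
        tx1 = min(tiles_x - 1, math.floor(max(xs) / tile_size))
        ty0 = max(0, math.floor(min(ys) / tile_size))
        ty1 = min(tiles_y - 1, math.floor(max(ys) / tile_size))
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(tri_index)  # submission order is preserved
    return dict(bins)
```

On a 64×64 screen with 16-pixel tiles, a small triangle near the origin ends up in a single bin, while an off-screen triangle ends up in none. Each tile can then be rendered while looking only at its own short list.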

2. Rendering Pass

For each tile, only the triangles binned to that tile enter rasterisation. Depth tests, blending, MSAA, and colour writes all happen inside the on-chip tile buffer. Only the final resolved tile is written to DRAM.
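To put a rough number on the saving, here is a toy model in Python that counts colour writes for the same draws in a naive immediate-mode renderer and in a tiled one. It ignores caches, framebuffer compression, depth traffic, and the cost of storing the bins, so take it as an illustration of the idea and not as a performance model. The draw format (axis-aligned rectangles at constant depth) is my simplification.

```python
def imr_dram_writes(draws, width, height):
    """Naive immediate-mode model: every fragment that passes the depth test writes to DRAM."""
    depth = [[float("inf")] * width for _ in range(height)]
    writes = 0
    for x0, y0, x1, y1, z in draws:  # rectangle [x0, x1) x [y0, y1) at depth z
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                if z < depth[y][x]:
                    depth[y][x] = z
                    writes += 1
    return writes


def render_tiled(draws, width, height, tile):
    """Tiled model: depth and colour stay on-chip, each tile is written to DRAM once."""
    on_chip_writes = 0
    dram_writes = 0
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            th = min(tile, height - ty)
            tw = min(tile, width - tx)
            depth = [[float("inf")] * tw for _ in range(th)]  # on-chip tile buffer
            for x0, y0, x1, y1, z in draws:
                for y in range(max(y0, ty), min(y1, ty + th)):
                    for x in range(max(x0, tx), min(x1, tx + tw)):
                        if z < depth[y - ty][x - tx]:
                            depth[y - ty][x - tx] = z
                            on_chip_writes += 1
            dram_writes += tw * th  # one resolve of the finished tile
    return on_chip_writes, dram_writes
```

With three full-screen layers drawn back to front on an 8×8 target, the immediate-mode model issues 192 DRAM writes. The tiled model does the same 192 writes on-chip and sends only 64 to DRAM.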

This architectural strategy drives a whole ecosystem of hardware choices:

These aren’t minor tweaks — they make mobile GPUs a different species compared to traditional desktop GPUs.

Tile-Based Rendering is therefore not just an optimisation; it’s the reason mobile GPUs can exist within a 2–5 W thermal envelope while still pushing console-like visuals.

Example of tile binning and triangle visibility
A naïve example showing how binning keeps only the triangles relevant to each tile.
Comparison of tiled vs immediate rendering
Another illustration comparing how tiled and immediate modes behave for the same scene.

Conclusion

So… this feels like a good place to end my first blog everrrrrrrrrrrr :) (Sorry, got excited... Ctrl Ctrl Uday).

Now you understand why I spent so much time talking about Tile-Based Rendering before even touching “How does the triangle actually move inside ARM vs Qualcomm?”.

If you don’t first appreciate why mobile GPUs are built the way they are, then explaining how triangles move through them makes no sense. The hardware is different because the constraints are different — and that single fact shapes everything else.

Yes, I did slightly clickbait you with the promise of ARM vs Qualcomm pipelines (guilty).
But don’t worry — I haven’t forgotten the real question you came here for:

“How does a triangle actually travel inside a mobile GPU?”

That is coming next.

Next episode teaser poster
Part 2 teaser – where the real architectural carnage begins.

And just like my favourite Bollywood saga — Gangs of Wasseypur Part 2 — I’m saving the real carnage, drama, and architectural twists for Part 2.

In the next chapter, I’ll break down how two giants of mobile graphics, Qualcomm Adreno and ARM Mali, take completely different approaches to the same fundamental question.

Can it be a blockbuster of a deep dive into their architectures?
We’ll see. 😉

Keep TRYing, stay WELL, and keep deBUGing.

