Introduction to GPU programming on Nvidia hardware with Julia
This post is a reformatted and cut down version of the half-day tutorial called Parallel and Fast given to the Australian Institute of Marine Science (AIMS).
In this post we briefly discuss why parallelism matters, then delve into the modern Nvidia architecture for GPUs, using a Mandelbrot visualization as an example.
Why parallel?
For decades, computers got faster mainly because chipmakers squeezed more transistors onto chips and raised their clock speeds --- this is Moore's law. But after the early 2000s, increasing clock speeds further ran into heat and power dissipation limits. To keep increasing performance, CPU manufacturers instead started putting multiple cores on CPUs: identical and somewhat isolated execution units that can do work independently. In essence, it became harder to make individual cores faster, so manufacturers started putting many of them on one die.
GPUs, which were designed for graphics, also became much more powerful and began to be used for general-purpose computing.
Now, to speed up programs beyond what a single core can do, we have to use parallel computing across multiple cores, CPUs, GPUs, or even multiple machines working together --- this is called distributed computing. For a further discussion, see the article by Sutter et al. (2005).
This brings a whole new set of challenges to writing programs: we can no longer just write code that runs linearly in the obvious order. Instead, we have to come up with algorithms whose parts can run in parallel on different logical computation units. This brings with it a whole new horde of bugs, but also incredible time savings and throughput improvements, and even the ability to do computations that would otherwise be impossible (e.g. training LLMs would never be feasible on a single-threaded CPU).
History of GPGPU
GPGPU stands for General Purpose GPU computing. Originally, GPUs were built for accelerating graphical workloads, and that was all they were really good at: texture mapping, shading, rendering polygons and such for gaming and professional applications (CAD, architecture, rendering, etc.). They were highly optimized for these kinds of computations, which often involve computing small matrix multiplications over and over again. GPU performance was measured in things like polygons per second (which is a bit of a nonsensical measure). See Vuduc and Choi (2013).
Around the early 2000s, some hackers and researchers started coming up with novel ways of using GPUs for non-graphics applications. These worked largely by writing custom pixel shaders: small, very restrictive but general-purpose programs intended to manipulate fragments of textures and bitmaps. These applications creatively abused the graphics pipeline to do simple parallel computations. For example:
- 2001: Martin Rumpf et al. from the University of Duisburg used GPUs to solve FEM calculations (Rumpf and Strzodka (2001));
- 2003: Mark Harris et al. from UNC used GPUs to simulate cloud dynamics (Harris et al. (2003));
- 2004: Debra Cook et al. from Columbia used GPUs to speed up cryptography (Cook and Keromytis (2006));
- 2004: Ian Buck et al. from Stanford created Brook, an early GPU programming system whose kernels were compiled directly for the GPU by a custom compiler (Buck et al. (2004)); and
- 2005: Nico Galoppo et al. from UNC used GPUs to solve dense linear systems (Galoppo et al. (2005)).
Buck in particular showed through Brook (developed during his PhD) that it is possible to implement operations like linear algebra and the FFT on GPUs with this methodology. He was quickly hired by NVIDIA and became one of the core people behind CUDA.
The stats on the number of floating point operations (FLOPs) a GPU could do compared to a CPU made it a very attractive platform for early research. This was directly enabled by the parallelism story on GPUs compared to CPUs --- they were parallel from the beginning. A 2003 article from the magazine Computer states "The new GPUs are very fast. Stanford's Ian Buck calculates that the current GeForce FX 5900's performance peaks at 20 Gigaflops, the equivalent of a 10-GHz Pentium", from Macedonia (2003) (the whole article is a very interesting read).
The cryptanalysis and hacker communities also used GPUs for computing rainbow tables and brute-forcing passwords and hash pre-images¹. This was exacerbated by the Crypto Wars: futile efforts by the United States government to limit the availability of strong encryption to the general public and in export products (cryptography is still classified as a dual-use technology and its export is somewhat restricted). The new availability of highly parallel hardware, combined with new ideas around this time, made for practically feasible attacks on widely deployed cryptosystems, such as cracking passwords in networked Microsoft operating systems (e.g. LAN Manager), cracking early WiFi encryption (WEP, and later WPA), and breaking commonly used hashes (such as the abysmal MD5). Cryptography (and the laws around it) has come a long way since then.
A crash course to modern NVIDIA architectures
GPUs are largely fast for two reasons:
- Massive Parallelism: instead of having a few cores that implement a very powerful and rich instruction set (e.g. 2-32 on CPUs), GPUs have thousands of small, simple cores grouped into "streaming multiprocessors" (e.g. an NVIDIA RTX 3090 has 10496 CUDA cores spread across 82 streaming multiprocessors).
- Massive Memory Bandwidth: a high end workstation has 50-100 GB/s of memory bandwidth, compared to 1000 GB/s+ on GPUs (old school RTX 3090 has ~900 GB/s, a top of the line B200 has 8000 GB/s).
GPUs achieve these incredible stats with a completely different architecture to CPUs.
Single Instruction, Multiple Thread (SIMT)
The core of GPU parallelism lies in the Single Instruction, Multiple Thread (SIMT) execution model. Unlike CPUs where each core executes independent instructions, a GPU's Streaming Multiprocessor (SM) operates on groups of threads called warps (32 threads). All threads within a warp execute the same instruction simultaneously, but on different data.
A CPU has a very sophisticated superscalar pipeline for executing a rich instruction set, built to run very general, branching programs as fast as possible. A GPU, on the other hand, is built to run essentially the same code on different data, extremely efficiently.
Computational model
The NVIDIA GPU execution model organizes computation in a hierarchy:
- Grid: The entire computation defined by the user. A grid is composed of independent thread blocks.
- Thread Block (Block): A group of threads that can cooperate via shared memory and synchronization. All threads in a block execute on the same Streaming Multiprocessor (SM, see below). Max 1024 threads/block.
- Warp: The fundamental unit of execution on an SM. A warp consists of 32 threads that execute the same instruction in lock-step (i.e. SIMT). If threads in a warp diverge (e.g. via if-else statements), the execution paths are serialized (run one after the other: first the if branch, then the else branch), impacting performance.
- Streaming Multiprocessor (SM): The core processing unit of the GPU. An SM contains:
- CUDA Cores: For general-purpose scalar/vector operations (FP32, INT32, etc.).
- Tensor Cores: Specialized units for matrix operations, particularly for ML.
- Load/Store Units (LSUs): For memory access.
- Special Function Units (SFUs): For transcendental functions.
- Shared Memory/L1 Cache: Fast, on-chip memory accessible by threads within the same block.
- Warp Schedulers: Manage and dispatch warps to available execution units. An SM can concurrently execute multiple active warps (up to its resource limits).
- CUDA Cores / Tensor Cores / LSUs / SFUs: These are the actual execution units within an SM that perform the computations dictated by the warp's instructions.
Hierarchy Flow:
- A Grid is launched onto the GPU.
- The GPU schedules Thread Blocks onto available SMs. An SM can host multiple blocks concurrently, limited by its resources (registers, shared memory).
- Each Thread Block is further divided into Warps (32 threads).
- SMs fetch and dispatch instructions from active Warps to their respective CUDA Cores, Tensor Cores, LSUs, and SFUs for execution.
This hierarchy allows for massive parallelism by distributing work across many SMs, and within each SM, by executing multiple warps concurrently and leveraging specialized hardware units.
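If you want to see how these concepts map onto your own hardware, CUDA.jl (introduced properly later in this post) can query them. Below is a minimal sketch of my own, assuming a working CUDA.jl installation; the attribute constants come from the CUDA driver API wrappers and may vary slightly between CUDA.jl versions.

using CUDA

dev = CUDA.device()
println("GPU: ", CUDA.name(dev))
println("Streaming multiprocessors: ", CUDA.attribute(dev, CUDA.DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT))
println("Warp size: ", CUDA.warpsize(dev))
println("Max threads per block: ", CUDA.attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK))

On an RTX 3090 you would expect to see 82 streaming multiprocessors, a warp size of 32, and a maximum of 1024 threads per block.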
Quick reminder on CPU architecture and memory bandwidth
Here is what a CPU cache hierarchy looks like:
![Multicore CPU cache hierarchy](multicore-cache.png)
Consider my main desktop built around 2020, consisting of an AMD Ryzen 5800X, and 64 GB of DDR4 RAM at 3200 MT/s. Here are the specs:
- Cores/SMT threads: 8/16
- L1 cache: 64 KB per core (32 KB inst/32 KB data)
- L2 cache: 512 KB per core
- L3 cache: 32 MB total
- CPU cache line: 64 bytes
- RAM: dual-channel DDR4
- Memory data path width: 64 bits per channel
- Memory speed: 3200 MT/s
Modern CPUs read memory one cache line at a time: that is, 64 bytes, or 512 bits. When my CPU fetches some data, it first looks in L1 cache, then L2 cache, then L3 cache. If it doesn't find it in any of them, it issues a DDR burst to grab a cache line worth of data.
When the CPU issues a DDR read command, the memory module will transfer 64 bits (that's the memory data path per channel) of data at a time from RAM to the CPU. The CPU generally reads one cache line at a time so 64 bytes, which is 8x64 bit reads.
With two DDR4 modules (sticks of RAM), in dual-channel mode, we can read twice the bytes. 3200 MT/s means 3200 million transfers per second. This gives \(3200\times 2\times 64/8\) million bytes per second, or roughly 50 GB/s of sequential read from RAM. This is the theoretical maximum rate at which we can process large datasets on my PC.
This is just the maximum bandwidth. Random reads will be much slower; a good rule of thumb is about a tenth of that. This also does not account for latency. See these well-known latency numbers:
From <https://gist.github.com/jboner/2841832>
Latency Comparison Numbers
----------------------------------
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory reference 100 ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy 3,000 ns 3 us
Send 1K bytes over 1 Gbps network 10,000 ns 10 us
Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
Read 1 MB sequentially from memory 250,000 ns 250 us
Round trip within same datacenter 500,000 ns 500 us
Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory
Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms
GPU memory architecture
While our CPU's memory bandwidth of approximately 50 GB/s for sequential reads is impressive for a general-purpose processor, this is peanuts compared to GPUs. Let's consider the NVIDIA RTX 3090, a late 2020 high-end consumer GPU:
- Memory: 24 GB GDDR6X
- Memory Interface Width: 384-bit
- Memory Speed (Effective): 19.5 Gbps (Gigabits per second)
This gives a theoretical maximum memory bandwidth of 19.5 Gbps per pin × 384 pins / 8 bits per byte = 936 GB/s. That's roughly 18.7x the CPU.
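As a quick sanity check of the arithmetic (a throwaway calculation of my own, not from the original tutorial), both peak-bandwidth figures fit in a few lines of Julia:

# CPU: dual-channel DDR4-3200, 64-bit (8-byte) data path per channel
cpu_bw = 3200e6 * 2 * 8 / 1e9    # ≈ 51.2 GB/s
# GPU: RTX 3090, 384-bit bus at 19.5 Gbps per pin
gpu_bw = 19.5e9 * 384 / 8 / 1e9  # = 936 GB/s
gpu_bw / cpu_bw                  # ≈ 18.3 (≈ 18.7x against the rounded 50 GB/s figure)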
Nvidia GPU programming in Julia: the Mandelbrot Set
The simplest class of algorithms to parallelize are called embarrassingly parallel: those where we are literally doing isolated computations that do not depend on each other. They are a good place to start to learn about splitting computation apart in Julia.
The Mandelbrot set \(\mathcal M\subseteq\mathbb{C}\) is the set of numbers \(c\in\mathbb{C}\) (in the complex plane) where iterating the function \(f_c(z)=z^2+c\) does not diverge to infinity (where we start iterating with \(z=0\)). Clearly for example \(0+0i\in\mathcal M\) but \(4+0i\not\in\mathcal M\).
More formally we define the Mandelbrot set \(\mathcal M\) as follows. Fix \(c\in\mathbb{C}\), define \(z_0=0\) and set \(z_{k+1}=f_c(z_k)=z_k^2+c\). Then \(c\in\mathcal M\) if and only if \(\limsup_{k\to\infty}|z_k|<\infty\).
For example, at \(c=-1+0i\), we have the iterates
- \(z_0=0\),
- \(z_1=0^2-1=-1\),
- \(z_2=(-1)^2-1=0\),
- we hit a loop, and so clearly \(z_k\) is bounded. Therefore \(-1\in\mathcal{M}\).
On the other hand, at \(c=3i\), we have the iterates
- \(z_0=0\),
- \(z_1=0^2+3i=3i\),
- \(z_2=(3i)^2+3i=3i-9\),
- \(z_3=(3i-9)^2+3i=72-51i\),
- now it's pretty clear that the values will diverge, so \(3i\not\in\mathcal{M}\).
Observe that if \(|z_k|>2\) at any point, then we are guaranteed that the iterates go to infinity: for the \(c\) values we care about (\(|c|\le 2\); anything with \(|c|>2\) escapes immediately), \(|z|>2\) gives \(|z^2+c|\ge|z|^2-|c|\ge|z|^2-2>|z|\), and the gap above 2 keeps growing at every step. This helps us simplify our code to compute \(\mathcal M\): we only need to check whether an iterate ever leaves the disk of radius 2.
We can write a Julia function to compute this as follows:
function mandelbrot_iter(c::Complex, max_iter::Int=50)::Int
# initializes z to the zero element of its type, in this case 0+0i
z = zero(c)
for i in 1:max_iter
z = z*z + c
# abs2 computes the square of the absolute value, saves us a tiny bit by not computing the square root
if abs2(z) > 4
return i
end
end
return max_iter
end

Here we are outputting the iteration at which the sequence first has \(|z|>2\), as this gives a rough measure of how quickly a point diverges, which lets us make pretty images.
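As a quick sanity check (these calls are my own illustration, not from the original tutorial), the function reproduces the two worked examples from above:

mandelbrot_iter(complex(-1.0, 0.0))  # 50 (= default max_iter): the iterates stay bounded, so -1 is in the set
mandelbrot_iter(complex(0.0, 3.0))   # 1: |z_1| = 3 > 2 already, so 3i is not in the set
mandelbrot_iter(complex(0.5, 0.0))   # 5: escapes after a few iterations (0.5 is just outside the set on the real axis)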
Let's now write a function to run it over a grid:
function compute_mandelbrot(
# real (x-axis) values
real_range,
# imaginary (y-axis) values
imag_range;
# max iterations per run
max_iter::Int
)::Matrix{Int}
# produce the grid of c values
c_grid = complex.(real_range', imag_range)
# apply mandelbrot_iter to each value of c_grid
return mandelbrot_iter.(c_grid, max_iter)
end;
# let's define a box that contains most of the set
width, height = 1500, 1500
real_range_full = LinRange(-2., 0.5, width)
imag_range_full = LinRange(-1.25, 1.25, height);
# and run to see see how fast we go:
using BenchmarkTools
@time my_mb = compute_mandelbrot(real_range_full, imag_range_full, max_iter=100);

On my machine, this takes about 0.155540 seconds.
Nice picture:
using Plots
heatmap(my_mb, size=(500, 500), axis=false, aspect_ratio=:equal, colorbar_title="Iterations until provable divergence")
Let's zoom in on the top part with some more iterations:
real_range_top = LinRange(-.3, .1, width)
imag_range_top = LinRange(.6, 1, height)
@time my_mb_top = compute_mandelbrot(real_range_top, imag_range_top, max_iter=250);
heatmap(my_mb_top, size=(500, 500), axis=false, aspect_ratio=:equal, colorbar_title="Iterations until provable divergence")
CUDA.jl
Julia makes it extremely simple to get code running on a GPU with CUDA.jl. Let's see an example:
# this might take a moment
using CUDA
# CUDA.@time is needed to synchronize GPU data to make sure we time everything
CUDA.@time cuda_mb = compute_mandelbrot(CuArray(real_range_full), CuArray(imag_range_full), max_iter=100);
heatmap(cuda_mb, size=(500, 500), axis=false, aspect_ratio=:equal, colorbar_title="Iterations until provable divergence")
On my machine, this takes about 0.005708 seconds, 27.2x faster than the single-threaded CPU version.
And the top part:
# the first run is normally slow due to compiling, this run should be much much faster
CUDA.@time cuda_mb_top = compute_mandelbrot(CuArray(real_range_top), CuArray(imag_range_top), max_iter=250);
heatmap(cuda_mb_top, size=(500, 500), axis=false, aspect_ratio=:equal, colorbar_title="Iterations until provable divergence")
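A note on timing (an aside of mine, not part of the original tutorial): @time includes compilation on the first call, and GPU operations are asynchronous, so for more trustworthy numbers you can combine BenchmarkTools (loaded earlier) with CUDA.@sync, roughly like this:

# hedged sketch: benchmark the GPU version with explicit synchronization
real_d = CuArray(real_range_full)
imag_d = CuArray(imag_range_full)
@btime CUDA.@sync compute_mandelbrot($real_d, $imag_d, max_iter=100);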
Recap of Mandelbrot timings
| Execution mode | Time (msec) | Speedup |
|---|---|---|
| CPU (single thread) | 155.54 | 1x |
| CPU (multithreaded) | 46.68 | 3.3x |
| GPU (CUDA) | 5.71 | 27.2x |
The multithreaded timing is included above for comparison; see the AIMS tutorial for the full multithreaded code. A rough sketch of the idea follows.
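Here is a minimal multithreaded sketch of my own (not necessarily what the AIMS tutorial does), splitting the columns of the grid across CPU threads; it assumes Julia was started with multiple threads (e.g. julia --threads=auto):

function compute_mandelbrot_threaded(real_range, imag_range; max_iter::Int)::Matrix{Int}
    # same orientation as compute_mandelbrot: rows follow imag_range, columns follow real_range
    out = Matrix{Int}(undef, length(imag_range), length(real_range))
    Threads.@threads for j in eachindex(real_range)
        for i in eachindex(imag_range)
            out[i, j] = mandelbrot_iter(complex(real_range[j], imag_range[i]), max_iter)
        end
    end
    return out
end

@time my_mb_mt = compute_mandelbrot_threaded(real_range_full, imag_range_full, max_iter=100);

Because each pixel is independent, the work splits cleanly across threads, which is exactly what makes this problem embarrassingly parallel.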
Kernel Functions: The Building Blocks of GPU Computation
A "kernel" is a function that runs on the GPU. When you launch a kernel, thousands of threads are created, and each thread executes the kernel function on its own piece of data.
# First, ensure CUDA is functional and you have a device
if !CUDA.functional()
error("CUDA is not functional! Do you have an NVIDIA GPU and proper drivers?")
end
CUDA.device!(0) # Select GPU 0 (if you have multiple)

To write your own kernel, you define a regular Julia function, but it needs to be compatible with GPU execution. You then launch it using the @cuda macro. Inside a kernel, you access special functions to determine the current thread's ID and block ID within the grid:
- blockIdx().x, blockIdx().y, blockIdx().z: The 1D, 2D, or 3D index of the current thread block.
- blockDim().x, blockDim().y, blockDim().z: The dimensions (number of threads) of the current thread block.
- threadIdx().x, threadIdx().y, threadIdx().z: The 1D, 2D, or 3D index of the current thread within its block.
- gridDim().x, gridDim().y, gridDim().z: The dimensions (number of blocks) of the grid of thread blocks.
From these, you calculate a global unique index for each thread:
# A simple kernel that adds two arrays element-wise
function add_kernel!(C, A, B)
# Calculate global linear index for the current thread
# This formula works for 1D arrays or flat 2D/3D arrays
idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
# Check if the index is within the bounds of the array
# This is crucial as you often launch more threads than elements
if idx <= length(A)
@inbounds C[idx] = A[idx] + B[idx] # @inbounds for performance, assumes bounds checking is handled
end
return
end
# Host-side function to prepare data and launch the kernel
function gpu_add_arrays(A_host, B_host)
# Move input data to the GPU
A_d = CuArray(A_host)
B_d = CuArray(B_host)
# Allocate memory for the result on the GPU
C_d = similar(A_d)
N = length(A_host)
# Determine launch configuration:
# threads_per_block: How many threads in each block (typically 128, 256, 512, 1024)
# blocks_per_grid: How many blocks needed to cover all N elements
threads_per_block = 256
blocks_per_grid = ceil(Int, N / threads_per_block)
# Launch the kernel!
@cuda threads=threads_per_block blocks=blocks_per_grid add_kernel!(C_d, A_d, B_d)
# Synchronize the GPU (wait for all kernel computations to finish)
# This happens implicitly when you transfer data back or if you need results immediately.
# explicit CUDA.synchronize() can be used for timing.
# Move the result back to the CPU (if needed for further CPU processing)
return Array(C_d)
end
N_gpu_example = 10^7
A_h = rand(Float32, N_gpu_example)
B_h = rand(Float32, N_gpu_example)
println("Starting custom GPU kernel array addition...")
@time C_h_gpu_custom = gpu_add_arrays(A_h, B_h)
println("Custom GPU kernel array addition finished.")Starting custom GPU kernel array addition...
0.412340 seconds (667.84 k allocations: 72.529 MiB, 2.76% gc time, 84.90% compilation time)
Custom GPU kernel array addition finished.
This direct kernel programming gives you maximum control but can be complex.
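To tie this back to the Mandelbrot example, here is a hedged sketch of my own (not code from the original tutorial) of what a hand-written 2D Mandelbrot kernel could look like. It reuses width, height and the ranges defined earlier and assumes Float32 precision is acceptable:

function mandelbrot_kernel!(out, real_range, imag_range, max_iter)
    # one thread per pixel: rows follow the imaginary axis, columns the real axis
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size(out, 1) && j <= size(out, 2)
        c = ComplexF32(real_range[j], imag_range[i])
        z = zero(c)
        iters = max_iter
        for k in 1:max_iter
            z = z * z + c
            if abs2(z) > 4f0
                iters = k
                break
            end
        end
        @inbounds out[i, j] = iters
    end
    return
end

out_d = CUDA.zeros(Int, height, width)
real_d32 = CuArray(Float32.(real_range_full))
imag_d32 = CuArray(Float32.(imag_range_full))
@cuda threads=(16, 16) blocks=(cld(height, 16), cld(width, 16)) mandelbrot_kernel!(out_d, real_d32, imag_d32, 100)

In practice, the broadcast version from the Mandelbrot section does the same job in far fewer lines, which is exactly why CUDA.jl's array abstractions are usually the right first choice.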
Broadcast Fusion: The Magic of Element-wise Operations
When you perform element-wise operations on CuArrays using Julia's broadcasting syntax (e.g., .+, .*, or the dot syntax @.), CUDA.jl automatically compiles and launches an optimized GPU kernel for you. This is incredibly efficient because it can fuse multiple operations into a single kernel, minimizing memory access and launch overheads.
using CUDA
N_implicit = 10^7
X_h = rand(Float32, N_implicit)
Y_h = rand(Float32, N_implicit)
# Move to device once
X_d = CuArray(X_h)
Y_d = CuArray(Y_h)
println("\nStarting GPU implicit kernel operations...")
# Complex broadcast operation - Julia automatically generates an efficient kernel
@time Z_d = @. X_d * sin(Y_d^2) + sqrt(X_d)
println("GPU implicit kernel operations finished.")
# Julia's Base functions often have GPU implementations that work seamlessly
@time S_d = CUDA.sum(Z_d) # Optimized GPU reduction
println("GPU sum finished.")
# If you need the result back on the CPU:
Z_h_implicit = Array(Z_d)
S_h_implicit = S_d[] # sum over a CuArray already returns a plain host scalar, so this is effectively a no-op
# Compare with CPU version for timing and correctness
println("\nStarting CPU operations for comparison...")
@time Z_h_cpu = @. X_h * sin(Y_h^2) + sqrt(X_h)
println("CPU element-wise finished.")
@time S_h_cpu = sum(Z_h_cpu)
println("CPU sum finished.")
println("Difference in sum: $(abs(S_h_implicit - S_h_cpu))")You'll often find that for typical data processing tasks, this implicit kernel generation through broadcasting is all you need, and it performs remarkably well.
Performance Considerations and Best Practices
While GPUs are powerful, they are not a magic bullet. To get the most out of them, consider these points:
- Data Transfer Overhead: Copying data between CPU and GPU is expensive. Avoid frequent transfers. Try to keep data on the GPU for as long as possible, processing it in stages there (see the short illustration after this list).
- Kernel Launch Overhead: There's a small overhead associated with launching each kernel. For very small computations, this overhead can sometimes negate the GPU's advantage. Try to combine multiple small operations into a single larger kernel (which Julia's broadcast fusion often does for you).
- Memory Management: GPU memory is a finite resource. Be mindful of large allocations.
CUDA.jlgenerally manages memory well, but for very large problems, you might need to reuse memory or process data in chunks. - Algorithm Choice: Not all algorithms are "GPU-friendly." Algorithms with frequent, unpredictable memory access patterns or high thread divergence (where threads in a warp take different execution paths) will perform poorly. Algorithms that operate uniformly on large datasets (e.g., matrix multiplication, image filters, finite difference methods) are ideal.
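To make the first two points concrete, here is a small illustration of my own: the fused broadcast keeps all the work and data on the device, while the second version bounces through host memory between steps and pays for two extra transfers:

X_d = CUDA.rand(Float32, 10^7)

# good: one fused broadcast kernel, the data never leaves the GPU
CUDA.@time good_d = @. 2f0 * X_d + sin(X_d)

# bad: a round trip through host memory, plus the work runs on the CPU
CUDA.@time begin
    X_h = Array(X_d)                 # device -> host copy
    tmp = @. 2f0 * X_h + sin(X_h)    # computed on the CPU
    bad_d = CuArray(tmp)             # host -> device copy
end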
Profiling CUDA
Once you have your CUDA code running, you'll often want to optimize its performance. Profiling helps you identify bottlenecks in your code. Julia's CUDA.jl package integrates well with NVIDIA's profiling tools.
Basic GPU Profiling
For a quick and easy way to see the time spent on individual kernel launches within your Julia code, you can use the built-in CUDA.@profile macro. This provides a high-level overview directly in your Julia REPL or output.
using CUDA
function saxpy_kernel!(Y, A, X)
idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
if idx <= length(Y)
@inbounds Y[idx] = A * X[idx] + Y[idx]
end
return
end
N = 10_000_000
A_scalar = Float32(2.0)
X = CUDA.fill(Float32(1.0), N)
Y = CUDA.fill(Float32(3.0), N)
# Profile the kernel call
CUDA.@profile begin
@cuda threads=256 blocks=cld(N, 256) saxpy_kernel!(Y, A_scalar, X)
end
The resulting report tells you the execution time of the kernel itself on the GPU.
For deeper profiling you will want to use NVIDIA Nsight Systems.
References
Buck, Ian, Theresa Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. 2004. "Brook for GPUs: Stream Computing on Graphics Hardware." ACM Transactions on Graphics (TOG) 23 (3): 777–86.
Cook, Debra, and Angelos D Keromytis. 2006. Cryptographics: Exploiting Graphics Cards for Security. Vol. 20. Springer Science & Business Media.
Galoppo, Nico, Naga K Govindaraju, Michael Henson, and Dinesh Manocha. 2005. "LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware." In SC'05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, 3–3. IEEE.
Harris, Mark J, William V Baxter, Thorsten Scheuermann, and Anselmo Lastra. 2003. "Simulation of Cloud Dynamics on Graphics Hardware." In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 92–101.
Macedonia, Michael. 2003. "The GPU Enters Computing's Mainstream." Computer 36 (10): 106–8.
Rumpf, Martin, and Robert Strzodka. 2001. Using Graphics Cards for Quantized FEM Computations. Citeseer.
Sutter, Herb et al. 2005. "The Free Lunch Is over: A Fundamental Turn Toward Concurrency in Software." Dr. Dobb's Journal 30 (3): 202–10.
Vuduc, Richard, and Jee Choi. 2013. "A Brief History and Introduction to GPGPU." Modern Accelerator Technologies for Geographic Information Science, 9–23.
¹ It's not entirely clear how widespread the use of GPUs was for cracking cryptosystems before the introduction of CUDA (after which it exploded). There were certainly examples and it was a budding technique in the community, but I couldn't find concrete and widely used examples to cite.