Introduction to GPU programming
Instead of going over theory first, we're going to dive straight into writing GPU code and explain the concepts as we go. This is a top-down learning approach where you get context for "what" happens, and then dive into the "why".
Your first kernel
In the context of GPU programming, a kernel is a program that runs on each thread that you launch. There are various ways to communicate between threads; this is just a simple example to demonstrate what's happening:
from gpu import thread_idx
fn printing_kernel():
print("GPU thread: [", thread_idx.x, thread_idx.y, thread_idx.z, "]")We can pass this function as a parameter to enqueue_function to compile it for your attached GPU and launch it. First we need to get the DeviceContext for your GPU:
from gpu.host import DeviceContext
var ctx = DeviceContext()

Now that we have the DeviceContext, we can compile and launch the kernel:
ctx.enqueue_function[printing_kernel](grid_dim=1, block_dim=4)
# Wait for the kernel to finish executing before handing back to CPU
ctx.synchronize()

GPU thread: [ 0 0 0 ]
GPU thread: [ 1 0 0 ]
GPU thread: [ 2 0 0 ]
GPU thread: [ 3 0 0 ]

Threads
Because we passed block_dim=4, we launched 4 threads on the x dimension. The kernel code we wrote is executed on each thread, and the printing can be out of order depending on which thread reaches that print call first.
Now let's add the y and z dimensions with block_dim=(2, 2, 2):
ctx.enqueue_function[printing_kernel](grid_dim=1, block_dim=(2, 2, 2))
ctx.synchronize()

GPU thread: [ 0 0 0 ]
GPU thread: [ 1 0 0 ]
GPU thread: [ 0 1 0 ]
GPU thread: [ 1 1 0 ]
GPU thread: [ 0 0 1 ]
GPU thread: [ 1 0 1 ]
GPU thread: [ 0 1 1 ]
GPU thread: [ 1 1 1 ]

We're now launching 8 (2x2x2) threads in total.
Host vs Device
If you see the word enqueue in a method call, it means the call is asynchronous: you have to await its completion where appropriate using ctx.synchronize(), which ensures the GPU has finished its work. For example, if you were to print from the CPU without first calling synchronize, the CPU would likely print before the GPU has completed its work:
ctx.enqueue_function[printing_kernel](grid_dim=1, block_dim=4)
print("This will finish before the GPU has completed it's work")
ctx.synchronize()

This will finish before the GPU has completed its work
GPU thread: [ 0 0 0 ]
GPU thread: [ 1 0 0 ]
GPU thread: [ 2 0 0 ]
GPU thread: [ 3 0 0 ]

You'll see the word host, which refers to the CPU that coordinates the kernel launches; device refers to the accelerator, in this case a GPU. In the example above we failed to call synchronize before printing from the host, and because the device is slightly slower to finish its work, its output appears after the host output. Let's fix the placement of synchronize:
ctx.enqueue_function[printing_kernel](grid_dim=1, block_dim=4)
ctx.synchronize()
print("This will finish after the GPU has completed it's work")GPU thread: [ 0 0 0 ]
GPU thread: [ 1 0 0 ]
GPU thread: [ 2 0 0 ]
GPU thread: [ 3 0 0 ]
This will finish after the GPU has completed its work

Blocks
Let's set up a new kernel to demonstrate how blocks work:
from gpu import block_idx

fn block_kernel():
    print(
        "block: [",
        block_idx.x,
        block_idx.y,
        block_idx.z,
        "]",
        "thread: [",
        thread_idx.x,
        thread_idx.y,
        thread_idx.z,
        "]"
    )
ctx.enqueue_function[block_kernel](grid_dim=(2, 2), block_dim=2)
ctx.synchronize()

block: [ 1 1 0 ] thread: [ 0 0 0 ]
block: [ 1 1 0 ] thread: [ 1 0 0 ]
block: [ 0 0 0 ] thread: [ 0 0 0 ]
block: [ 0 0 0 ] thread: [ 1 0 0 ]
block: [ 1 0 0 ] thread: [ 0 0 0 ]
block: [ 1 0 0 ] thread: [ 1 0 0 ]
block: [ 0 1 0 ] thread: [ 0 0 0 ]
block: [ 0 1 0 ] thread: [ 1 0 0 ]

We're still launching 8 (2x2x2) threads in total, but now they're split across 4 blocks, each with 2 threads. In GPU programming this grouping of blocks and threads is important: each block has its own fast shared memory (on-chip SRAM) which allows its threads to communicate. Threads within a warp can also communicate through registers; we'll cover this concept when we get to warps. For now the important information to internalize is:
- grid_dim defines how many blocks are launched
- block_dim defines how many threads are launched in each block
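These two combine to give each thread a unique position across the whole grid. As a quick illustration (a minimal sketch; it assumes block_dim can be imported from the gpu package alongside thread_idx and block_idx), you can compute a flat global index on the x dimension like this:

from gpu import block_dim, block_idx, thread_idx

fn global_index_kernel():
    # Flat position of this thread across all blocks on the x dimension
    var global_x = block_idx.x * block_dim.x + thread_idx.x
    print("block:", block_idx.x, "thread:", thread_idx.x, "global x:", global_x)

ctx.enqueue_function[global_index_kernel](grid_dim=2, block_dim=4)
ctx.synchronize()

Launched with grid_dim=2 and block_dim=4, this prints the global x values 0 through 7, one per thread (in whatever order the threads reach the print call).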
Tiles
The x, y, z dimensions of blocks are important for splitting up large jobs into tiles so each thread can work on its own subset of the problem. Let's visualize how a contiguous array of data can be split up into tiles. If we have an array of UInt32 (unsigned 32-bit integer) values like:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ]

We could split the work up between threads and blocks. We're only going to use the x dimension for threads and blocks to get started:
Thread | 0 1 2 3
-------------------------
block 0 | [ 0 1 2 3 ]
block 1 | [ 4 5 6 7 ]
block 2 | [ 8 9 10 11 ]
block 3 | [ 12 13 14 15 ]

If you had a much larger data array you could further split it up into tiles, e.g. a tile with widths [2, 2] at index (0, 0) would be:
[ 0 1 ]
[ 4 5 ]

And index (1, 0) would be:
[ 2 3 ]
[ 6 7 ]

This is where you'd introduce the y dimension. We'll cover this later, but for now we're going to focus on how blocks and threads interact.
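To make the tile indexing concrete, here's a small hypothetical helper (not part of the example code below) that maps a (tile, local) coordinate to a flat index into the 16-element array above, assuming 2x2 tiles over a 4-value-wide row-major array:

fn flat_index(tile_x: Int, tile_y: Int, local_x: Int, local_y: Int) -> Int:
    alias tile_size = 2  # tile widths [2, 2]
    alias row_width = 4  # the full array is 4 values wide
    var row = tile_y * tile_size + local_y
    var col = tile_x * tile_size + local_x
    return row * row_width + col

For example, flat_index(1, 0, 0, 0) == 2 and flat_index(1, 0, 1, 1) == 7, matching the tile at index (1, 0) shown above.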
Host Buffer
First we'll initialize that contiguous array on the CPU and fill in the values:
from math import iota
alias dtype = DType.uint32
alias blocks = 4
alias threads = 4
# One value per thread
alias elements = blocks * threads
# Allocate data on the host and return a buffer which owns that data
var in_host = ctx.enqueue_create_host_buffer[dtype](elements)
# Fill the buffer with values from 0 to 15
iota(in_host.unsafe_ptr(), elements)
# Load all the data as a SIMD vector of width 16 and print it
print(in_host.unsafe_ptr().load[width=elements](0))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

In the last call, where we print using load[width=elements], we're loading a SIMD vector of width 16. SIMD means Single Instruction, Multiple Data: in Mojo all the core numerical dtypes are built around SIMD, allowing you to e.g. multiply a vector in a single instruction using special CPU registers, instead of 16 separate instructions, one for each element.
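As a small CPU-side illustration of that idea (separate from the GPU example), a SIMD value lets you apply one operation to every lane at once:

# Four uint32 lanes held in a single SIMD value
var vec = SIMD[DType.uint32, 4](0, 1, 2, 3)
# One multiply applied to all 4 lanes: prints [0, 2, 4, 6]
print(vec * 2)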
Device Buffer
We now have a host buffer that we can copy to the GPU:
# Allocate a buffer for the GPU
var in_dev = ctx.enqueue_create_buffer[dtype](in_bytes)
# Copy the data from the CPU to the GPU buffer
ctx.enqueue_copy_to_device(in_dev, in_host.unsafe_ptr())

Creating the GPU buffer allocates global memory, which can be accessed from any block and thread. Global memory is relatively slow compared to shared memory, which is shared between the threads of a single block; more on that later.
Tensor indexing from threads
Now that we have the data set up, we can wrap the data in a LayoutTensor so that we can reason about how to index into the array, allowing each thread to get its corresponding value:
from layout import Layout, LayoutTensor
# Row major: elements are stored sequentially in memory [0, 1, 2, 3, 4, 5, ...]
# Column major: used in some GPU optimizations, stored as [0, 4, 8, 12, 1, ...]
alias layout = Layout.row_major(blocks, threads)
var tensor = LayoutTensor[dtype, layout](in_dev.unsafe_ptr())

This LayoutTensor is a mutable view over the data stored inside in_dev; it doesn't own its memory, but it allows us to index using block and thread ids. Initially we'll just print the values to confirm it's indexing as we expect:
fn print_values_kernel(tensor: LayoutTensor[dtype, layout]):
    var bid = block_idx.x
    var tid = thread_idx.x
    print("block:", bid, "thread:", tid, "val:", tensor[bid, tid])
ctx.enqueue_function[print_values_kernel](
    tensor, grid_dim=blocks, block_dim=threads,
)
ctx.synchronize()

block: 3 thread: 0 val: 12
block: 3 thread: 1 val: 13
block: 3 thread: 2 val: 14
block: 3 thread: 3 val: 15
block: 0 thread: 0 val: 0
block: 0 thread: 1 val: 1
block: 0 thread: 2 val: 2
block: 0 thread: 3 val: 3
block: 2 thread: 0 val: 8
block: 2 thread: 1 val: 9
block: 2 thread: 2 val: 10
block: 2 thread: 3 val: 11
block: 1 thread: 0 val: 4
block: 1 thread: 1 val: 5
block: 1 thread: 2 val: 6
block: 1 thread: 3 val: 7

As in the visualization above, each block/thread is getting the corresponding value we expect. You can see block: 3 thread: 3 has the last value, 15.
Multiply Kernel
Now that we've verified we're getting the correct values when indexing, let's launch a kernel to multiply each value:
fn multiply_kernel[multiplier: Int](tensor: LayoutTensor[dtype, layout]):
    tensor[block_idx.x, thread_idx.x] *= multiplier
ctx.enqueue_function[multiply_kernel[2]](
    tensor,
    grid_dim=blocks,
    block_dim=threads,
)
ctx.synchronize()
# Copy data back to host and print as 2D array
ctx.enqueue_copy_from_device[dtype](in_host.unsafe_ptr(), in_dev)
# Wait for the copy to finish before reading the data on the host
ctx.synchronize()
var host_tensor = LayoutTensor[dtype, layout](in_host.unsafe_ptr())
print(host_tensor)
# Keep the host buffer alive until we've finished printing from it
_ = in_host.unsafe_ptr()

0 2 4 6
8 10 12 14
16 18 20 22
24 26 28 30

Congratulations! You've successfully run a kernel that modifies values on your GPU, and printed the result on your CPU. You can see above that each thread multiplied a single value by 2 in parallel on the GPU, and we copied the result back to the CPU.
Sum Reduce Output Tensor
We're going to set up a new buffer which will hold the reduced values: one sum per block, computed from all of that block's threads:
Output: [ block[0] block[1] block[2] block[3] ]

There's no need for a layout tensor here as it's a 1D array:
# Allocate to global memory
var out_dev = ctx.enqueue_create_buffer[dtype](blocks)
# Also create a buffer for the host
var out_host = ctx.enqueue_create_host_buffer[dtype](blocks)
ctx.synchronize()

The problem here is that we can't have all the threads in a block summing their values into the same index of the output buffer, as that would introduce race conditions. We're going to introduce new concepts to deal with this.
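To see why, here's a minimal sketch of the naive (incorrect) version, reusing the tensor from above and the DeviceBuffer type imported in the next snippet; every thread in a block does an unsynchronized read-modify-write on the same output element, so the final sum is undefined:

fn racy_sum_kernel(tensor: LayoutTensor[dtype, layout], out_buffer: DeviceBuffer[dtype]):
    # All 4 threads in the block update out_buffer[block_idx.x] at the same time: a data race
    out_buffer[block_idx.x] += tensor[block_idx.x, thread_idx.x][0]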
Shared memory
This is not an optimal solution, but it's a good way to introduce shared memory with a simple example; we'll cover better solutions in the next sections. You can allocate memory for each block when launching a kernel. This is called shared memory, and every thread within a block can communicate through it.
from gpu import barrier, lane_id
from gpu.memory import AddressSpace, external_memory
from sys import sizeof
from gpu.host import DeviceBuffer

fn sum_reduce_kernel(tensor: LayoutTensor[dtype, layout], out_buffer: DeviceBuffer[dtype]):
    # Get a pointer to this block's shared memory
    var shared = external_memory[
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
        alignment = sizeof[dtype](),
    ]()
    # Place the corresponding value into shared memory
    shared[thread_idx.x] = tensor[block_idx.x, thread_idx.x][0]
    # Wait for all the threads to finish loading their values into shared memory
    barrier()
    # If this is the first thread, sum and write the result to global memory
    if thread_idx.x == 0:
        for i in range(threads):
            out_buffer[block_idx.x] += shared[i]
ctx.enqueue_function[sum_reduce_kernel](
    tensor,
    out_dev,
    grid_dim=blocks,
    block_dim=threads,
    # One value of shared memory per thread in the block
    shared_mem_bytes=threads * sizeof[dtype](),
)
ctx.synchronize()
# Copy the data back to the host and print out the SIMD vector
ctx.enqueue_copy_from_device(out_host.unsafe_ptr(), out_dev)
ctx.synchronize()
print(out_host.unsafe_ptr().load[width=blocks]())

[6, 22, 38, 54]

For our first block/tile we summed the values:
sum([ 0 1 2 3 ]) == 6

And the reduction resulted in the output having the sum of 6 in the first position. Every tile/block has been reduced to:
[ 6 22 38 54 ]

Thread level SIMD
We could skip using shared memory altogether by using SIMD instructions; this is a faster option to consider if it suits your problem. Each thread has access to SIMD registers which can perform operations, such as reductions, on a vector. Here we'll launch one thread per block, load the 4 corresponding values from that block as a SIMD vector, and sum them together in a single operation:
fn simd_reduce_kernel(tensor: LayoutTensor[dtype, layout], out_buffer: DeviceBuffer[dtype]):
    out_buffer[block_idx.x] = tensor.load[4](block_idx.x, 0).reduce_add()
ctx.enqueue_function[simd_reduce_kernel](
    tensor,
    out_dev,
    grid_dim=blocks,
    block_dim=1,  # one thread per block
)
ctx.synchronize()
# Ensure we have the same result
ctx.enqueue_copy_from_device(out_host.unsafe_ptr(), out_dev)
ctx.synchronize()
print(out_host.unsafe_ptr().load[width=blocks]())

[6, 22, 38, 54]

This is much cleaner and faster: instead of 4 threads writing to shared memory, we're using 1 thread per block to do a single SIMD reduction. Shared memory has many uses, but as you learn more tools you'll be able to work out which approach is the most performant for your particular problem.
Warps
A warp is a group of threads (32 on NVIDIA, 64 on AMD) within a block. Threads within the same warp can synchronize their execution and exchange values directly through registers, performing SIMD-style operations in lockstep. We have only 4 threads within each block, well under the 32-thread warp size. If this wasn't the case you'd have to do two reductions: one from each warp into shared memory, then another from shared memory into the output buffer.
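For reference, here's a hedged sketch of what that two-stage version could look like, reusing the shared-memory tools from above and the warp.sum operation introduced below; the 128-thread block size, the warp size of 32, and the correspondingly larger layout are all assumptions for illustration, not part of this guide's example:

fn two_stage_reduce_kernel(tensor: LayoutTensor[dtype, layout], out_buffer: DeviceBuffer[dtype]):
    alias warp_size = 32        # assumption: 32 on NVIDIA, 64 on AMD
    alias warps_per_block = 4   # hypothetical: 128 threads per block
    # Shared memory to hold one partial sum per warp
    var partials = external_memory[
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
        alignment = sizeof[dtype](),
    ]()
    var value = tensor.load[1](block_idx.x, thread_idx.x)
    # First reduction: each warp sums its own values through registers
    var warp_total = warp.sum(value)
    # Lane 0 of each warp writes that warp's partial sum to shared memory
    if thread_idx.x % warp_size == 0:
        partials[thread_idx.x // warp_size] = warp_total
    barrier()
    # Second reduction: thread 0 combines the per-warp partial sums
    if thread_idx.x == 0:
        var total = Scalar[dtype](0)
        for i in range(warps_per_block):
            total += partials[i]
        out_buffer[block_idx.x] = total

Launching it would also need shared_mem_bytes=warps_per_block * sizeof[dtype]() so there's one shared slot per warp.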
Let's write your first warp reduction kernel:
from gpu import warp

fn warp_reduce_kernel(tensor: LayoutTensor[dtype, layout], out_buffer: DeviceBuffer[dtype]):
    # Each thread loads its corresponding value from the tensor
    var value = tensor.load[1](block_idx.x, thread_idx.x)
    # Each thread gets the value from one thread higher, summing them as they go
    var result = warp.sum(value)
    # Wait for all the threads in the warp to finish reducing
    barrier()
    # Thread 0 has the reduced sum of the values from all the other threads
    if thread_idx.x == 0:
        out_buffer[block_idx.x] = result
ctx.enqueue_function[warp_reduce_kernel](
    tensor,
    out_dev,
    grid_dim=blocks,
    block_dim=threads,
)
ctx.synchronize()
# Ensure we have the same result
ctx.enqueue_copy_from_device(out_host.unsafe_ptr(), out_dev)
ctx.synchronize()
print(out_host.unsafe_ptr().load[width=blocks]())

[6, 22, 38, 54]

It might not be immediately clear what's happening with the warp.sum function, so let's break it down with print statements:
fn warp_print_kernel(tensor: LayoutTensor[dtype, layout]):
    var value = tensor.load[1](block_idx.x, thread_idx.x)
    # Get the value from the thread 1 position higher
    var next_value = warp.shuffle_down(value, offset=1)
    if block_idx.x == 0:
        print("thread:", thread_idx.x, "value:", value, "next_value:", next_value)
ctx.enqueue_function[warp_print_kernel](
    tensor,
    grid_dim=blocks,
    block_dim=threads,
)
ctx.synchronize()

thread: 0 value: 0 next_value: 1
thread: 1 value: 1 next_value: 2
thread: 2 value: 2 next_value: 3
thread: 3 value: 3 next_value: 0

In the print statements above, you can see that each thread gets the value from the thread 1 position higher in the block. When we get to the 4th thread there is no higher thread, so it returns 0. Instead of using warp.sum we can write our own custom reduction that sums the values step by step:
@parameter
fn custom_reduce[
    type: DType, width: Int
](lhs: SIMD[type, width], rhs: SIMD[type, width]) -> SIMD[type, width]:
    return lhs + rhs
fn custom_warp_reduce_kernel(tensor: LayoutTensor[dtype, layout], out_buffer: DeviceBuffer[dtype]):
    var value = tensor.load[1](block_idx.x, thread_idx.x)
    var result = warp.reduce[warp.shuffle_down, custom_reduce](value)
    barrier()
    if block_idx.x == 0:
        print("thread:", thread_idx.x, "value:", value, "result:", result)
    if thread_idx.x == 0:
        out_buffer[block_idx.x] = result
ctx.enqueue_function[custom_warp_reduce_kernel](
    tensor,
    out_dev,
    grid_dim=blocks,
    block_dim=threads,
)
# Check our new result
print("Block 0 reduction steps:")
ctx.enqueue_copy_from_device(out_host.unsafe_ptr(), out_dev)
ctx.synchronize()
print("\nAll blocks reduced output buffer:")
print(out_host.unsafe_ptr().load[width=blocks]())

Block 0 reduction steps:
thread: 0 value: 0 result: 6
thread: 1 value: 1 result: 6
thread: 2 value: 2 result: 5
thread: 3 value: 3 result: 3
All blocks reduced output buffer:
[6, 22, 38, 54]

You can see in the output that the first block had the values [ 0 1 2 3 ] and was reduced from top to bottom (shuffle down) in this way, where the result of one thread is passed to the next thread down:
thread 3: value=3 next_value=N/A result=3
thread 2: value=2 next_value=3 result=5
thread 1: value=1 next_value=5 result=6
thread 0: value=0 next_value=6 result=6

Exercises
Now that you've learnt many of the core primitives for GPU programming, here are some exercises to cement the knowledge. Feel free to go back and look at the examples.
- Create a host buffer for the input of dtype Int64, with 32 elements, and initialize it with sequentially ordered numbers. Copy the host buffer to the device.
- Create a tensor that wraps the device buffer, with the dimensions (8, 4).
- Create a host buffer and a device buffer for the output of dtype Int64, with 8 elements.
- Launch a GPU kernel with 8 blocks and 4 threads that takes every value and raises it to the power of 2 using x**2, then does a reduction using your preferred method to write to the output buffer.
- Copy the device buffer to the host buffer, and print it out on the CPU.