Introduction to GPU programming
Instead of going over theory first, we're going to dive straight into writing GPU code and explain the concepts as we go. This is a top-down learning approach where you get context for "what" happens, and then dive into the "why".
Your first kernel
In the context of GPU programming, a kernel is a program that runs on each thread that you launch. There are various ways to communicate between threads; this is just a simple example to demonstrate what's happening:
from gpu import thread_idx
fn printing_kernel():
print("GPU thread: [", thread_idx.x, thread_idx.y, thread_idx.z, "]")We can pass this function as a parameter to enqueue_function to compile it for your attached GPU and launch it. First we need to get the DeviceContext for your GPU:
from gpu.host import DeviceContext
var ctx = DeviceContext()

Now that we have the DeviceContext, we can compile and launch the kernel:
ctx.enqueue_function[printing_kernel](grid_dim=1, block_dim=4)
# Wait for the kernel to finish executing before handing back to CPU
ctx.synchronize()

GPU thread: [ 0 0 0 ]
GPU thread: [ 1 0 0 ]
GPU thread: [ 2 0 0 ]
GPU thread: [ 3 0 0 ]

Threads
Because we passed block_dim=4, we launched 4 threads on the x dimension. The kernel code we wrote is executed on each thread, and the printing can be out of order depending on which thread reaches that print call first.
Now let's add the y and z dimensions with block_dim=(2, 2, 2):
ctx.enqueue_function[printing_kernel](grid_dim=1, block_dim=(2, 2, 2))
ctx.synchronize()

GPU thread: [ 0 0 0 ]
GPU thread: [ 1 0 0 ]
GPU thread: [ 0 1 0 ]
GPU thread: [ 1 1 0 ]
GPU thread: [ 0 0 1 ]
GPU thread: [ 1 0 1 ]
GPU thread: [ 0 1 1 ]
GPU thread: [ 1 1 1 ]

We're now launching 8 (2x2x2) threads in total.
Host vs Device
If you see the word enqueue in a method call, it means the call is asynchronous: you have to await its completion where appropriate using ctx.synchronize(), which ensures the GPU has finished its work. For example, if you were to print from the CPU without first calling synchronize, the CPU would likely print before the GPU has completed its work:
ctx.enqueue_function[printing_kernel](grid_dim=1, block_dim=4)
print("This will finish before the GPU has completed it's work")
ctx.synchronize()

This will finish before the GPU has completed its work
GPU thread: [ 0 0 0 ]
GPU thread: [ 1 0 0 ]
GPU thread: [ 2 0 0 ]
GPU thread: [ 3 0 0 ]

You'll see the word host, which refers to the CPU that coordinates the kernel launches; device refers to the accelerator, in this case a GPU. In the example above we failed to call synchronize before printing from the host, and because the device is slightly slower to finish its work, its output appears after the host output. Let's fix the placement of synchronize:
ctx.enqueue_function[printing_kernel](grid_dim=1, block_dim=4)
ctx.synchronize()
print("This will finish after the GPU has completed it's work")GPU thread: [ 0 0 0 ]
GPU thread: [ 1 0 0 ]
GPU thread: [ 2 0 0 ]
GPU thread: [ 3 0 0 ]
This will finish after the GPU has completed its work

Blocks
Let's set up a new kernel to demonstrate how blocks work:
from gpu import block_idx

fn block_kernel():
    print(
        "block: [",
        block_idx.x,
        block_idx.y,
        block_idx.z,
        "]",
        "thread: [",
        thread_idx.x,
        thread_idx.y,
        thread_idx.z,
        "]"
    )
ctx.enqueue_function[block_kernel](grid_dim=(2, 2), block_dim=2)
ctx.synchronize()

block: [ 1 1 0 ] thread: [ 0 0 0 ]
block: [ 1 1 0 ] thread: [ 1 0 0 ]
block: [ 0 0 0 ] thread: [ 0 0 0 ]
block: [ 0 0 0 ] thread: [ 1 0 0 ]
block: [ 1 0 0 ] thread: [ 0 0 0 ]
block: [ 1 0 0 ] thread: [ 1 0 0 ]
block: [ 0 1 0 ] thread: [ 0 0 0 ]
block: [ 0 1 0 ] thread: [ 1 0 0 ]

We're still launching 8 (2x2x2) threads in total, but now they're split across 4 blocks, each with 2 threads. In GPU programming this grouping of blocks and threads is important: each block has its own fast shared memory (on-chip SRAM) which allows its threads to communicate. Threads within a warp can also communicate through registers; we'll cover this concept when we get to warps. For now the important information to internalize is:
- grid_dim defines how many blocks are launched
- block_dim defines how many threads are launched in each block
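These two combine to give each thread a unique position across the whole grid. As a quick illustration (a minimal sketch; it assumes block_dim can be imported from the gpu package alongside thread_idx and block_idx), you can compute a flat global index on the x dimension like this:

from gpu import block_dim, block_idx, thread_idx

fn global_index_kernel():
    # Flat position of this thread across all blocks on the x dimension
    var global_x = block_idx.x * block_dim.x + thread_idx.x
    print("block:", block_idx.x, "thread:", thread_idx.x, "global x:", global_x)

ctx.enqueue_function[global_index_kernel](grid_dim=2, block_dim=4)
ctx.synchronize()

Launched with grid_dim=2 and block_dim=4, this prints the global x values 0 through 7, one per thread (in whatever order the threads reach the print call).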
Tiles
The x, y, z dimensions of blocks are important for splitting up large jobs into tiles so each thread can work on its own subset of the problem. Let's visualize how a contiguous array of data can be split up into tiles. If we have an array of UInt32 (unsigned 32-bit integer) values like:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ]

We could split the work up between threads and blocks. We're only going to use the x dimension for threads and blocks to get started:
Thread | 0 1 2 3
-------------------------
block 0 | [ 0 1 2 3 ]
block 1 | [ 4 5 6 7 ]
block 2 | [ 8 9 10 11 ]
block 3 | [ 12 13 14 15 ]

If you had a much larger data array you could further split it up into tiles, e.g. a tile with widths [2, 2] at index (0, 0) would be:
[ 0 1 ]
[ 4 5 ]

And index (1, 0) would be:
[ 2 3 ]
[ 6 7 ]

This is where you'd introduce the y dimension. We'll cover this later, but for now we're going to focus on how blocks and threads interact.
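To make the tile indexing concrete, here's a small hypothetical helper (not part of the example code below) that maps a (tile, local) coordinate to a flat index into the 16-element array above, assuming 2x2 tiles over a 4-value-wide row-major array:

fn flat_index(tile_x: Int, tile_y: Int, local_x: Int, local_y: Int) -> Int:
    alias tile_size = 2  # tile widths [2, 2]
    alias row_width = 4  # the full array is 4 values wide
    var row = tile_y * tile_size + local_y
    var col = tile_x * tile_size + local_x
    return row * row_width + col

For example, flat_index(1, 0, 0, 0) == 2 and flat_index(1, 0, 1, 1) == 7, matching the tile at index (1, 0) shown above.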
Host Buffer
First we'll initialize that contiguous array on the CPU and fill in the values:
from math import iota
alias dtype = DType.uint32
alias blocks = 4
alias threads = 4
# One value per thread
alias elements = blocks * threads
# Allocate data on the host and return a buffer which owns that data
var in_host = ctx.enqueue_create_host_buffer[dtype](elements)
# Fill the buffer with values from 0 to 15
iota(in_host.unsafe_ptr(), elements)
# Load all the data as a SIMD vector of width 16 and print it
print(in_host.unsafe_ptr().load[width=elements](0))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

In the last call, where we print using load[width=elements], we're loading a SIMD vector of width 16. SIMD means Single Instruction, Multiple Data: in Mojo all the core numerical dtypes are built around SIMD, allowing you to e.g. multiply a vector in a single instruction using special CPU registers, instead of 16 separate instructions, one for each element.
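As a small CPU-side illustration of that idea (separate from the GPU example), a SIMD value lets you apply one operation to every lane at once:

# Four uint32 lanes held in a single SIMD value
var vec = SIMD[DType.uint32, 4](0, 1, 2, 3)
# One multiply applied to all 4 lanes: prints [0, 2, 4, 6]
print(vec * 2)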
Device Buffer
We now have a host buffer that we can copy to the GPU:
# Allocate a buffer for the GPU
var in_dev = ctx.enqueue_create_buffer[dtype](in_bytes)
# Copy the data from the CPU to the GPU buffer
ctx.enqueue_copy_to_device(in_dev, in_host.unsafe_ptr())

Creating the GPU buffer allocates global memory, which can be accessed from any block and thread. Global memory is relatively slow compared to shared memory, which is shared between the threads of a single block; more on that later.
Tensor indexing from threads
Now that we have the data set up, we can wrap the data in a LayoutTensor so that we can reason about how to index into the array, allowing each thread to get its corresponding value:
from layout import Layout, LayoutTensor
# Row major: elements are stored sequentially in memory [0, 1, 2, 3, 4, 5, ...]
# Column major: used in some GPU optimizations, stored as [0, 4, 8, 12, 1, ...]
alias layout = Layout.row_major(blocks, threads)
var tensor = LayoutTensor[dtype, layout](in_dev.unsafe_ptr())

This LayoutTensor is a mutable view over the data stored inside in_dev; it doesn't own its memory, but it allows us to index using block and thread ids. Initially we'll just print the values to confirm it's indexing as we expect:
fn print_values_kernel(tensor: LayoutTensor[dtype, layout]):
    var bid = block_idx.x
    var tid = thread_idx.x
    print("block:", bid, "thread:", tid, "val:", tensor[bid, tid])
ctx.enqueue_function[print_values_kernel](
    tensor, grid_dim=blocks, block_dim=threads,
)
ctx.synchronize()

block: 3 thread: 0 val: 12
block: 3 thread: 1 val: 13
block: 3 thread: 2 val: 14
block: 3 thread: 3 val: 15
block: 0 thread: 0 val: 0
block: 0 thread: 1 val: 1
block: 0 thread: 2 val: 2
block: 0 thread: 3 val: 3
block: 2 thread: 0 val: 8
block: 2 thread: 1 val: 9
block: 2 thread: 2 val: 10
block: 2 thread: 3 val: 11
block: 1 thread: 0 val: 4
block: 1 thread: 1 val: 5
block: 1 thread: 2 val: 6
block: 1 thread: 3 val: 7

As in the visualization above, each block/thread is getting the corresponding value we expect. You can see block: 3 thread: 3 has the last value, 15.
Multiply Kernel
Now that we've verified we're getting the correct values when indexing, let's launch a kernel to multiply each value:
fn multiply_kernel[multiplier: Int](tensor: LayoutTensor[dtype, layout]):
    tensor[block_idx.x, thread_idx.x] *= multiplier
ctx.enqueue_function[multiply_kernel[2]](
    tensor,
    grid_dim=blocks,
    block_dim=threads,
)
ctx.synchronize()
# Copy data back to host and print as 2D array
ctx.enqueue_copy_from_device[dtype](in_host.unsafe_ptr(), in_dev)
# Wait for the copy to finish before reading the data on the host
ctx.synchronize()
var host_tensor = LayoutTensor[dtype, layout](in_host.unsafe_ptr())
print(host_tensor)
# Keep the host buffer alive until we've finished printing from it
_ = in_host.unsafe_ptr()

0 2 4 6
8 10 12 14
16 18 20 22
24 26 28 30

Congratulations! You've successfully run a kernel that modifies values on your GPU, and printed the result on your CPU. You can see above that each thread multiplied a single value by 2 in parallel on the GPU, and we copied the result back to the CPU.
Sum Reduce Output Tensor
We're going to set up a new buffer which will hold the reduced values: one sum per block, computed from all of that block's threads:
Output: [ block[0] block[1] block[2] block[3] ]

There's no need for a layout tensor here as it's a 1D array:
# Allocate to global memory
var out_dev = ctx.enqueue_create_buffer[dtype](blocks)
# Also create a buffer for the host
var out_host = ctx.enqueue_create_host_buffer[dtype](blocks)
ctx.synchronize()

The problem here is that we can't have all the threads in a block summing their values into the same index of the output buffer, as that would introduce race conditions. We're going to introduce new concepts to deal with this.
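To see why, here's a minimal sketch of the naive (incorrect) version, reusing the tensor from above and the DeviceBuffer type imported in the next snippet; every thread in a block does an unsynchronized read-modify-write on the same output element, so the final sum is undefined:

fn racy_sum_kernel(tensor: LayoutTensor[dtype, layout], out_buffer: DeviceBuffer[dtype]):
    # All 4 threads in the block update out_buffer[block_idx.x] at the same time: a data race
    out_buffer[block_idx.x] += tensor[block_idx.x, thread_idx.x][0]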
Shared memory
This is not an optimal solution, but it's a good way to introduce shared memory with a simple example; we'll cover better solutions in the next sections. You can allocate memory for each block when launching a kernel. This is called shared memory, and every thread within a block can communicate through it.
from gpu import barrier, lane_id
from gpu.memory import AddressSpace, external_memory
from sys import sizeof
from gpu.host import DeviceBuffer

fn sum_reduce_kernel(tensor: LayoutTensor[dtype, layout], out_buffer: DeviceBuffer[dtype]):
    # Get a pointer to this block's shared memory
    var shared = external_memory[
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
        alignment = sizeof[dtype](),
    ]()
    # Place the corresponding value into shared memory
    shared[thread_idx.x] = tensor[block_idx.x, thread_idx.x][0]
    # Wait for all the threads to finish loading their values into shared memory
    barrier()
    # If this is the first thread, sum and write the result to global memory
    if thread_idx.x == 0:
        for i in range(threads):
            out_buffer[block_idx.x] += shared[i]
ctx.enqueue_function[sum_reduce_kernel](
    tensor,
    out_dev,
    grid_dim=blocks,
    block_dim=threads,
    # One value of shared memory per thread in the block
    shared_mem_bytes=threads * sizeof[dtype](),
)
ctx.synchronize()
# Copy the data back to the host and print out the SIMD vector
ctx.enqueue_copy_from_device(out_host.unsafe_ptr(), out_dev)
ctx.synchronize()
print(out_host.unsafe_ptr().load[width=blocks]())

[6, 22, 38, 54]

For our first block/tile we summed the values:
sum([ 0 1 2 3 ]) == 6

And the reduction resulted in the output having the sum of 6 in the first position. Every tile/block has been reduced to:
[ 6 22 38 54 ]

Thread level SIMD
We could skip using shared memory altogether by using SIMD instructions; this is a faster option to consider if it suits your problem. Each thread has access to SIMD registers which can perform operations, such as reductions, on a vector. Here we'll launch one thread per block, load the 4 corresponding values from that block as a SIMD vector, and sum them together in a single operation:
fn simd_reduce_kernel(tensor: LayoutTensor[dtype, layout], out_buffer: DeviceBuffer[dtype]):
    out_buffer[block_idx.x] = tensor.load[4](block_idx.x, 0).reduce_add()
ctx.enqueue_function[simd_reduce_kernel](
    tensor,
    out_dev,
    grid_dim=blocks,
    block_dim=1,  # one thread per block
)
ctx.synchronize()
# Ensure we have the same result
ctx.enqueue_copy_from_device(out_host.unsafe_ptr(), out_dev)
ctx.synchronize()
print(out_host.unsafe_ptr().load[width=blocks]())

[6, 22, 38, 54]

This is much cleaner and faster: instead of 4 threads writing to shared memory, we're using 1 thread per block to do a single SIMD reduction. Shared memory has many uses, but as you learn more tools you'll be able to work out which approach is the most performant for your particular problem.
Warps
A warp is a group of threads (32 on NVIDIA, 64 on AMD) within a block. Threads within the same warp can synchronize their execution and exchange values directly through registers, performing SIMD-style operations in lockstep. We have only 4 threads within each block, well under the 32-thread warp size. If this wasn't the case you'd have to do two reductions: one from each warp into shared memory, then another from shared memory into the output buffer.
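For reference, here's a hedged sketch of what that two-stage version could look like, reusing the shared-memory tools from above and the warp.sum operation introduced below; the 128-thread block size, the warp size of 32, and the correspondingly larger layout are all assumptions for illustration, not part of this guide's example:

fn two_stage_reduce_kernel(tensor: LayoutTensor[dtype, layout], out_buffer: DeviceBuffer[dtype]):
    alias warp_size = 32        # assumption: 32 on NVIDIA, 64 on AMD
    alias warps_per_block = 4   # hypothetical: 128 threads per block
    # Shared memory to hold one partial sum per warp
    var partials = external_memory[
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
        alignment = sizeof[dtype](),
    ]()
    var value = tensor.load[1](block_idx.x, thread_idx.x)
    # First reduction: each warp sums its own values through registers
    var warp_total = warp.sum(value)
    # Lane 0 of each warp writes that warp's partial sum to shared memory
    if thread_idx.x % warp_size == 0:
        partials[thread_idx.x // warp_size] = warp_total
    barrier()
    # Second reduction: thread 0 combines the per-warp partial sums
    if thread_idx.x == 0:
        var total = Scalar[dtype](0)
        for i in range(warps_per_block):
            total += partials[i]
        out_buffer[block_idx.x] = total

Launching it would also need shared_mem_bytes=warps_per_block * sizeof[dtype]() so there's one shared slot per warp.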
Let's write your first warp reduction kernel:
from gpu import warp

fn warp_reduce_kernel(tensor: LayoutTensor[dtype, layout], out_buffer: DeviceBuffer[dtype]):
    # Each thread loads its corresponding value from the tensor
    var value = tensor.load[1](block_idx.x, thread_idx.x)
    # Each thread gets the value from one thread higher, summing them as they go
    var result = warp.sum(value)
    # Wait for all the threads in the warp to finish reducing
    barrier()
    # Thread 0 has the reduced sum of the values from all the other threads
    if thread_idx.x == 0:
        out_buffer[block_idx.x] = result
ctx.enqueue_function[warp_reduce_kernel](
    tensor,
    out_dev,
    grid_dim=blocks,
    block_dim=threads,
)
ctx.synchronize()
# Ensure we have the same result
ctx.enqueue_copy_from_device(out_host.unsafe_ptr(), out_dev)
ctx.synchronize()
print(out_host.unsafe_ptr().load[width=blocks]())

[6, 22, 38, 54]

It might not be immediately clear what's happening with the warp.sum function, so let's break it down with print statements:
fn warp_print_kernel(tensor: LayoutTensor[dtype, layout]):
    var value = tensor.load[1](block_idx.x, thread_idx.x)
    # Get the value from the thread 1 position higher
    var next_value = warp.shuffle_down(value, offset=1)
    if block_idx.x == 0:
        print("thread:", thread_idx.x, "value:", value, "next_value:", next_value)
ctx.enqueue_function[warp_print_kernel](
    tensor,
    grid_dim=blocks,
    block_dim=threads,
)
ctx.synchronize()

thread: 0 value: 0 next_value: 1
thread: 1 value: 1 next_value: 2
thread: 2 value: 2 next_value: 3
thread: 3 value: 3 next_value: 0

In the print statements above, you can see that each thread gets the value from the thread 1 position higher in the block. When we get to the 4th thread there is no higher thread, so it returns 0. Instead of using warp.sum we can write our own custom reduction that sums the values step by step:
@parameter
fn custom_reduce[
    type: DType, width: Int
](lhs: SIMD[type, width], rhs: SIMD[type, width]) -> SIMD[type, width]:
    return lhs + rhs
fn custom_warp_reduce_kernel(tensor: LayoutTensor[dtype, layout], out_buffer: DeviceBuffer[dtype]):
    var value = tensor.load[1](block_idx.x, thread_idx.x)
    var result = warp.reduce[warp.shuffle_down, custom_reduce](value)
    barrier()
    if block_idx.x == 0:
        print("thread:", thread_idx.x, "value:", value, "result:", result)
    if thread_idx.x == 0:
        out_buffer[block_idx.x] = result
ctx.enqueue_function[custom_warp_reduce_kernel](
    tensor,
    out_dev,
    grid_dim=blocks,
    block_dim=threads,
)
# Check our new result
print("Block 0 reduction steps:")
ctx.enqueue_copy_from_device(out_host.unsafe_ptr(), out_dev)
ctx.synchronize()
print("\nAll blocks reduced output buffer:")
print(out_host.unsafe_ptr().load[width=blocks]())

Block 0 reduction steps:
thread: 0 value: 0 result: 6
thread: 1 value: 1 result: 6
thread: 2 value: 2 result: 5
thread: 3 value: 3 result: 3
All blocks reduced output buffer:
[6, 22, 38, 54]

You can see in the output that the first block had the values [ 0 1 2 3 ] and was reduced from top to bottom (shuffle down) in this way, where the result of one thread is passed to the next thread down:
thread 3: value=3 next_value=N/A result=3
thread 2: value=2 next_value=3 result=5
thread 1: value=1 next_value=5 result=6
thread 0: value=0 next_value=6 result=6

Exercises
Now that you've learnt many of the core primitives for GPU programming, here are some exercises to cement the knowledge. Feel free to go back and look at the examples.
- Create a host buffer for the input of dtype Int64, with 32 elements, and initialize it with sequentially ordered numbers. Copy the host buffer to the device.
- Create a tensor that wraps the device buffer, with the dimensions (8, 4).
- Create a host buffer and a device buffer for the output of dtype Int64, with 8 elements.
- Launch a GPU kernel with 8 blocks and 4 threads that takes every value and raises it to the power of 2 using x**2, then does a reduction using your preferred method to write to the output buffer.
- Copy the device buffer to the host buffer, and print it out on the CPU.