There are numerous online tutorials that teach parallel programming with CUDA C++, and we will borrow from some of them. The best way to learn is to read the official documentation. This tutorial, however, starts by writing code, and we will explain the architecture along the way.

We will take a few days to go over the principles of CUDA programming. Afterwards, we will read through ML code to see how its operations are implemented as CUDA kernels. There is no strict prerequisite (except basic programming), but it helps to know a little about operating systems and CPUs.

When encountering a new language, the first thing we should do is write the "hello world" program. Before that, make sure you have installed the CUDA environment; we will not cover that part.

#include <stdio.h>
__global__ void hello_world(void)
{
	printf("Hello World from GPU!\n");
}

int main(int argc, char **argv)
{
	printf("Hello World from CPU!\n");
	
	hello_world<<<1, 10>>>();
	
	cudaDeviceReset(); // Wait for the GPU to finish and flush its printf output
	
	return 0;
}

You can use

nvcc hello_world.cu -o hello
./hello

to compile and run the program. If nvcc cannot be found even though the NVIDIA driver is installed, check whether /usr/local/cuda/bin/nvcc exists, and add it to your environment path:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

As we can see, most of the code is written in the C style we are familiar with. We write a function hello_world and call it in the main function. The __global__ qualifier tells the compiler that the function is a kernel: it is called from the host (CPU) but runs on the device (GPU). <<<1, 10>>> is the launch configuration for the device. We do not need to know its exact meaning for now, but you can try different numbers in each position and watch how the output changes. cudaDeviceReset() is used to synchronize the CPU and GPU. If we comment out this line, we will find that only the CPU line is printed. This is because the CPU and GPU execute asynchronously: a kernel launch returns control to the main thread immediately, regardless of whether the kernel has finished on the GPU. Therefore, we need to wait for the GPU function to finish before the program exits.
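To get a feel for what the numbers in <<<...>>> mean, here is a small variation of the kernel above (a sketch; the kernel name is our own, and running it requires an NVIDIA GPU and nvcc) in which each thread prints its own coordinates. Note that it uses cudaDeviceSynchronize(), another way to make the CPU wait for the GPU:

```cuda
#include <stdio.h>

// Each thread prints which block it belongs to and its index within that block.
__global__ void whoami(void)
{
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    // <<<2, 4>>> launches 2 blocks of 4 threads each: 8 lines in total.
    whoami<<<2, 4>>>();
    cudaDeviceSynchronize(); // Wait for the kernel to finish before exiting
    return 0;
}
```

Changing the first number changes how many blocks are launched; changing the second changes how many threads run in each block.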

Cool, we have started already. Basically, a CUDA program will be divided into the following steps:

  1. Allocate GPU memory
  2. Copy input data from the host to the device
  3. Launch a CUDA kernel function to compute
  4. Copy the result back to the host
  5. Free the memory

It is not an easy process, and the key to CUDA programming is understanding the architecture of the GPU. We are touching something related to the hardware (but not the hardware itself). CUDA provides an abstraction of the hardware and a hierarchical organization of threads and memory. We need to learn the programming model of the GPU first.
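The five steps above can be sketched with a minimal element-wise vector addition (the kernel and variable names here are only illustrative, and the program needs an NVIDIA GPU to run):

```cuda
#include <stdio.h>
#include <stdlib.h>

// One thread computes one element of c = a + b.
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // Guard: the grid may have more threads than elements
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    // Host-side buffers with some test data
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    // 1. Allocate GPU memory
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // 2. Copy input data from the host to the device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 3. Launch the kernel: enough 256-thread blocks to cover n elements
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // 4. Copy the result back to the host (this copy waits for the kernel)
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %f\n", h_c[10]); // 10 + 20 = 30

    // 5. Free device and host memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

Do not worry about the index arithmetic in the kernel yet; it follows directly from the thread hierarchy we are about to study.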