First of all, we may need a review for the very basic CUDA programming and GPU execution principle. Here, I want to use the diagram below to demonstrate the threadings on GPU with a very simple array addition operation example.

Since each ‘worker’ has his/her own unique ID with which we can assign a unique task for each of them. Therefore, the actual order of the job to be carried out does not really matter that much. That’s why we see that in the diagram above, the #35 part of the job is shown above #10. This is just to demonstrate the idea that order here does not matter. In practice, they all happen in parallel and no one really knows or cares which one of them gets executed/finished first.
Based on the further discussion above, we now move on to a slightly complicated case, where we are going to do matrix multiplication with GPU. First, the basic principle of conducting matrix multiplication is shown in the following picture,

To conduct the matrix multiplication in a parallel way on GPU, the fundamental idea is to execute each of the block within dashed boxes, independently on individual thread on GPU. To realize that, the basic workflow is presented in the following picture,

Following the coding scheme as presented above, the fundamental difference as compared to usual serial coding here is that we represent the matrix in a one-dimensional manner, as indicated by the red arrows on the top of the figure above. The purpose of such a representation is for the ease of coding on GPU kernel, as we will see in the code snippet that will be presented later in current blog. Once we transform our matrix to its one dimensional form, the next step is again to let ‘workers’ shout out their unique ID, as we have discussed above. In this case, for demo purpose, assume we have a single block which contains (3\times2) thread structure, then in practice we have a one-to-one mapping between ‘workers’ and the entries of the final product matrix, i. e. each ‘worker’ is responsible for one entry in the final product matrix and none of them is wasted. The reason why we want to point out ‘none is wasted’ is that in practice, it often happens the overall GPU block unit is larger than what we really need. For example, in current example, if we claim a single GPU block with (5\times4) thread structure, we can still find right amount of ‘workers’ to do jobs for us, but in this case, for sure some of the ‘workers’ will not be assigned jobs, thus ‘wasted’. Back to our workflow, once all ‘workers’ know what they are responsible for, they can go ahead to do their jobs, i. e. multiplying the right row and column of the input matrices and sum up the result to obtain the entry for the final product matrix, as indicated by the part of the diagram above containing a whole bunch of grouped (by ‘worker’) squares.
The explanation presented above about matrix multiplication with CUDA is based on the original blog post in Ref. [1]. Also, the code snippets there are reproduced here in current blog, with detailed comments for better understanding. A makefile, together with instructions about how-to, is also provided at the end of current blog, with which one can straightforwardly compile the code given below.

main_mat_mul.cu

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 #include #include #include #include #include #include "kernel.h" #include "kernel.cu" #include "dev_array.h" #include using namespace std; int main() { // Perform matrix multiplication C = A * B // where A, B and C are NxN matrices. int N = 16; int SIZE = N*N; // Allocate memory on the host. vector h_A(SIZE); vector h_B(SIZE); vector h_C(SIZE); // Initialize matrices on the host. for (int i=0; i d_A(SIZE); dev_array d_B(SIZE); dev_array d_C(SIZE); // The 'set' method defined in the 'dev_array.h' file is responsible for // copying data from host to device. Here, the first parameter '&h_A[0]' or // '&h_B[0]' is a pointer specifying the starting position in memory to // copy from. The second parameter 'SIZE' specifies the number of 'blocks' // we will copy. Detailed explanation about 'block' will be given in the // 'dev_array.h' file (refer to the definition of 'set' method there). d_A.set(&h_A[0], SIZE); d_B.set(&h_B[0], SIZE); // Now, we do the matrix multiplication on GPU, by calling the function // below, which is defined in another separate .cu file - 'kernel.cu' in // the header of current file. Within 'kernel.cu' file, the fundamental // kernel function will be defined and called, which is the real part that // is executing the matrix multiplication. Therefore, the calling scheme // can be roughly depicted as follows: // // call call // Current file -----> matrixMultiplication -----> kernel // matrixMultiplication(d_A.getData(), d_B.getData(), d_C.getData(), N); cudaDeviceSynchronize(); // The 'get' function is also defined in 'dev_array.h' file, as the 'set' // function above. It is responsible for copying data from device to host. d_C.get(&h_C[0], SIZE); cudaDeviceSynchronize(); // Next, we are going to do the same thing as above, i. e. matrix // multiplication. This time, we will be doing it on CPU only. float *cpu_C; cpu_C=new float[SIZE]; // This is the familiar way of doing matrix multiplication in a serial // manner on CPU. float sum; for (int row=0; row

dev_array.h

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 #ifndef _DEV_ARRAY_H_ #define _DEV_ARRAY_H_ #include #include #include // Template of type - it's just like a place holder, specifying the declared // variable is of SOME type (not known when being declared). Only when the // declared variable is used do we know what its type is. // See the following site for detailed explanation: // http://www.cplusplus.com/doc/oldtutorial/templates/ template class dev_array { public: // Refer to the following link for explanation about 'explicit': // https://stackoverflow.com/questions/121162/what-does-the-explicit-keyword-mean // // Here follows we can find explanation about the functionality of ":": // https://stackoverflow.com/questions/2785612/c-what-does-the-colon-after-a-constructor-mean // // Basicaly, we are initializing variables here. explicit dev_array() : start_(0), end_(0) {} // Refer to the following link for explanation about data type 'size_t': // https://en.cppreference.com/w/cpp/types/size_t explicit dev_array(size_t size) { // The 'allocate' function is responsible for allocating memory on // GPU - refer to the definition of 'allocate' function in the end of // current file - the private functions part. allocate(size); } ~dev_array() { free(); } // Both the variables 'start_' and 'end_', as we will see later in current // file, are pointers - 'start_' points to the starting position (in // memory) and 'end_' points to the end. Therefore, the function defined // here follows returns the size of the allocated array on GPU. // Refer to the following site for explanation about const member function // (i. e. the 'const' keyword after the function name below): // https://docs.microsoft.com/en-us/cpp/cpp/const-cpp?view=vs-2019 size_t getSize() const { return end_ - start_; } // Return the starting position in memory, as a pointer. T* getData() const { return start_; } // Function responsible for copying data from host to device. 'start_' // specifies where the copied data will start on device; 'src' specifies // the source of the copying on host; 'sizeof(T)' specifies the unit of // the copied block and 'min' here gives how many such units of block will // be copied. void set(const T* src, size_t size) { size_t min = std::min(size, getSize()); cudaError_t result = cudaMemcpy(start_, src, min * sizeof(T), cudaMemcpyHostToDevice); if (result != cudaSuccess) { throw std::runtime_error("failed to copy to device memory"); } } // Similar to the 'set' function above. In this case, we have the copy // direction reversed - from device to host. void get(T* dest, size_t size) { size_t min = std::min(size, getSize()); cudaError_t result = cudaMemcpy(dest, start_, min * sizeof(T), cudaMemcpyDeviceToHost); if (result != cudaSuccess) { throw std::runtime_error("failed to copy to host memory"); } } private: // Allocate memory on device. The starting position in memory on device will // be returned as a pointer. void allocate(size_t size) { cudaError_t result = cudaMalloc((void**)&start_, size * sizeof(T)); if (result != cudaSuccess) { start_ = end_ = 0; throw std::runtime_error("failed to allocate device memory"); } end_ = start_ + size; } // Free memory on device. void free() { if (start_ != 0) { cudaFree(start_); start_ = end_ = 0; } } T* start_; T* end_; }; #endif 

kernel.h

 1 2 3 4 5 6 #ifndef KERNEL_CUH_ #define KERNEL_CUH_ void matrixMultiplication(float *A, float *B, float *C, int N); #endif 

kernel.cu

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 #include #include #include "cuda_runtime.h" #include "kernel.h" #include using namespace std; // This is where the real magic is happening - the matrix multiplication on GPU // in a parallel manner. __global__ void matrixMultiplicationKernel(float* A, float* B, float* C, int N) { // Here, 'workers' are assigned task, according to their unique ID. // N. B. the way used here to assign rows and columns for 'workers' // guarantee that no two 'workers' will be assigned the same row and column // at the same time. int ROW = blockIdx.y*blockDim.y+threadIdx.y; int COL = blockIdx.x*blockDim.x+threadIdx.x; float tmpSum = 0; // Now each 'worker' knows exactly what he/she is responsible, it's time // for them to do their jobs, individually. if (ROW < N && COL < N) { for (int i = 0; i < N; i++) { tmpSum += A[ROW * N + i] * B[i * N + COL]; } } C[ROW * N + COL] = tmpSum; } void matrixMultiplication(float *A, float *B, float *C, int N) { // Here, we use 'dim3' type variables for declaring the number of blocks // per grid and the number of threads per block. // Refer to the following link about 'dim3' in CUDA: // https://codeyarns.github.io/tech/2011-02-16-cuda-dim3.html // or // https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html dim3 threadsPerBlock(N, N); dim3 blocksPerGrid(1, 1); // The IF statement guarantees that we have at most 512 threads per block. if (N*N > 512){ threadsPerBlock.x = 32; threadsPerBlock.y = 16; blocksPerGrid.x = ceil(double(N)/32); blocksPerGrid.y = ceil(double(N)/16); } matrixMultiplicationKernel<<>>(A, B, C, N); } 

========================I AM A SEPARATOR========================

To compile the codes given above, one may want to first save each individual snippet to a separate file (with file name as exactly given on top of each snippet) and also guarantee that all files stay in the same folder. Then one can use the makefile given below to compile the codes to get the executable, by simply executing ‘make’ on terminal.

Makefile

  1 2 3 4 5 6 7 8 9 10 11 12 NCC=nvcc OBJ=kernel.o main_mat_mul.o mat_mul: $(OBJ)$(NCC) -o $@$^ main_mat_mul.o: kernel.o     $(NCC) -c main_mat_mul.cu kernel.o:$(NCC) -c kernel.cu `

References