Introduction to GPGPU computing with CUDA

Exploring the CUDA environment

  1. Log on to one of the workstations and open a terminal. Make sure the CUDA module is loaded:

    module load cuda
    
    (You may want to type module save so that the CUDA module is loaded by default).

  2. Type deviceQuery and examine the output. It should resemble the output shown below.

    /usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery Starting...
    
     CUDA Device Query (Runtime API) version (CUDART static linking)
    
    Detected 1 CUDA Capable device(s)
    
    Device 0: "Quadro K620"
      CUDA Driver Version / Runtime Version          9.0 / 9.0
      CUDA Capability Major/Minor version number:    5.0
      Total amount of global memory:                 1999 MBytes (2096431104 bytes)
      ( 3) Multiprocessors, (128) CUDA Cores/MP:     384 CUDA Cores
      GPU Max Clock rate:                            1124 MHz (1.12 GHz)
      Memory Clock rate:                             900 Mhz
      Memory Bus Width:                              128-bit
      L2 Cache Size:                                 2097152 bytes
      Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
      Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  2048
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
      Run time limit on kernels:                     Yes
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Disabled
      Device supports Unified Addressing (UVA):      Yes
      Supports Cooperative Kernel Launch:            No
      Supports MultiDevice Co-op Kernel Launch:      No
      Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    
    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
    Result = PASS
    

    Notice three lines in particular: the CUDA Capability version numbers, the multiprocessor and CUDA core counts, and the run time limit on kernels.

    1. The first reports the Compute Capability of the device installed on your workstation. This is 5.0 on three of our workstations (Habakkuk, Zephaniah, and Zechariah), 6.1 on Amos, and 3.0 on the eight other workstations.

    2. The second line reports the number of CUDA cores (streaming processors) that are available. Notice how they are organized: either two multiprocessors (called SMs in our text) with 192 cores each, or three multiprocessors with 128 cores each.

    3. The third line indicates that compute kernels will only be allowed to run for a short time on the GPU. This is because the GPUs on the workstations also act as display controllers and so cannot be taken over indefinitely by compute jobs. The time limit is on the order of seconds, which is plenty for today's exercise, so we don't need to do anything to change it.

    There is a lot of other useful information here; take a moment to read through it and ask questions if you want to. The deviceQuery program is part of a collection of sample programs distributed with the CUDA SDK (Software Development Kit). On our workstations the sources and executables for the samples can be found in /usr/local/cuda/samples/ and /usr/local/cuda/samples/bin/x86_64/linux/release respectively.
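
    If you are curious where these values come from, the same information is available to your own programs through the CUDA runtime API. The short program below is a minimal sketch, not one of the SDK samples; it prints a few of the fields that deviceQuery reports. Compile it with nvcc (introduced in the next section).

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount( &count );
        printf( "Detected %d CUDA Capable device(s)\n", count );
        for ( int dev = 0; dev < count; dev++ )
        {
            cudaDeviceProp prop;
            cudaGetDeviceProperties( &prop, dev );  // fills a struct of device attributes
            printf( "Device %d: \"%s\"\n", dev, prop.name );
            printf( "  Compute capability: %d.%d\n", prop.major, prop.minor );
            printf( "  Multiprocessors:    %d\n", prop.multiProcessorCount );
            printf( "  Global memory:      %zu bytes\n", prop.totalGlobalMem );
        }
        return 0;
    }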

An example CUDA program

  1. Change to your cps343-hoe directory and update your repository using git fetch and git pull. Change into the 08-intro-to-cuda directory.

  2. Examine the source code in add-vectors.cu until you are comfortable with its operation. In particular, be sure you can identify which parts of the program correspond with each part of the pattern described in the program's heading comments.
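
    Before moving on, it may help to see the general shape such a program takes. The sketch below is a generic CUDA vector addition following the usual allocate / copy / launch / copy-back / free pattern; it is not the actual add-vectors.cu source, so names and details will differ.

    #include <cstdio>
    #include <cuda_runtime.h>

    // One thread per element; the guard handles any extra threads in the
    // last block when n is not a multiple of the block size.
    __global__ void add_vectors( double *c, const double *a, const double *b, int n )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ( i < n ) c[i] = a[i] + b[i];
    }

    int main()
    {
        const int n = 8;
        double a[n], b[n], c[n];
        for ( int i = 0; i < n; i++ ) { a[i] = i; b[i] = 2.0 * i; }

        // 1. allocate memory on the device
        double *a_d, *b_d, *c_d;
        cudaMalloc( &a_d, n * sizeof( double ) );
        cudaMalloc( &b_d, n * sizeof( double ) );
        cudaMalloc( &c_d, n * sizeof( double ) );

        // 2. copy input data from host to device
        cudaMemcpy( a_d, a, n * sizeof( double ), cudaMemcpyHostToDevice );
        cudaMemcpy( b_d, b, n * sizeof( double ), cudaMemcpyHostToDevice );

        // 3. launch enough blocks to cover all n elements
        const int block_size = 16;
        const int num_blocks = ( n + block_size - 1 ) / block_size;
        add_vectors<<< num_blocks, block_size >>>( c_d, a_d, b_d, n );

        // 4. copy the result back from device to host
        cudaMemcpy( c, c_d, n * sizeof( double ), cudaMemcpyDeviceToHost );

        // 5. free device memory
        cudaFree( a_d ); cudaFree( b_d ); cudaFree( c_d );

        for ( int i = 0; i < n; i++ ) printf( "%g ", c[i] );
        printf( "\n" );
        return 0;
    }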

  3. Compile and run the program:

    nvcc -o add-vectors add-vectors.cu
    ./add-vectors
    

  4. The output will probably not be too exciting, but it should convince you that the program is working correctly. Try running the program with different vector lengths:

    ./add-vectors 5
    ./add-vectors 50
    ./add-vectors 10000
    ./add-vectors 100000000
    
    The program doesn't display vectors longer than 100 elements, so the last two commands won't produce any output. Notice, however, that the computation is correct for a range of sizes, even though our block size was set to 16.
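
    The reason this works for any length is that the launch rounds the number of blocks up and the kernel skips out-of-range indices. Assuming add-vectors.cu follows the usual pattern, the arithmetic for ./add-vectors 5 looks like this:

    int block_size = 16;
    int num_blocks = ( n + block_size - 1 ) / block_size;  // n = 5: (5 + 15) / 16 = 1 block
    // The single block still starts 16 threads; those with index i >= 5 do
    // nothing because the kernel body is guarded by a test like  if ( i < n )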

  5. CUDA SDKs since version 5.0 have included a profiler. You do not need to instrument or recompile your code; just run the profiler with your program and any arguments:

    nvprof ./add-vectors 1000
    
    The output will include timing information for each CUDA function. Notice that the program spends most of its time allocating memory on the device when the vector length is 1000. Now try

    nvprof ./add-vectors 100000000
    
    and you should find very different behavior: the time to copy memory to and from the device dominates.

Now it's your turn

Exercise: Write a program that initializes two M×N matrices and computes the sum of the two matrices on the GPU device. After copying the result back to the host, your program should print out the result matrix if N≤10. You may use add-vectors.cu as a starting point or start from scratch.

It is natural to use a 2D grid for a matrix. In this case the block_size and num_blocks variables should be of type dim3. The kernel launch code shown below accomplishes this:

  dim3 block_size( 16, 16 );
  dim3 num_blocks( ( n - 1 + block_size.x ) / block_size.x, 
                   ( m - 1 + block_size.y ) / block_size.y );
  add_matrices<<< num_blocks, block_size >>>( c_d, a_d, b_d, m, n );

Of course, the kernel code will need to work correctly with a 2D grid rather than the 1D grid used in add-vectors.cu.
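
A matching kernel might look like the sketch below. It assumes the matrices are stored in row-major order as flat arrays; the name add_matrices and the argument order simply mirror the launch above, so adjust both to fit your own code.

  // One thread per matrix element. The x dimension indexes columns and the
  // y dimension indexes rows, matching the num_blocks calculation above.
  __global__ void add_matrices( double *c, double *a, double *b, int m, int n )
  {
      int col = blockIdx.x * blockDim.x + threadIdx.x;
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      if ( row < m && col < n )
      {
          int idx = row * n + col;   // row-major: element (row, col)
          c[idx] = a[idx] + b[idx];
      }
  }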

Test your code with a range of values of M and N. For each case, run your program both without and with the profiler.

What to turn in

Please turn in a printout of your final matrix-addition source code along with a short report summarizing the profiling data.