CUDA Shared Memory

Using device shared memory

  1. Make sure your repository is up-to-date:

    git fetch && git pull
    
  2. Change into the 09-cuda-shmem directory.

  3. Examine the source code in matmul.cu. It contains two kernels that perform matrix-matrix multiplication: the first uses only the GPU device's global memory, while the second stages its operands in the GPU's shared memory.

    Most of the CUDA runtime calls in the main() function should look familiar, but here each one is “wrapped” in a function that checks its return value to make sure no error occurred. Two CUDA functions not seen in the previous hands-on exercise are cudaDeviceSynchronize(), which waits for a kernel to finish, and cudaGetLastError(), which checks whether an error occurred during kernel execution. Both kernels and the error-checking pattern are sketched below.
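
    For reference, here is a minimal sketch of what the two kernels and the error-checking wrapper might look like. It is only an illustration: the names used here (CHECK, matmulGlobal, matmulShared, BLOCK_SIZE) and the details of the actual matmul.cu may differ.

    #include <cstdio>
    #include <cstdlib>

    // Hypothetical tile width; the makefile builds one executable per size.
    #define BLOCK_SIZE 16

    // A typical error-checking wrapper of the kind described above.
    #define CHECK(call)                                               \
        do {                                                          \
            cudaError_t err = (call);                                 \
            if (err != cudaSuccess) {                                 \
                fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                        cudaGetErrorString(err), __FILE__, __LINE__); \
                exit(EXIT_FAILURE);                                   \
            }                                                         \
        } while (0)

    // Kernel 1: each thread computes one element of c, reading every
    // operand directly from global memory.
    __global__ void matmulGlobal(const float *a, const float *b,
                                 float *c, int n)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += a[row * n + k] * b[k * n + col];
            c[row * n + col] = sum;
        }
    }

    // Kernel 2: each block stages BLOCK_SIZE x BLOCK_SIZE tiles of a and b
    // in shared memory, so each global-memory element is loaded once per
    // tile rather than once per thread.
    __global__ void matmulShared(const float *a, const float *b,
                                 float *c, int n)
    {
        __shared__ float ta[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float tb[BLOCK_SIZE][BLOCK_SIZE];
        int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
        int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        float sum = 0.0f;
        for (int t = 0; t < (n + BLOCK_SIZE - 1) / BLOCK_SIZE; t++) {
            int ac = t * BLOCK_SIZE + threadIdx.x;  // column of a to load
            int br = t * BLOCK_SIZE + threadIdx.y;  // row of b to load
            ta[threadIdx.y][threadIdx.x] =
                (row < n && ac < n) ? a[row * n + ac] : 0.0f;
            tb[threadIdx.y][threadIdx.x] =
                (br < n && col < n) ? b[br * n + col] : 0.0f;
            __syncthreads();  // wait until both tiles are fully loaded
            for (int k = 0; k < BLOCK_SIZE; k++)
                sum += ta[threadIdx.y][k] * tb[k][threadIdx.x];
            __syncthreads();  // wait before the tiles are overwritten
        }
        if (row < n && col < n)
            c[row * n + col] = sum;
    }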

  4. Compile the program with make. This will build three different executables, each with a different CUDA block size. Each program takes the matrix dimension as a command-line argument. Run the programs with several different values; try 10, 100, 1000, 2000, and others, but do not exceed 4000. Notice the times reported for the matrix-matrix products and the speed-up afforded by using the GPU device's shared memory. The sketch below shows how such a launch might be configured and timed.
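
    A hedged sketch, reusing CHECK, BLOCK_SIZE, and matmulShared from the sketch above, of how main() might configure, launch, and time a kernel. The device pointers dA, dB, and dC are assumed to have been allocated and filled already; the actual code in matmul.cu may differ.

    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 blocks((n + BLOCK_SIZE - 1) / BLOCK_SIZE,   // ceiling division so the
                (n + BLOCK_SIZE - 1) / BLOCK_SIZE);  // grid covers all of c

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaEventRecord(start));
    matmulShared<<<blocks, threads>>>(dA, dB, dC, n);
    CHECK(cudaGetLastError());       // did the launch itself fail?
    CHECK(cudaDeviceSynchronize());  // wait for the kernel to finish
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("shared-memory kernel time: %g seconds\n", ms / 1000.0);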

  5. Examine the shell script collect_data.sh. It runs each of the three programs over a range of matrix dimensions (1580 through 1620). The tee program copies its input to two destinations: the terminal (or wherever standard output is going) and the named file. The script then separates the global-memory and shared-memory kernel timing data into individual files.

  6. Run the shell script with

    ./collect_data.sh

    This will take a minute or so. Once it's finished, you'll find a number of new files in your directory with names like Global.QuadroK620.16 and Shared.QuadroK620.16. The first part of each name indicates whether the data came from the global-memory or the shared-memory kernel, the second part identifies the GPU device, and the third part is the CUDA block dimension.

    Assuming you've done this on Zechariah (a.k.a. nabi.cs.gordon.edu), use ssh to connect to Haggai, Hosea, Joel, Jonah, Malachi, Micah, Nahum, or Obadiah and run the command again. Once you've collected the data you can exit the shell to get back to Zechariah. For the curious, you can also try this on Amos.

    Start the gnuplot program and try the following graphing commands typed at the gnuplot> prompt (replace “QuadroK620” with “QuadroK2000” or “QuadroP400” as appropriate):

    set key center left
    set style data linespoints
    set xlabel "Matrix dimension"
    set ylabel "Time (seconds)"
    set title "QuadroK620 matrix-matrix timing data"
    plot 'Global.QuadroK620.16','Global.QuadroK620.32','Shared.QuadroK620.16','Shared.QuadroK620.32'
        

    Be sure you plot the data from each of the data files and compare them with one another. Look for interesting patterns. If possible, compare your results with someone who is working on a machine with a different type of GPU.

Assignment

Submit answers to the following questions by Friday.
  1. Which kernel performs faster?
  2. Which kernel has more consistent performance?
  3. What is the pattern of performance variations in the less consistent kernel?
  4. Do a web search for “cuda bank conflicts” and explain how the behavior you observed might be related to this phenomenon.