Log on to one of the workstations and open a terminal. Make sure the CUDA module is loaded:

    module load cuda

(You may want to type

    module save

so that the CUDA module is loaded by default.) Type

    deviceQuery

and examine the output. It should resemble the output shown below.
/usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Quadro K620"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 1999 MBytes (2096431104 bytes)
  ( 3) Multiprocessors, (128) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            1124 MHz (1.12 GHz)
  Memory Clock rate:                             900 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size   (x,y,z):   (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS
Pay particular attention to three of these lines: the CUDA Capability line, the Multiprocessors/CUDA Cores line, and the run time limit line.
The first reports the Compute Capability of the device installed in your workstation: 5.0 on three of our workstations (Habakkuk, Zephaniah, and Zechariah), 6.1 on Amos, and 3.0 on the other eight workstations.
The second reports the number of CUDA cores (streaming processors) available and how they are organized: either two multiprocessors (called SMs in our text) with 192 cores each, or three multiprocessors with 128 cores each.
The third indicates that compute kernels will only be allowed to run on the GPU for a short time. This is because the GPUs in the workstations are also acting as display controllers and so cannot be taken over entirely for compute jobs. The limit is on the order of seconds, which is long enough for today's exercise, so we don't need to do anything to change it.
The deviceQuery program is part of a collection of sample programs distributed with the CUDA SDK (Software Development Kit). On our workstations the sources and executables for the samples can be found in /usr/local/cuda/samples/ and /usr/local/cuda/samples/bin/x86_64/linux/release, respectively.
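Incidentally, deviceQuery gathers this information through the CUDA runtime API. If you're curious, the sketch below (the file name and output format are our own, not part of the SDK samples) shows how a few of the same properties can be retrieved with cudaGetDeviceProperties():

    // query-props.cu -- illustrative sketch, not part of the SDK samples
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main( void )
    {
        int count;
        cudaGetDeviceCount( &count );
        printf( "Detected %d CUDA capable device(s)\n", count );

        for ( int dev = 0; dev < count; dev++ )
        {
            cudaDeviceProp prop;
            cudaGetDeviceProperties( &prop, dev );
            printf( "Device %d: \"%s\"\n", dev, prop.name );
            printf( "  Compute capability:        %d.%d\n", prop.major, prop.minor );
            printf( "  Multiprocessors:           %d\n", prop.multiProcessorCount );
            printf( "  Run time limit on kernels: %s\n",
                    prop.kernelExecTimeoutEnabled ? "Yes" : "No" );
        }
        return 0;
    }

If you want to try it, it compiles like any other CUDA program: nvcc query-props.cu -o query-props.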
Change to your cps343-hoe directory and update your repository using git fetch and git pull. Then change into the 08-intro-to-cuda directory.
Examine the source code in add-vectors.cu until you are comfortable with its operation. In particular, be sure you can identify which parts of the program correspond to each part of the pattern described in the program's heading comments; a sketch of a typical version of that pattern follows.
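The pattern referred to is presumably the standard CUDA host-side sequence: allocate device memory, copy input to the device, launch the kernel, copy the result back, and free the device memory. Here is a minimal, self-contained sketch of that sequence using a made-up kernel (the names are illustrative and are not the ones used in add-vectors.cu):

    // pattern-sketch.cu -- hypothetical example of the standard CUDA pattern
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Trivial kernel: multiply each element of x by s
    __global__ void scale( float *x, float s, int n )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ( i < n )
            x[i] *= s;
    }

    int main( void )
    {
        const int n = 100;
        float *x = (float *) malloc( n * sizeof( float ) );
        for ( int i = 0; i < n; i++ )
            x[i] = i;

        float *x_d;
        cudaMalloc( (void **) &x_d, n * sizeof( float ) );     // 1. allocate device memory
        cudaMemcpy( x_d, x, n * sizeof( float ),
                    cudaMemcpyHostToDevice );                  // 2. copy input to device
        scale<<< ( n + 255 ) / 256, 256 >>>( x_d, 2.0f, n );   // 3. launch kernel
        cudaMemcpy( x, x_d, n * sizeof( float ),
                    cudaMemcpyDeviceToHost );                  // 4. copy result back
        cudaFree( x_d );                                       // 5. free device memory

        printf( "x[10] = %g (should be 20)\n", x[10] );
        free( x );
        return 0;
    }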
Compile and run the program:
    nvcc -o add-vectors add-vectors.cu
    ./add-vectors
The output will probably not be too exciting, but it should convince you that the program is working correctly. Try running the program with different vector lengths:

    ./add-vectors 5
    ./add-vectors 50
    ./add-vectors 10000
    ./add-vectors 100000000

The program doesn't display vectors longer than 100 elements, so the last two commands won't produce any output. Notice, however, that the computation is correct for a range of sizes, even though our block size was set to 16; the sketch below shows why.
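What makes arbitrary lengths work is a two-part idiom: the launch rounds the number of blocks up, and the kernel guards against indices past the end of the vectors. A sketch of the idea (the names are assumed to match add-vectors.cu's style; check the actual source):

    // Kernel: threads whose global index falls past the end do nothing
    __global__ void add_vectors( float *c, float *a, float *b, int n )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ( i < n )              // guard: the last block may be only partly used
            c[i] = a[i] + b[i];
    }

    /* Host side: integer arithmetic for "n / block_size, rounded up",
       so the grid always covers all n elements. */
    int num_blocks = ( n + block_size - 1 ) / block_size;
    add_vectors<<< num_blocks, block_size >>>( c_d, a_d, b_d, n );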
Now run the program under the CUDA profiler:

    nvprof ./add-vectors 1000

The output will show timing information for each CUDA function. Notice that the program spends most of its time allocating memory on the device when the vector length is 1000. Now try

    nvprof ./add-vectors 100000000

and you should find very different behavior: the time to copy memory to and from the device now dominates. A sketch of how to confirm one of these timings yourself appears below.
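nvprof gathers these timings automatically, but you can also measure a single operation yourself with CUDA events. A sketch (it assumes a device pointer a_d and a host array a of n floats already exist, and that <stdio.h> has been included):

    // Sketch: time one host-to-device copy with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );

    cudaEventRecord( start );
    cudaMemcpy( a_d, a, n * sizeof( float ), cudaMemcpyHostToDevice );
    cudaEventRecord( stop );
    cudaEventSynchronize( stop );      // wait until the copy has completed

    float ms;
    cudaEventElapsedTime( &ms, start, stop );
    printf( "host-to-device copy took %g ms\n", ms );

    cudaEventDestroy( start );
    cudaEventDestroy( stop );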
Exercise:
Write a program that initializes two M×N
matrices and computes the sum of the two matrices on the GPU
device. After copying the result back to the host, your program
should print out the result matrix if N≤10. You may
use add-vectors.cu
as a starting point or start from
scratch.
It is natural to use a 2D grid for a matrix. In this case the block_size and num_blocks variables should be of type dim3. The kernel launch code shown below accomplishes this:

    dim3 block_size( 16, 16 );
    dim3 num_blocks( ( n - 1 + block_size.x ) / block_size.x,
                     ( m - 1 + block_size.y ) / block_size.y );
    add_matrices<<< num_blocks, block_size >>>( c_d, a_d, b_d, m, n );

Note that each expression of the form ( n - 1 + block_size.x ) / block_size.x is integer arithmetic for "n divided by block_size.x, rounded up," so the grid always covers the whole matrix. Of course, the kernel code will need to work correctly with a 2D grid rather than the 1D grid used in add-vectors.cu.
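One possible form of such a kernel is sketched below. It assumes (our assumption, not a requirement) that each matrix is stored in row-major order in a single 1D array, and the signature is only a guess at what your program might use:

    // Sketch: each thread computes one element of the m-by-n sum c = a + b
    __global__ void add_matrices( float *c, float *a, float *b, int m, int n )
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;   // x covers the n columns
        int row = blockIdx.y * blockDim.y + threadIdx.y;   // y covers the m rows

        if ( row < m && col < n )                          // guard partial blocks
        {
            int idx = row * n + col;                       // row-major offset
            c[idx] = a[idx] + b[idx];
        }
    }

Note that the x dimension of the grid covers the columns and the y dimension the rows, matching the order of the arguments in the dim3 launch configuration above.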
Test your code with a range of values of M and N. For each case, run your program both without and with the profiler.
Please turn in a printout of your final matrix-addition source code along with a short report summarizing the profiling data.