Open the Canaan user guide in a new browser window (hold the shift key down while you click on the link). Also open a terminal window (Ctrl-Alt-T) and arrange your windows so you have this page and the other two windows all visible at the same time.
To facilitate logging in and transferring files to and from remote machines like Canaan, it is useful to create an SSH public/private key pair for your account on the workstations. After installing your public key on a remote machine you will be able to make secure connections authenticated by the key pair rather than a password. The instructions here presume you are working on one of our Minor Prophet workstations, but should also work from any other Linux machine or Mac running OS X.
Before beginning, it's worth checking to see if you already have a key pair. This is easy; just list your ~/.ssh directory with
ls -l ~/.ssh
and see if the files id_rsa and id_rsa.pub are present. If so, you already have a key pair. If not, type
ssh-keygen
and press Enter at each prompt to accept the offered defaults.
It's important that the ownership and permissions of the ~/.ssh directory and its contents are set correctly. You can do this with the commands
chown -R $(whoami) ~/.ssh
chmod 0700 ~/.ssh
chmod 0600 ~/.ssh/id_rsa
The first of these commands will generate an error if you are not the owner of your ~/.ssh directory and all its contents. If this happens, let the instructor know so the ownerships can be set correctly.
Create or edit the file ~/.ssh/config and make sure it includes the following lines (replacing firstname.lastname with your Gordon username):
Host canaan.gordon
    HostName canaan.phys.gordon.edu
    User firstname.lastname
    ServerAliveInterval 120
    ServerAliveCountMax 2
    GSSAPIAuthentication no
    ForwardX11 yes
The first line, starting with Host, not only starts the configuration block for this host, but also defines an alias for the host. You can now use ssh to connect to either canaan.phys.gordon.edu or canaan.gordon.
The User line sets the username to use when connecting to the host. This is necessary when your usernames differ across machines. For example, you can configure your accounts on your personal computer and the workstations to allow you to connect via ssh without having to type a password, but you will probably need to add lines like those above to ~/.ssh/config on your personal computer so your Gordon username is used when connecting to the workstations.
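As a sketch, adding a block like the following to ~/.ssh/config on your personal computer would take care of this; the host name shown is only a placeholder, so substitute the workstation you actually connect to:
Host workstation.cs.gordon.edu
    User firstname.lastname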
Now let's try logging in to Canaan. Type
ssh canaan.phys.gordon.edu
(You could use canaan.gordon if you set that in your ~/.ssh/config file.) You may be warned this is a new connection to an unknown machine; just say “yes” and keep going. Type in your ID number when prompted for your password. If all goes well you should then be logged in. Please change your password now to something other than your ID number; do this with the yppasswd command (please do not use passwd):
yppasswd
You'll be asked for your old password and then a new password twice. Be sure at the end of the process you are told your password was updated successfully.
It's very convenient to configure your account so you can log in from your account on the workstations without having to type your password. To do this we need to copy some files from the workstations to Canaan. Log out of Canaan to get back to your workstation account prompt.
Be sure you're working on one of the workstations and type the following commands to configure your account on Canaan to accept logins from your account on the workstations. You will be prompted for the password to your account on Canaan several times; use the password you chose on Canaan. The first of these commands will likely generate an error message since you probably already have a ~/.ssh directory on Canaan – you can just ignore it.
ssh canaan.phys.gordon.edu mkdir .ssh
ssh canaan.phys.gordon.edu chmod go-rwx .ssh
scp ~/.ssh/id_rsa.pub canaan.phys.gordon.edu:.ssh/authorized_keys
ssh canaan.phys.gordon.edu chmod go-rwx .ssh/authorized_keys
You should now be able to type
ssh canaan.phys.gordon.edu
and be immediately logged in without having to type your password.
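As an aside, many OpenSSH installations provide the ssh-copy-id utility, which performs essentially the same steps (it appends your public key to ~/.ssh/authorized_keys on the remote machine and fixes the permissions). If it is available on your workstation, the sequence above can usually be replaced with a single command:
ssh-copy-id canaan.phys.gordon.edu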
It is not possible to connect directly to Canaan from outside Gordon's network. To work on Canaan, then, you can (1) use SSH to connect to the workstations as you normally do, and then (2) use SSH to connect to Canaan. This works, but has several limitations, the most notable being that you're limited to working in a non-GUI terminal environment. You also have to remember to log off twice, once from Canaan and then again from the workstations.
It is possible to configure some Remote Desktop clients to connect directly to Canaan via an SSH tunnel; contact the professor if you want to know more. Another alternative is X2Go, whose setup is described next. You're welcome to try approaches other than those mentioned here (terminal/PuTTY ssh connections and X2Go), but it will be important for you to find a system that works for you.
X2Go provides a remote-desktop-like experience for connecting to Linux/Unix machines running an X server. One advantage it has over Remote Desktop is that it is optimized for WAN and slower LAN connections – which makes it quite useful when working remotely.
Browse to http://wiki.x2go.org/doku.php/doc:installation:x2goclient and download an X2Go client for your system. Clients are available for Windows, OS X, and many distributions of Linux.
Install the client according to instructions on the client download page.
Start the X2Go client, click on the "New Session" icon in the upper left corner, and enter the session information: at minimum the host (canaan.phys.gordon.edu) and your Gordon username.
Click "OK" to create the session launcher. The new session launcher will appear on the right side of the client window.
Select the desired window size from the pull-down menu inside the session launcher. The default is 800x600 but you may prefer to use a larger size such as 1280x1024. This will stay at whatever you set it to until changed. You may need to experiment to find a setting that works well for you.
Configured X2Go sessions will appear on the right side of the X2Go window.
Click on the session name to start a session. If you are prompted for your password, type it in (remember that it is case-sensitive). After entering it a new virtual desktop window should appear and you should be logged into Canaan.
If an authentication window appears asking you "Password for root:" you can click "cancel" to dismiss the window.
Much of the work on Canaan is done using commands typed into a terminal window. To open a terminal window, use Applications -> System Tools -> Terminal. (You can drag the terminal icon from the System Tools menu to the panel at the top of the desktop; this allows you to open a terminal window with a single click.)
To end your session, click on your name in the upper right of the virtual desktop, select Quit..., and then click on Log Out. Important: Please be sure to do this otherwise your session will remain active.
Let's get a copy of the class repository on Canaan. This is done in three familiar steps: (1) decide where you want it (I suggest creating the directory ~/cps343 and putting it there), (2) change to the destination location, and (3) use the git clone command. The following steps assume you use the suggested location:
mkdir ~/cps343
cd ~/cps343
git clone https://github.com/gordon-cs/cps343-hoe
Although this clones the current version of the repository, it doesn't include any of the files you've created on the minor prophet workstations. If you want to “mirror” your repository from there, you can use the rsync command. I suggest doing this in two steps: (1) check what will be copied, and (2) do the copy. In the steps below replace "sally.smith" with your username on the workstations. We also assume that your repository on the workstations is stored in ~/cps343/cps343-hoe; you will need to modify the commands below if it's stored at another location.
cd ~/cps343/cps343-hoe
rsync -nav sally.smith@files.cs.gordon.edu:cps343/cps343-hoe/ ./
Note: the trailing slashes are important! The -nav switch is actually three different switches: -n (this is a “dry run”; nothing is actually transferred), -a (archive mode – preserve file permissions and dates), and -v (be verbose; show what files are being copied). If the list of files that would be copied seems reasonable, you can reissue the same command but without the -n option:
rsync -av sally.smith@files.cs.gordon.edu:cps343/cps343-hoe/ ./
Now your repository files on Canaan should match those on the workstation cluster.
Okay, let's explore a little bit. Take a look at the Canaan cluster configuration description. You'll notice that the cluster has a head node (this is "Canaan"), a storage/administration node, 18 compute nodes, and two network switches.
The compute nodes are arranged into two partitions, one called phys with 16 compute nodes and the other called chem with only 2 compute nodes. The sixteen phys nodes each have 24 GB of RAM; most have 12 cores while three have only 8 cores. The chem partition, on the other hand, has only two nodes, but each has 16 cores and 64 GB of RAM. (That works out to 13 × 12 + 3 × 8 = 180 cores in phys and 2 × 16 = 32 cores in chem, or 212 in all.)
You've already learned a little about the Lmod Environment Modules and the SLURM resource manager back in the Cluster Computing with MPI hands-on exercise. Much of the same material is included in the Canaan user guide; please find and scan it quickly now.
Try using the module avail and module list commands to see what modules are available and loaded.
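For example (these are standard Lmod commands, though the exact output will depend on what's installed):
module avail   # list the modules available to load
module list    # show the modules currently loaded in your session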
Next, use sinfo to explore the available partitions and use squeue to see if any jobs are currently running. Notice that the output from sinfo shows that there is an additional partition called allNodes: this includes all 18 compute nodes, meaning all 212 compute cores can potentially be used on a single job.
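A minimal sketch of what to type (both commands accept many switches, but the defaults are enough for a first look):
sinfo                 # one line per partition and node state
squeue                # all jobs currently queued or running
squeue -u $(whoami)   # only your own jobs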
If you're using X2Go, try starting sview. You can leave this running to provide a visual snapshot of the cluster's job status.
Next we'll be using srun to run parallel programs, but as noted in our previous hands-on exercise, this command can be used to run sequential programs on the cluster nodes as well. Try the following and talk about the results of each command with someone near you.
srun --ntasks=4 hostname
srun -n 4 hostname
srun --nodes=4 hostname
srun -N 4 hostname
srun --ntasks=4 --ntasks-per-node=2 hostname
srun --ntasks=4 --ntasks-per-node=1 hostname
You are encouraged to read the srun manual page and try other switches.
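One switch worth trying right away is --label (short form -l), which prefixes each line of output with the number of the task that produced it, making it easy to see which task ran on which node:
srun --label --ntasks=4 hostname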
Okay, time to do something in parallel! We'll start by running the parallel Laplace solvers from last week's exercise. First, let's make sure the OpenMPI and Parallel HDF5 modules are loaded:
module load openmpi hdf5
Next, change into the cps343-hoe/06-mpi-cartgrid directory and type make to build the programs. As a quick check, run the cart program with
srun -n 4 ./cart
You should see the same output as you saw last week.
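If make or srun complains that it can't find a Makefile or the cart executable, the most likely cause is being in the wrong directory; assuming you cloned the repository into the suggested ~/cps343 location, the full sequence is
cd ~/cps343/cps343-hoe/06-mpi-cartgrid
make
srun -n 4 ./cart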
Note: We used salloc rather than srun on the minor prophets cluster because srun does not work properly there in certain situations. If you prefer, you can continue to use commands like
salloc -Q -n 4 mpiexec ./cart
on Canaan so everything works the same as on the minor prophets cluster. In the examples that follow I'll be using srun.
Last week when we were working on the minor prophets cluster we found that when we ran four processes (tasks) on a single node the Laplace MPI program ran much faster than when we placed the four processes on different nodes. Let's see if that's true on Canaan. Type
srun --ntasks=8 --exclusive ./laplace-mpi -n 100
srun --nodes=8 --exclusive ./laplace-mpi -n 100
(remember we have at least eight cpu cores on each node). You should see it takes more than twice as long to solve the problem - this is solely due to the interconnect communication time. However, it's exacerbated by the relatively small problem size. When we increase the grid size the times are still slower when working across nodes but the relative impact is less:
srun --ntasks=8 --exclusive ./laplace-mpi -n 200
srun --nodes=8 --exclusive ./laplace-mpi -n 200
If you want to go back to the minor prophets cluster and try this experiment (but remember you can use only 4 tasks on a node), you'll see that the relative slow-down due to network communication is much worse.
Now that we know that communication between nodes is much faster on Canaan than on the minor prophets cluster, let's see what impact using nonblocking communication can have. To do this we'll generate some data and then plot it. To start, use cut-and-paste to run the following two shell loop commands:
for ((n=1; n<=16; n++))
do
    echo $n $(srun --nodes=$n --exclusive ./laplace-mpi -n 200 | awk '{print $7}')
done | tee bb-200-nodes.dat
and
for ((n=1; n<=16; n++))
do
    echo $n $(srun --nodes=$n --exclusive ./laplace-mpi-nb -n 200 | awk '{print $7}')
done | tee nb-200-nodes.dat
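Each loop writes one line per run containing the node count and the solver's reported time (the awk '{print $7}' picks out the seventh field of the program's output, which is assumed here to be the timing value). Before plotting, a quick check that both files contain two numeric columns is worthwhile:
head bb-200-nodes.dat nb-200-nodes.dat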
To plot the data we've just created, start gnuplot and type the following commands at the gnuplot> prompt:
set xlabel "Nodes (tasks)"
set ylabel "Time (seconds)"
set title "Blocking vs Nonblocking communication in Laplace Solver"
set key top right
plot "bb-200-nodes.dat" with linespoints, "nb-200-nodes.dat" with linespoints
Notice that the times in bb-200-nodes.dat (blocking) are nearly always larger than the times in nb-200-nodes.dat (nonblocking).
To save your graph as a PNG image, type something like
set term png
set output "graph.png"
then reissue the plot command (you can use the up-arrow to get back previous commands). To get back to plotting on the screen use
set term x11
set output
To quit gnuplot, type exit at the prompt.
You can view the PNG file from the command line using the display program:
display graph.png
Change into the cps343-hoe/07-parallel-sorting directory of the class repository. Take some time to examine the source code of psrs_qsort_timing.cc, which you'll find there. In particular, notice that the psrsSort() function carries out the following steps:
1. Each process sorts its local sublist and selects regular samples, then uses MPI_Gather() to collect them all on the master process.
2. The master process chooses pivots from the gathered samples and uses MPI_Bcast() to distribute them to all processes.
3. Each process partitions its sublist using the pivots, then uses MPI_Allgather() to collect all sublist length information, determine the length of its output list, and allocate memory for it.
4. Each process uses MPI_Sendrecv() to exchange list data with the other processes.
Compile the program with
mpic++ -O2 -o psrs_qsort_timing psrs_qsort_timing.cc
(don't use smake for this). To run the program you will need to supply a single positive integer argument that is the total length of the list. Start small (less than 100) and gradually increase the list size.
srun -n 8 ./psrs_qsort_timing 10
srun -n 8 ./psrs_qsort_timing 10000
srun -n 8 ./psrs_qsort_timing 100000000
When the list is short you'll see the final sorted list displayed. In each case you'll see the overall time required by the psrsSort() function.
Now recompile the program using smake or by adding -DSHOW_TIMING_DATA to the compiler command line. Try
srun -n 8 ./psrs_qsort_timing 100000000
and observe the timing data that is displayed. You should see that most of the time is spent doing the serial quicksorts. The final column displays the ratio of (communication time) / (sort time + communication time).
Run the program with even larger list sizes and request more cpu cores. Notice how the timing data, especially the ratio in the right column, changes. Some examples might be
srun -n 20 ./psrs_qsort_timing 400000000
srun -n 60 ./psrs_qsort_timing 400000000
srun -p chem -n 32 ./psrs_qsort_timing 400000000
where the last example uses the two 16-core computers in the chem partition. Play about a bit!
Submit an image or PDF version of your labeled graph showing the timing comparison between the blocking and nonblocking Laplace solvers. To do this, log into Blackboard from Canaan and submit your image directly from there. Alternatively, transfer the image file to the workstations and then to your computer using scp, sftp, or some other transfer mechanism.
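For example, assuming the plot was saved as graph.png in your home directory on Canaan, a command like the following typed in a terminal on one of the workstations would copy it into your current directory there:
scp canaan.phys.gordon.edu:graph.png .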