Overview

Caffe is typically run on a single GPU node, using 1 or 2 GPUs. Caffe is under active development, and the version available to all HPC users was built on 2017-09-05. Compiling your own Caffe can be challenging (or frustrating), so contact Mike Renfro if you think you need a later version than the one already installed.

Refer to the Slurm Quick Start User Guide for more information on Slurm scripts.


Single-computer, GPU-enabled Caffe Job (Interactive mode)

Start by reserving 1 CPU core (the default) and 1 GPU on an available GPU node:

[renfro@login ~]$ hpcshell --gres=gpu:1
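
Once the shell starts on the GPU node, you can confirm that a GPU was actually allocated (this assumes the NVIDIA nvidia-smi utility is on the node's default path, which is typical for GPU nodes):

[renfro@gpunode001 ~]$ nvidia-smi --query-gpu=name --format=csv,noheader
Tesla K80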

Get some source data for Caffe to train and test against, using the Training LeNet on MNIST with Caffe documentation as a guide, and return to the original working directory:

[renfro@gpunode001 ~]$ DATA="${PWD}/data/mnist"
[renfro@gpunode001 ~]$ mkdir -p ${DATA}
[renfro@gpunode001 ~]$ cd ${DATA}
[renfro@gpunode001 mnist]$ MNIST_URL=http://yann.lecun.com/exdb/mnist
[renfro@gpunode001 mnist]$ wget ${MNIST_URL}/train-images-idx3-ubyte.gz
[renfro@gpunode001 mnist]$ wget ${MNIST_URL}/train-labels-idx1-ubyte.gz
[renfro@gpunode001 mnist]$ wget ${MNIST_URL}/t10k-images-idx3-ubyte.gz
[renfro@gpunode001 mnist]$ wget ${MNIST_URL}/t10k-labels-idx1-ubyte.gz
[renfro@gpunode001 mnist]$ gunzip train-images-idx3-ubyte.gz
[renfro@gpunode001 mnist]$ gunzip train-labels-idx1-ubyte.gz
[renfro@gpunode001 mnist]$ gunzip t10k-images-idx3-ubyte.gz
[renfro@gpunode001 mnist]$ gunzip t10k-labels-idx1-ubyte.gz
[renfro@gpunode001 mnist]$ cd ../..
[renfro@gpunode001 ~]$
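
If you want to verify the downloads, the four uncompressed files have fixed sizes: 47,040,016 and 7,840,016 bytes for the training and test image files (16-byte IDX header plus one byte per pixel), and 60,008 and 10,008 bytes for the label files (8-byte header plus one byte per label):

[renfro@gpunode001 ~]$ ls -l ${DATA}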

Since the LMDB storage used by Caffe is incompatible with remote file storage (see the LMDB documentation, section Caveats), make a working directory for the LMDB databases. This example keeps it under the current directory; if that directory is on network storage and LMDB gives errors, see the node-local alternative sketched below:

[renfro@gpunode001 ~]$ EXAMPLE="${PWD}/examples/mnist"
[renfro@gpunode001 ~]$ mkdir -p ${EXAMPLE}
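
If the LMDB creation step fails on network storage, one workaround (a sketch, assuming a writable node-local /tmp) is to build the databases under a node-local temporary directory instead:

[renfro@gpunode001 ~]$ TDIR=$(mktemp -d /tmp/${USER}-caffe-XXXXXX)
[renfro@gpunode001 ~]$ EXAMPLE="${TDIR}/examples/mnist"
[renfro@gpunode001 ~]$ mkdir -p ${EXAMPLE}

If you do this, run the later caffe commands from ${TDIR} (the downloaded .prototxt files reference examples/mnist/... relative to the current directory), and copy anything you want to keep off the node-local disk before releasing the node.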

Load the Caffe module:

[renfro@gpunode001 ~]$ module load caffe cuda80/toolkit
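
With the module loaded, the caffe command-line tool should be on your path. You can confirm which build you have (caffe --version is supported by the Caffe 1.x command-line tool; very old builds may not print a version):

[renfro@gpunode001 ~]$ which caffe
[renfro@gpunode001 ~]$ caffe --version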

Convert the source data into LMDB format, and store it in the working directory:

[renfro@gpunode001 ~]$ convert_mnist_data ${DATA}/train-images-idx3-ubyte ${DATA}/train-labels-idx1-ubyte ${EXAMPLE}/mnist_train_lmdb --backend=lmdb
[renfro@gpunode001 ~]$ convert_mnist_data ${DATA}/t10k-images-idx3-ubyte ${DATA}/t10k-labels-idx1-ubyte ${EXAMPLE}/mnist_test_lmdb --backend=lmdb

You should see output similar to the following:

Creating lmdb...
I1009 10:58:08.194444 81274 db_lmdb.cpp:35] Opened lmdb examples/mnist/mnist_train_lmdb
I1009 10:58:08.194737 81274 convert_mnist_data.cpp:88] A total of 60000 items.
I1009 10:58:08.194749 81274 convert_mnist_data.cpp:89] Rows: 28 Cols: 28
I1009 10:58:10.110952 81274 convert_mnist_data.cpp:108] Processed 60000 files.
I1009 10:58:10.748302 81282 db_lmdb.cpp:35] Opened lmdb examples/mnist/mnist_test_lmdb
I1009 10:58:10.748600 81282 convert_mnist_data.cpp:88] A total of 10000 items.
I1009 10:58:10.748613 81282 convert_mnist_data.cpp:89] Rows: 28 Cols: 28
I1009 10:58:11.037324 81282 convert_mnist_data.cpp:108] Processed 10000 files.
Done.
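
Each LMDB database is a directory containing a data.mdb file (the database itself) and a lock.mdb file; a quick listing confirms both databases were created:

[renfro@gpunode001 ~]$ ls ${EXAMPLE}/mnist_train_lmdb ${EXAMPLE}/mnist_test_lmdb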

Download the .prototxt sources for the Caffe model into the working directory:

[renfro@gpunode001 ~]$ PROTOTXT_URL=https://raw.githubusercontent.com/BVLC/caffe/master/examples/mnist
[renfro@gpunode001 ~]$ ( cd ${EXAMPLE} && wget ${PROTOTXT_URL}/lenet_solver.prototxt )
[renfro@gpunode001 ~]$ ( cd ${EXAMPLE} && wget ${PROTOTXT_URL}/lenet_train_test.prototxt )
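
These files reference each other and the LMDB databases with paths like examples/mnist/... that are resolved relative to the directory you run caffe from, which is why this walkthrough keeps everything under ${PWD} and runs from the home directory. You can check the relevant solver settings with grep (the lines shown come from the BVLC repository copies and may change upstream):

[renfro@gpunode001 ~]$ grep -E '^(net|snapshot_prefix|solver_mode):' ${EXAMPLE}/lenet_solver.prototxt
net: "examples/mnist/lenet_train_test.prototxt"
snapshot_prefix: "examples/mnist/lenet"
solver_mode: GPU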

Finally, we're ready to train the Caffe model. Run:

[renfro@gpunode001 ~]$ caffe train --solver=${EXAMPLE}/lenet_solver.prototxt

Caffe will print a lot of output to the screen. An abridged version is shown below:

Sample Caffe output
I1009 11:16:47.397012 83763 caffe.cpp:218] Using GPUs 0
I1009 11:16:47.470235 83763 caffe.cpp:223] GPU 0: Tesla K80
I1009 11:16:48.559469 83763 solver.cpp:44] Initializing solver from parameters:
test_iter: 100
test_interval: 500
...
I1009 11:17:58.769764 83763 solver.cpp:310] Iteration 10000, loss = 0.00413693
I1009 11:17:58.769783 83763 solver.cpp:330] Iteration 10000, Testing net (#0)
I1009 11:17:58.902169 83834 data_layer.cpp:73] Restarting data prefetching from start.
I1009 11:17:58.905859 83763 solver.cpp:397]     Test net output #0: accuracy = 0.9904
I1009 11:17:58.905886 83763 solver.cpp:397]     Test net output #1: loss = 0.0288629 (* 1 = 0.0288629 loss)
I1009 11:17:58.905903 83763 solver.cpp:315] Optimization Done.
I1009 11:17:58.905910 83763 caffe.cpp:259] Optimization Done.

The final result of the Caffe training is a set of .caffemodel and .solverstate files in the examples/mnist folder. The .caffemodel files contain the weights of each layer of the Caffe model, and the .solverstate files are checkpoint files useful for continuing training:

[renfro@gpunode001 ~]$ ls -lt ${EXAMPLE}
total 6760
-rw-r--r-- 1 renfro domain users 1724471 Oct  9 11:17 lenet_iter_10000.solverstate
-rw-r--r-- 1 renfro domain users 1725006 Oct  9 11:17 lenet_iter_10000.caffemodel
-rw-r--r-- 1 renfro domain users 1724470 Oct  9 11:17 lenet_iter_5000.solverstate
-rw-r--r-- 1 renfro domain users 1725006 Oct  9 11:17 lenet_iter_5000.caffemodel
-rw-r--r-- 1 renfro domain users    2282 Oct  9 11:11 lenet_train_test.prototxt
-rw-r--r-- 1 renfro domain users     791 Oct  9 11:11 lenet_solver.prototxt
drwxr--r-- 2 renfro domain users      36 Oct  9 10:58 mnist_test_lmdb
drwxr--r-- 2 renfro domain users      36 Oct  9 10:58 mnist_train_lmdb
[renfro@gpunode001 ~]$
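
If a training run is interrupted, or you want to continue training past max_iter, the .solverstate checkpoints let you resume instead of restarting; for example, to resume from the 5000-iteration snapshot produced above (the --snapshot flag is part of the standard caffe tool):

[renfro@gpunode001 ~]$ caffe train --solver=${EXAMPLE}/lenet_solver.prototxt --snapshot=${EXAMPLE}/lenet_iter_5000.solverstate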

Exit from the GPU node:

[renfro@gpunode001 ~]$ exit
exit
[renfro@login ~]$

Single-computer, GPU-enabled Caffe Job (Batch mode)

Make a Slurm job script named caffe.sh, based on the interactive run above and the Training LeNet on MNIST with Caffe documentation:

caffe.sh
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=00:05:00
 
module load caffe cuda80/toolkit
 
# make directories
DATA="${PWD}/data/mnist"
EXAMPLE="${PWD}/examples/mnist" # use a node-local path here if LMDB misbehaves on network storage
mkdir -p "${DATA}" "${EXAMPLE}"
 
# get source data
MNIST_URL=http://yann.lecun.com/exdb/mnist
for fname in train-images-idx3-ubyte train-labels-idx1-ubyte t10k-images-idx3-ubyte t10k-labels-idx1-ubyte
do
    if [ ! -e "${DATA}/$fname" ]; then
        ( cd "${DATA}" && wget ${MNIST_URL}/${fname}.gz && gunzip ${fname}.gz )
    fi
done
 
# create LMDB
BACKEND="lmdb"
convert_mnist_data "${DATA}/train-images-idx3-ubyte" \
      "${DATA}/train-labels-idx1-ubyte" "${EXAMPLE}/mnist_train_${BACKEND}" --backend=${BACKEND}
convert_mnist_data "${DATA}/t10k-images-idx3-ubyte" \
      "${DATA}/t10k-labels-idx1-ubyte" "${EXAMPLE}/mnist_test_${BACKEND}" --backend=${BACKEND}
 
# get prototxt models
PROTOTXT_URL=https://raw.githubusercontent.com/BVLC/caffe/master/examples/mnist
for fname in lenet_solver.prototxt lenet_train_test.prototxt
do
    if [ ! -e "${EXAMPLE}/${fname}" ]; then
        ( cd "${EXAMPLE}" && wget ${PROTOTXT_URL}/${fname} )
    fi
done
 
# train model
caffe train --solver=${EXAMPLE}/lenet_solver.prototxt

Submit the job from the login node with the command sbatch caffe.sh. When the job completes, you should have the lenet_iter_10000.solverstate and lenet_iter_10000.caffemodel files (along with the intermediate lenet_iter_5000 snapshots) in the examples/mnist directory.
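
While the job runs, you can check its status from the login node and follow Caffe's log output in the job's Slurm output file (slurm-<jobid>.out in the submission directory by default, where <jobid> is the number printed by sbatch):

[renfro@login ~]$ squeue -u $USER
[renfro@login ~]$ tail -f slurm-<jobid>.out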
