Transferring files

Overview

Teaching: 30 min
Exercises: 10 min
Questions
  • How do I upload/download files to the cluster?

Objectives
  • Be able to transfer files to and from a computing cluster.

A remote computer is of limited use if we cannot get files to or from it. There are several options for transferring data between computing resources.

Download files from the internet using wget

One of the most straightforward ways to download files is to use wget. Any file that can be downloaded in your web browser with an accessible link can be downloaded using wget. This is a quick way to download datasets or source code.

The syntax is: wget https://some/link/to/a/file.tar.gz. For example, download the lesson sample files using the following command:

[nsid@platolgn01 ~]$ wget https://ofisette.github.io/hpc-intro-plato/files/bash-lesson.tar.gz

Transferring single files and folders with scp

To copy a single file to or from the cluster, we can use scp (“secure copy”). The syntax can be a little complex for new users, but we’ll break it down.

To transfer to another computer:

[user@laptop ~]$ scp path/to/local/file.txt nsid@plato.usask.ca:path/on/Plato

Transfer a file

Create a “calling card” with your name and email address, then transfer it to your home directory on Plato.

Solution

Create a file like this, with your name (or an alias) and email address:

[user@laptop ~]$ cat calling-card.txt
Your Name
Your.Address@institution.tld

Now, transfer it to Plato:

[user@laptop ~]$ scp calling-card.txt nsid@plato.usask.ca:/globalhome/nsid/HPC/
calling-card.txt                                                 100%   37     7.6 KB/s   00:00

We can often simplify the path given to the command. On the remote computer, a path after the : that does not start with / is relative to our home directory. A : with nothing after it puts the file directly in your home directory.

[user@laptop ~]$ scp local-file.txt nsid@plato.usask.ca:

To recursively copy a directory, we just add the -r (recursive) flag:

[user@laptop ~]$ scp -r some-local-folder/ nsid@plato.usask.ca:

This will create the directory some-local-folder on the remote system, and recursively copy all the content from the local to the remote system. Existing files on the remote system will not be modified, unless there are files from the local system with the same name, in which case the remote files will be overwritten.

The trailing slashes in the directory names are optional, and have no effect for scp -r, but they are important in other commands, like rsync.

To download from another computer:

[user@laptop ~]$ scp nsid@plato.usask.ca:path/on/Plato/file.txt path/to/local/

A note on rsync

As you gain experience with transferring files, you may find the scp command limiting. The rsync utility provides advanced features for file transfer and is often faster than both scp and sftp, particularly when only part of the data has changed since the last transfer. It is especially useful for transferring large and/or many files and creating synced backup folders.

The syntax is similar to scp. To transfer to another computer with commonly used options:

[user@laptop ~]$ rsync -rvzP path/to/local/file.txt nsid@plato.usask.ca:directory/path/on/Plato/

The r (recursive) option copies files and directories recursively; the v (verbose) option gives verbose output to help monitor the transfer; the z (compression) option compresses the file during transit to reduce size and transfer time; and the P (partial/progress) option preserves partially transferred files in case of an interruption and also displays the progress of the transfer.

To recursively copy a directory, we can use the same options:

[user@laptop ~]$ rsync -rvzP path/to/local/dir nsid@plato.usask.ca:directory/path/on/Plato/

As written, this will place the local directory and its contents under the specified directory on the remote system. If a trailing slash is added to the source (path/to/local/dir/), a new directory corresponding to the transferred directory (‘dir’ in the example) will not be created, and the contents of the source directory will be copied directly into the destination directory.
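The effect of a trailing slash on the source path can be checked locally before any remote transfer. The sketch below is only an illustration: it assumes rsync is installed, and the scratch directory and the names with_dir and contents_only are hypothetical.

```shell
# Hypothetical scratch layout, created only for this demonstration.
demo=$(mktemp -d)
mkdir -p "$demo/dir" "$demo/with_dir" "$demo/contents_only"
echo "hello" > "$demo/dir/file.txt"

# No trailing slash on the source: the directory itself is copied.
rsync -r "$demo/dir" "$demo/with_dir/"

# Trailing slash on the source: only the directory's contents are copied.
rsync -r "$demo/dir/" "$demo/contents_only/"

ls "$demo/with_dir"        # lists: dir
ls "$demo/contents_only"   # lists: file.txt
```

The same rule applies when the destination is on a remote system such as Plato.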

To download a file, we simply change the source and destination:

[user@laptop ~]$ rsync -rvzP nsid@plato.usask.ca:path/on/Plato/file.txt path/to/local/

Archiving files

One of the biggest challenges we often face when transferring data between remote HPC systems is that of large numbers of files. There is an overhead to transferring each individual file and when we are transferring large numbers of files these overheads combine to slow down our transfers to a large degree.

The solution to this problem is to archive multiple files into smaller numbers of larger files before we transfer the data to improve our transfer efficiency. Sometimes we will combine archiving with compression to reduce the amount of data we have to transfer and so speed up the transfer.

The most common archiving command you will use on a (Linux) HPC cluster is tar. tar can be used to combine files into a single archive file and, optionally, compress. For example, to collect all files contained inside output_data into an archive file called output_data.tar we would use:

[user@laptop ~]$ tar -cvf output_data.tar output_data/

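The options we used for tar are:

  • -c: create a new archive.
  • -v: verbose; print the name of each file as it is processed.
  • -f: the next argument is the name of the archive file.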

The tar command allows users to concatenate flags. Instead of typing tar -c -v -f, we can use tar -cvf.

The tar command can also be used to interrogate and unpack archive files. The -t argument (“table of contents”) lists the contents of the referred-to file without unpacking it.
The -x (“extract”) flag unpacks the referred-to file. To unpack the file after we have transferred it:

[user@laptop ~]$ tar -xvf output_data.tar

This will put the data into a directory called output_data. Be careful: if this directory already exists, files in it with the same names will be overwritten!
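One way to reduce the risk of overwriting existing data is to list the archive first and then extract it into a separate directory with the -C option. The following is a minimal local sketch; the scratch directory, the file names run1.txt and run2.txt, and the directory name restored are made up for the example.

```shell
# Build a small, hypothetical output_data directory in a scratch location.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir output_data
echo "result 1" > output_data/run1.txt
echo "result 2" > output_data/run2.txt

# Create the archive, then list its contents without unpacking.
tar -cf output_data.tar output_data/
tar -tf output_data.tar

# Extract into a separate directory so nothing existing is overwritten.
mkdir restored
tar -xf output_data.tar -C restored

# Compare the original and the restored copy.
diff -r output_data restored/output_data && echo "contents match"
```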

Sometimes you may also want to compress the archive to save space and speed up the transfer. However, you should be aware that for large amounts of data, compressing and un-compressing can take longer than transferring the un-compressed data, so you may not want to compress. To create a compressed archive using tar we add the -z option and add the .gz extension to the file to indicate it is gzip-compressed, e.g.:

[user@laptop ~]$ tar -czvf output_data.tar.gz output_data/

The tar command is used to extract the files from the archive in exactly the same way as for uncompressed data. The tar command recognizes that the data is compressed, and automatically selects the correct decompression algorithm at the time of extraction:

[user@laptop ~]$ tar -xvf output_data.tar.gz
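Whether compression is worthwhile depends on the data, so it can help to compare archive sizes on a small sample first. The sketch below uses made-up, highly repetitive text, which compresses far better than most real output; the file names plain.tar, packed.tar.gz, and log.txt are hypothetical.

```shell
# Generate a hypothetical, very repetitive log file in a scratch directory.
work2=$(mktemp -d)
mkdir "$work2/output_data"
i=0
while [ "$i" -lt 2000 ]; do
  echo "step $i energy 0.000000"
  i=$((i + 1))
done > "$work2/output_data/log.txt"

# Archive the same directory without and with gzip compression.
tar -cf  "$work2/plain.tar"     -C "$work2" output_data
tar -czf "$work2/packed.tar.gz" -C "$work2" output_data

# Compare the sizes in bytes.
plain_size=$(wc -c < "$work2/plain.tar" | tr -d ' ')
packed_size=$(wc -c < "$work2/packed.tar.gz" | tr -d ' ')
echo "uncompressed: $plain_size bytes, gzip: $packed_size bytes"
```

Data that is already compressed (many image and video formats, for example) will usually shrink very little.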

Transferring files

Using one of the above methods, try transferring files to and from the cluster. Which method do you like the best?

Working with Windows

Transferring files from a Windows system to a Unix system (macOS, Linux, BSD, Solaris, etc.) can cause problems. Windows encodes line endings in text files slightly differently than Unix, adding an extra character to every line.

On a Unix system, every line in a file ends with a \n (newline). On Windows, every line in a file ends with a \r\n (carriage return + newline). This causes problems sometimes.

Though most modern programming languages and software handle this correctly, in some rare instances you may run into an issue. The solution is to convert a file from Windows to Unix encoding with the dos2unix command.

You can identify if a file has Windows line endings with cat -A filename. A file with Windows line endings will have ^M$ at the end of every line. A file with Unix line endings will have $ at the end of a line.

To convert the file, just run dos2unix filename. (Conversely, to convert back to Windows format, you can run unix2dos filename.)
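If dos2unix is not installed, the carriage returns can also be counted and stripped with standard tools. This is a substitute for dos2unix, not the same thing: tr removes every carriage return in the file, not only those at line ends. The temporary file names below are hypothetical.

```shell
# Create a small sample file with Windows (CRLF) line endings.
tmpfile=$(mktemp)
printf 'first line\r\nsecond line\r\n' > "$tmpfile"

# Count the lines that contain a carriage return.
cr=$(printf '\r')
grep -c "$cr" "$tmpfile"                   # prints 2

# Strip all carriage returns and write a Unix-style copy.
tr -d '\r' < "$tmpfile" > "$tmpfile.unix"

# The converted copy has no carriage returns left.
grep -c "$cr" "$tmpfile.unix" || true      # prints 0
```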

A note on ports

File transfers with scp and rsync use encrypted communication over port 22 by default. This is the same connection method used by SSH. In fact, all file transfers using these two commands occur through an SSH connection. If you can connect via SSH over the normal port, you will be able to transfer files. Downloads with wget are different: they use HTTP or HTTPS rather than SSH.

Key Points

  • wget downloads a file from the internet.

  • scp transfers files to and from your computer.