Submitting jobs to a cluster from your local machine
by Guido España
April 21, 2019
Motivation
Imagine you need to run a large number of simulations on a remote cluster, but the workflow depends on many configuration files, on libraries that can only be installed on your local machine, or on user permissions, or you simply want to keep your local folder structure for your experiments. This is probably a very narrow scenario, but a possibility nonetheless. I ran into a similar situation a few weeks ago and thought it would be worthwhile to document my workaround.
SSH with Paramiko in Python
Paramiko is a Python package that lets you create secure connections to remote machines using the SSHv2 protocol. This is useful because you can write scripts that automatically connect to a remote machine, run code there, and send files to it or copy them back to your local machine.
In the following example, I will define a class that handles SSH connections to a remote machine using paramiko.
Class definition
The name of our class will be ssh_cluster_connection, and it will contain all the information and routines necessary to create SSH connections and run commands using that connection.
Initialization
A basic initialization (below) of the ssh_cluster_connection class sets the host name, username, and password. The remote_dir argument specifies where on the remote machine we want to send or copy files, or execute commands. In this example, I’ll just focus on copying files to the remote machine and executing commands. The self.client = paramiko.SSHClient() line creates the paramiko client object that will manage the SSH session.
import paramiko

class ssh_cluster_connection():
    def __init__(self, hostname, username, password, remote_dir):
        # client object that will manage the SSH session
        self.client = paramiko.SSHClient()
        self.hostname = hostname
        self.username = username
        self.password = password
        self.dir = remote_dir
Start connection
The following function uses the information already stored in the class to actually connect to the cluster. We first load the system host keys and set a policy for unknown host keys, then we connect using client.connect(hostname, username, password). It’s as simple as that.
    def start_connection(self):
        print('starting connection to server')
        try:
            # trust the keys in ~/.ssh/known_hosts and warn on unknown hosts
            self.client.load_system_host_keys()
            self.client.set_missing_host_key_policy(paramiko.WarningPolicy())
            self.client.connect(
                hostname=self.hostname,
                username=self.username,
                password=self.password)
        except Exception as e:
            print('cannot start connection:', e)
Sending files
With paramiko, we can use the file transfer protocol sftp to put the files we want on our server. This isn’t new information for anyone who has used scp to send files to their institution’s cluster; it’s pretty simple. Still, if we want to create local functions that interact with the cluster, this step is necessary.
    def put_files_in_server(self, local_file_path, remote_path):
        # open an SFTP session, copy the file, and close the session
        sftp = self.client.open_sftp()
        sftp.put(local_file_path, self.dir + '/' + remote_path)
        sftp.close()
Executing commands in the cluster
Let’s say you have a script called submit_cluster_jobs.sh in your directory of choice. As its name indicates, this script submits a set of jobs to the cluster. Executing it is also straightforward with paramiko: we can use inn, out, err = self.client.exec_command() for this task. We then read an exit status with out.channel.recv_exit_status() to know whether the command ran successfully; this call also makes our code wait for the remote command to finish before moving on. I’m oversimplifying things here, but this approach lets us do a couple of interesting things. For instance, we could write the submit_cluster_jobs.sh script on our local machine, send it to the cluster, and then run it there. Imagine we are writing a calibration routine that runs 1,000 simulations with different parameters, and each of the 1,000 simulations is submitted to the cluster individually. We could create the calibration routine on our local machine, submit the jobs to the cluster, periodically check whether those jobs have finished (probably using qstat), and then send another set of simulations to move to the next step of the calibration. I’m not elaborating on those routines here, but a minimal sketch of such a polling loop appears after the run_command code below.
    def run_command(self):
        cmd = 'cd ' + self.dir + '; ./submit_cluster_jobs.sh'
        try:
            inn, out, err = self.client.exec_command(cmd)
            # block until the remote command finishes
            exit_status = out.channel.recv_exit_status()
            if exit_status == 0:
                print('success')
            else:
                print(err.read())
        except Exception as e:
            print('cannot run the command on the remote machine:', e)
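As a rough illustration of that idea, the hypothetical method below polls the scheduler until none of our jobs remain in the queue. The method name, the polling interval, and the assumption that qstat -u <user> prints nothing once the queue is empty (as on PBS/SGE-style schedulers) are mine, not part of the original class; it also needs import time next to import paramiko.

    def wait_for_jobs(self, poll_seconds=60):
        # hypothetical sketch: poll 'qstat -u <user>' until it prints
        # nothing, i.e. none of our jobs remain in the queue
        while True:
            inn, out, err = self.client.exec_command('qstat -u ' + self.username)
            if not out.read().strip():
                print('all jobs finished')
                return
            time.sleep(poll_seconds)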
getpass
One of the concerns I had with this approach was that I would have to store my password in the script. This issue is solved by the getpass module (import getpass), which lets you prompt once for your username and password without exposing the password on screen or in your terminal history.
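For example, these two lines prompt interactively; getpass.getpass() reads the password without echoing it:

import getpass

username = input("Insert username: ")
password = getpass.getpass()  # typed password is not echoed to the screen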
Finally, we need to close our connection with the client’s close() method, so we can add a small close_connection() function to our ssh_cluster_connection class.
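A minimal version of that method, matching the close_connection() call used in the wrap-up below, could look like this:

    def close_connection(self):
        # close the underlying SSH session
        self.client.close()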
Wrapping up
See below an example of how to send the submit_cluster_jobs.sh script and run it on the cluster. Note that the file has to be sent before the command runs, and the script must be executable on the remote side (e.g., via chmod +x). This is a basic example, but it can be expanded into more sophisticated routines that coordinate the cluster and your local machine. Hope this helps.
# assumes import getpass and the ssh_cluster_connection class defined above
username = input("Insert username: ")
password = getpass.getpass()
hostname = 'yourcluster.edu'
simsdir = "my_sims_directory"

client_cluster = ssh_cluster_connection(hostname=hostname,
                                        username=username,
                                        password=password,
                                        remote_dir=simsdir)
client_cluster.start_connection()
client_cluster.put_files_in_server('submit_cluster_jobs.sh', 'submit_cluster_jobs.sh')
client_cluster.run_command()
client_cluster.close_connection()