Frequently Asked Questions¶
How do I get GPU access?¶
We have L40S and H200 GPUs available on Engaging through the mit_normal_gpu partition. A variety of other GPU types is also available through the mit_preemptable partition. Take a look at our page on requesting resources to see how to request them for your job.
If your lab would like to purchase GPUs to be hosted on Engaging, please contact orcd-help-engaging@mit.edu.
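As a rough illustration of a GPU request on the partition named above, a batch script might begin like the sketch below. The `--gres=gpu:1` form asks Slurm for one GPU of any type; the one-hour limit and the `nvidia-smi` check are illustrative choices, not taken from the requesting resources page, which remains the authoritative reference.

```shell
#!/bin/bash
#SBATCH -p mit_normal_gpu      # partition named in this FAQ
#SBATCH --gres=gpu:1           # one GPU of any type (assumption: no type pinned)
#SBATCH -t 01:00:00            # illustrative one-hour time limit

# Report the GPU that Slurm assigned, if the NVIDIA tools are on this node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi
else
    echo "nvidia-smi not found; is this a GPU node?"
fi
```

Slurm also accepts a typed request of the form `--gres=gpu:<type>:1`; the type strings are site-specific, so check the requesting resources page for the exact names.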
How do I check the status of my job?¶
Instructions for checking job status can be found here.
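For quick reference, the sketch below uses standard Slurm client commands; it is not a substitute for the linked instructions. The `<jobid>` in the commented line is a placeholder, and the guard only keeps the snippet harmless on machines without Slurm.

```shell
# Run on a login node; the guard keeps the snippet harmless elsewhere.
me="${USER:-$(id -un)}"
if command -v squeue >/dev/null 2>&1; then
    # Pending and running jobs that belong to your user.
    squeue -u "$me"
    # For a finished job, sacct reports its final state (<jobid> is a placeholder):
    # sacct -j <jobid> --format=JobID,JobName,State,Elapsed
else
    echo "Slurm client tools not found on this machine"
fi
```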
How can I submit a module request?¶
We are open to creating new modules for the Engaging cluster. You can submit all module requests to orcd-help-engaging@mit.edu.
I am unable to install a package in R. How can I debug the issue?¶
We recommend using Conda to manage R packages. Please refer to the R user guide.
Can I use export-controlled software on the cluster?¶
Export-controlled software has specific requirements around who is allowed to access the software. Often, Engaging does not meet these requirements, so we generally do not allow such software to be used on our system. Please refer to the terms of use of the software and direct any questions to orcd-help@mit.edu.
How do I increase the time limit for my job?¶
Use the -t flag in your job script. If you do not specify a time limit, Slurm will give
you the maximum time limit for that partition. You can check the maximum time
limit by running sinfo -p <partition name>.
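A minimal sketch of a job script that sets an explicit limit is shown below. The six-hour value is only an example and must not exceed the partition's maximum. The `sinfo` line in the comment uses the standard `%P` (partition) and `%l` (time limit) output fields.

```shell
#!/bin/bash
#SBATCH -p mit_normal
#SBATCH -t 0-06:00:00    # days-hours:minutes:seconds; six hours is illustrative

# To see the partition's maximum from a login node:
#   sinfo -p mit_normal -o "%P %l"

echo "Job started on $(uname -n) at $(date)"
```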
For public partitions on Engaging, such as mit_normal, we cannot increase the
maximum job time limit, as these resources are shared. For jobs that
need to run longer than the time limit, we encourage
checkpointing, which is a way of periodically saving progress so that subsequent
jobs can pick up where previous jobs left off. The implementation of checkpointing
is domain-specific and can vary greatly. You can find more information on
checkpointing here.
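Because checkpointing is domain-specific, the following is only a toy sketch of the idea: a counter loop stands in for real work, and the file name `checkpoint.txt` and the variable `total_steps` are hypothetical. Each run resumes from the last saved step, so resubmitting the same script continues where the previous job stopped.

```shell
#!/bin/bash
# Minimal checkpoint/restart sketch: a counter loop standing in for real work.
# checkpoint.txt and total_steps are hypothetical names for this example.
checkpoint_file="checkpoint.txt"
total_steps=10

# Resume from the last saved step if a checkpoint exists, otherwise start at 0.
if [ -f "$checkpoint_file" ]; then
    step=$(cat "$checkpoint_file")
else
    step=0
fi

while [ "$step" -lt "$total_steps" ]; do
    step=$((step + 1))
    # ... one unit of real work would go here ...
    # Write to a temporary file and rename it, so a job killed mid-write
    # never leaves a half-written checkpoint behind.
    echo "$step" > "$checkpoint_file.tmp"
    mv "$checkpoint_file.tmp" "$checkpoint_file"
done

echo "Finished at step $step"
```

Real applications usually save model weights, simulation state, or similar data rather than a counter, often through the checkpoint support built into the framework in use.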
For increasing the maximum time limit on partitions owned by other groups, please email orcd-help-engaging@mit.edu.
How do I get an account?¶
If you have an MIT Kerberos account, then you can get an account on Engaging. To register, navigate to the Engaging OnDemand Portal and log in. Your account will automatically be created. Please wait a few minutes before trying to start any jobs or interactive sessions.
How do I install a Python package?¶
See our documentation on Python.
Why won't my application run on a different partition?¶
On Engaging, the older nodes (such as the sched_mit_hill and newnodes
partitions) run on CentOS 7 while the newer nodes (such as mit_normal and
mit_preemptable) run on the Rocky 8 operating system (OS). Each set of nodes
has a different set of modules, so if you have set up software to run on one OS,
it will probably not work on the other OS.
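To confirm which OS a node is running before building or loading software, one option is to read the standard /etc/os-release file, as in this small sketch. The variable name `os_label` is just for illustration.

```shell
# /etc/os-release is a standard file on both CentOS 7 and Rocky 8.
if [ -r /etc/os-release ]; then
    . /etc/os-release
    os_label="${PRETTY_NAME:-unknown}"
else
    os_label="unknown"
fi
echo "This node is running: $os_label"
```

Running this inside a job on each partition shows whether software set up on one partition was built for the same OS as the other.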
How do I run Jupyter notebooks?¶
You can run Jupyter in a few different ways:
- Web portal for the cluster you're using
- VS Code
- Port forwarding
See our Jupyter documentation.
Xfce desktop has failed to start. How can I fix this?¶
This issue is often caused by Conda setup commands in your ~/.bashrc
file. These commands are added when you run conda init with Miniforge or another
Conda distribution such as Anaconda. We recommend not running conda init, as it can lead to
errors such as this one.
To fix this, remove or comment out all conda setup commands from your
~/.bashrc file.
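If the setup commands were written by conda init, they normally sit between the marker lines `# >>> conda initialize >>>` and `# <<< conda initialize <<<`. The sketch below comments out that block with `sed`, working on a scratch copy with made-up contents so it is safe to try; on the cluster, `bashrc_file` would point at your real ~/.bashrc. Any Conda lines outside the markers would still need to be handled by hand.

```shell
# Demo on a scratch copy; bashrc_file would be "$HOME/.bashrc" on the cluster.
bashrc_file="$(mktemp)"
cat > "$bashrc_file" <<'EOF'
export EDITOR=vim
# >>> conda initialize >>>
__conda_setup="$('/path/to/conda' 'shell.bash' 'hook' 2> /dev/null)"
eval "$__conda_setup"
# <<< conda initialize <<<
alias ll='ls -l'
EOF

# Comment out every line from the opening marker through the closing marker,
# keeping the original as a .bak file next to it.
sed -i.bak '/^# >>> conda initialize >>>/,/^# <<< conda initialize <<</ s/^/# /' "$bashrc_file"

cat "$bashrc_file"
```

Log out and back in (or start a new session) afterwards so the change takes effect.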
How do I use Git on the cluster?¶
We highly encourage using Git on the cluster. It is useful for backing up code and version control, especially when collaborating with others.
We recommend setting up an SSH key with GitHub for security and convenience. This allows you to use the "SSH" link rather than the "HTTPS" link when cloning repositories. To set up an SSH key, follow these steps:
- SSH to the cluster you're using.
- Generate a key pair by entering ssh-keygen -t ed25519 from the command line.
- Press "enter" to save your private and public keys to the default ~/.ssh location. When prompted, optionally enter a passphrase for higher security. You will now have two new files in your ~/.ssh directory: id_ed25519 and id_ed25519.pub.
- Print the contents of your public key (using cat ~/.ssh/id_ed25519.pub) and copy the output.
- Navigate to GitHub.com > click your profile in the top right corner > select "Settings" > "SSH and GPG keys" > "New SSH key".
- Add a title (e.g., "engaging"), paste your public key, and click "Add SSH key".
See GitHub's documentation on SSH keys for more information.
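To see what the key-generation step produces without touching any existing keys, the sketch below runs a non-interactive variant in a scratch directory. The `-f` and `-N ""` options (output path and empty passphrase) are used only to keep the demo self-contained; on the cluster you would normally omit them and answer the prompts as described above.

```shell
# Scratch directory so the sketch never touches real keys in ~/.ssh.
key_dir="$(mktemp -d)"
if command -v ssh-keygen >/dev/null 2>&1; then
    # -t ed25519 selects the key type; -N "" sets an empty passphrase;
    # -f sets the output path; -q suppresses the progress output.
    ssh-keygen -q -t ed25519 -N "" -f "$key_dir/id_ed25519"
    # The .pub file is the part that gets pasted into GitHub.
    cat "$key_dir/id_ed25519.pub"
else
    echo "ssh-keygen not found on this machine"
fi
```

The private key (the file without the .pub extension) should never be shared or pasted anywhere.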
Why doesn't my password work when I try to run the sudo command?¶
Regular users are not allowed to use sudo on Engaging. Engaging is a shared environment, and sudo grants root-level access, which allows system administrators to modify system files, install software, and change permissions. If misused, even accidentally, it could compromise the entire cluster. Use of sudo is therefore reserved for the Engaging system administrators who secure, maintain, and tune the cluster. If you need specific software and are having difficulty installing it, contact orcd-help@mit.edu and a staff member can assist you. Please see https://orcd-docs.mit.edu/software/overview/ for more information.
What is the mit_preemptable partition? What is preemption?¶
The mit_preemptable partition allows you to run programs on lab-owned nodes while they're not being used. While this partition has higher resource limits and longer runtimes than other public partitions like mit_normal and mit_normal_gpu, jobs submitted to mit_preemptable are low priority and preemptable. See Preemptable Jobs for more information.
I got locked out of my Engaging account. How do I restore my access?¶
People sometimes get locked out of their accounts due to repeated failed authentication attempts, specifically from Duo two-factor authentication. When this happens, they get the following message:
This is usually caused by third-party software that connects to Engaging over SSH, such as VS Code. Your account will be automatically reactivated after a bit of time.
There are two things that can help:
- Use Control Channels to reduce the number of times you have to respond to Duo.
- If you use VS Code, adjust the VS Code Remote SSH settings, which will allow for more time to connect and reduce the number of auto-reconnect attempts.
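As one possible illustration of the control-channel suggestion, the sketch below writes an OpenSSH client configuration entry that reuses a single authenticated connection. The alias `engaging`, the placeholder host and user values, the control socket path, and the four-hour persistence are all assumptions for the example. The entry belongs in ~/.ssh/config on your own computer, not on the cluster; the sketch writes to a scratch file so it can be inspected first.

```shell
# Written to a scratch file here; on your own computer the target is ~/.ssh/config.
# ControlMaster auto  : reuse an existing authenticated connection when there is one.
# ControlPath         : where the shared connection's socket file is kept.
# ControlPersist      : keep the shared connection open after the last session exits.
ssh_config_demo="$(mktemp)"
cat > "$ssh_config_demo" <<'EOF'
Host engaging
    HostName <login-node-hostname>
    User <your-username>
    ControlMaster auto
    ControlPath ~/.ssh/control-%r@%h:%p
    ControlPersist 4h
EOF
cat "$ssh_config_demo"
```

With an entry like this in place, only the first connection should trigger a Duo prompt; later connections to the same host share it until the persistence window ends.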
I cannot connect to a compute node using VS Code remote SSH.¶
Sometimes, when following our instructions for running VS Code on the cluster, users are prompted to enter their password when they connect to the compute node and they get "permission denied." This is most often because they do not have an SSH key set up on Engaging. You can do so following these instructions.
I just created an account on Engaging, but I can't run any jobs. What's the problem?¶
Some users get the following error message when trying to submit a job right after creating their account:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
It sometimes takes an extra bit of time for your account to be set up properly so that you can submit jobs. Wait about 15 minutes and try again.
I submitted a job to mit_normal_gpu and it's still pending in the queue. Why is it taking so long?¶
This is most likely because there aren't enough resources available or other jobs are ahead of yours in the queue (see Checking Job Status). To check what resources are available, use the sinfo command. This variation will show what GPU resources exist and are in use on each node in mit_normal_gpu:
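The exact command is not reproduced here. One possible form, using standard `sinfo` output fields, is sketched below; the column widths are arbitrary, and the guard only keeps the snippet harmless on machines without Slurm.

```shell
gpu_partition="mit_normal_gpu"
if command -v sinfo >/dev/null 2>&1; then
    # -N prints one line per node; Gres lists the GPUs a node has and
    # GresUsed lists how many of them are currently allocated.
    sinfo -p "$gpu_partition" -N -O "NodeList:20,StateLong:12,Gres:30,GresUsed:40"
else
    echo "sinfo not found; run this on a cluster login node"
fi
```

Comparing the Gres and GresUsed columns for each node shows where GPUs are currently free.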
The H200s on Engaging are in high demand. Jobs that request an H200 can sometimes wait a few hours until it's their turn to run. During high-demand times, such as leading up to conference deadlines, it can take even longer. Here are some steps you can take to minimize wait time:
- Consider using an L40S instead. L40S GPUs are less powerful than H200s yet much more readily available on Engaging. If your application requires less VRAM than what is available on one or two L40Ss (44GB each), then this is probably a good approach for you. Though H200s are faster, the increased wait time may outweigh the benefits in speedup.
- Request fewer resources (cpus, memory, or GPUs) or a shorter time limit. Slurm takes resource requests and time limits into account when scheduling jobs. Jobs that ask for less tend to start sooner. Use the jobstats command to see what resources you used in your recent jobs.
- Subscribe to a Standard or Advanced Account Level. Users with a valid cost object can pay a monthly fee to run higher-priority jobs and request more resources than the free tier. This doesn't guarantee that your jobs will run immediately, but they should have a shorter wait time overall. More information can be found on our Compute Services page.