CLA Compute Cluster Intro

 

A cluster is a group of servers configured to work together so that, from the user's standpoint, they can essentially be treated as one big computer. Configuring servers this way allows for better resource allocation: users can request and reserve the resources they will need for the duration of their job. The cluster also provides parallel processing capabilities, in which a large task is broken down into smaller tasks that run simultaneously, which can greatly reduce the processing time needed to analyze the data.

 

Hardware and Configuration

 

The CLA cluster consists of a single head node (compute.cla.umn.edu) and six compute nodes.

Each compute node has two (2) 10-core processors, for a total of 120 physical cores across the six nodes (or 240 logical cores with Hyper-Threading). Three of the compute nodes have 256 GB of RAM each and the other three have 320 GB each, for a total of roughly 1.7 TB of RAM. To manage these resources and allocate them efficiently, the cluster uses the TORQUE resource manager in conjunction with the Maui job scheduler. The cluster also has a small dedicated GPU available for testing code designed to run on this type of processor. If you would like to use the GPU, please let us know so that we can provide access to the appropriate queue.

 

Who can use the cluster?

 

Anyone with a CLA Linux/Unix account can use compute.cla and submit jobs to the batch and/or interactive queues. Other queues are restricted, with access granted on a case-by-case basis according to need. Please see the table in the Queues and Resource Limits section below for more information.


How do I use the cluster?

 

To use the cluster, you can either connect to LTS using an NX client and then ssh to compute.cla.umn.edu, or you can ssh directly to compute.cla.umn.edu. More detailed information on accessing the cluster, including which connection method to choose, can be found at z.umn.edu/ltsconnect.
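As a minimal sketch, a first session might look like the following. Here "myid" is a placeholder for your University internet ID, and the resource values are illustrative rather than recommendations:

```shell
# Connect to the head node (replace "myid" with your internet ID)
ssh myid@compute.cla.umn.edu

# Request an interactive session on a compute node using TORQUE's
# qsub -I flag: 2 cores, 8 GB of RAM, for up to 4 hours
qsub -I -q interactive -l nodes=1:ppn=2,mem=8gb,walltime=04:00:00
```

When the scheduler has allocated the resources, the interactive session drops you into a shell on one of the compute nodes.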

 

Why should I use the cluster?

 

There are several advantages to using the cluster instead of one of the CLA standalone computers. To start with, the cluster runs on newer hardware, with faster CPUs and more RAM than the standalone machines. It also has a GPU available for those interested in testing out GPU code. But the main reason to use the cluster has to do with resource allocation. When you submit a job on the cluster, you can specify parameters such as the number of cores you want, the amount of RAM you will need, and the maximum number of hours you expect your job to take. These resources are then reserved for you for the time requested or until your job finishes, whichever comes first.

 

If you were to run that same job on a standalone computer instead of the cluster, it would contend for resources with whatever the other users on that computer happened to be doing at the time. If they were running jobs that required a lot of CPU and/or RAM, for example, there might not be enough resources left for your job to run in a timely manner. If you have ever logged in to a CLA standalone computer and found the system sluggish, or found that the program you were trying to use took an unusually long time to load, that is an indication the system was bogged down by too many users contending for too few resources.

This won't happen on the compute cluster, because resources are allocated based on what each user requests for their job. You can specify, for example, that you will need 2 cores and 16 GB of RAM for 24 hours, and those resources will be reserved for your use for that amount of time. On a standalone computer you get whatever resources happen to be available, and since that depends on what other users are doing at the time, there is no guarantee they will be sufficient to run your job in a timely manner.
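The 2-core, 16 GB, 24-hour request described above can be expressed as a TORQUE job script. This is a minimal sketch; the job name and the program it runs (./my_program) are hypothetical placeholders:

```shell
#!/bin/bash
#PBS -N example-job          # job name shown in qstat output
#PBS -l nodes=1:ppn=2        # 1 node, 2 processor cores
#PBS -l mem=16gb             # 16 GB of RAM
#PBS -l walltime=24:00:00    # reserved for up to 24 hours

# TORQUE starts the job in your home directory; change to the
# directory the job was submitted from
cd "$PBS_O_WORKDIR"

# Run your program (hypothetical placeholder)
./my_program
```

Submitting this with `qsub` reserves those resources until the job finishes or the walltime expires, whichever comes first.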

 

TORQUE and Maui

 

As outlined above, a cluster differs from a group of standalone computers in that it offers users a 'single system image' in terms of the management of their jobs and the aggregate compute resources available. The software that CLA uses to integrate a group of servers into a cluster consists of two packages: TORQUE and Maui.

 

TORQUE is a resource management system that is used for submitting and controlling jobs on the cluster. TORQUE manages jobs that users submit to various queues, with each queue representing a group of resources with attributes that are specific to that particular queue. TORQUE is based on the original open source Portable Batch System (OpenPBS) project and the terms TORQUE and PBS are often used interchangeably.
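In practice, the TORQUE commands you will use most often are sketched below; the script name and job ID are placeholders:

```shell
qsub myjob.pbs    # submit a job script; prints the assigned job ID
qstat -u $USER    # list your own queued and running jobs
qstat -q          # show the available queues and their limits
qdel 6822         # delete a queued or running job by its ID
```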

 

Whereas TORQUE provides a mechanism for submitting, launching, and tracking jobs on the cluster, the Maui Cluster Scheduler is used to manage and schedule those jobs, determining when, where, and how jobs are run so as to maximize the output of the cluster. Maui is highly configurable and optimized to support an array of scheduling policies.
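Maui also ships with its own query commands; a couple of useful ones (assuming they are in your PATH on the head node; the job ID is a placeholder) are:

```shell
showq           # Maui's view of running, idle, and blocked jobs
checkjob 6822   # details on why a given job is (or is not) running yet
```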

 

Queues and Resource Limits

 

The table below shows the default and maximum compute resources for each queue. The interactive and batch queues are available to everyone, but access to the highmem, GPU, and multinode queues is restricted. If you find you need more resources than the batch queue provides, let us know and we can add you to the highmem or multinode queues as needed. Likewise, if you would like to use the GPU, please let us know so that we can grant you access to that queue.

 

 

              Access   Inter-   ---------- Default ----------   ---------- Maximum ----------
Queue Name    Control  active?  Walltime   RAM(GB) Nodes Procs  Walltime   RAM(GB) Nodes Procs
              List
-----------   -------  -------  ---------  ------- ----- -----  ---------  ------- ----- -----
interactive   None     Yes      24:00:00   8       1     2      168:00:00  64      1     10
batch         None     No       48:00:00   8       1     2      168:00:00  64      1     10
highmem       Yes      No       8:00:00    96      1     4      96:00:00   250     1     40
GPU           Yes      Yes      8:00:00    96      1     4      96:00:00   250     1     40
multinode     Yes      No       8:00:00    -       1     4      336:00:00  -       3     120
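To target a specific queue, pass its name to qsub with the -q flag. A brief sketch (myjob.pbs is a placeholder script name):

```shell
# Default batch queue, with explicit resource requests
qsub -l nodes=1:ppn=2,mem=8gb,walltime=48:00:00 myjob.pbs

# Restricted queues work the same way once you have been granted
# access, e.g. the highmem queue:
qsub -q highmem -l nodes=1:ppn=4,mem=96gb myjob.pbs
```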

 

 

Storage / scratch

 

Each of the compute nodes has a /scratch.local directory at the root level of the file system. Working with local files is much faster than working with files accessed over the network (e.g., files in a home directory or in /labs), so local scratch space is provided on each node to allow for faster file access. For example, if your program writes out files of intermediate and/or final results, writing them to /scratch.local on the compute node will give faster read/write access than writing them to a network share.

When you submit a job, a directory is automatically created in /scratch.local for your scratch files. The name of the directory is the PBS job ID of your job followed by ".compute.cla.umn.edu", e.g., 6822.compute.cla.umn.edu. Please note that for any files in /scratch.local that you want to save, your script MUST include commands to copy them from the local scratch directory to a destination on a network share (e.g., your home directory, /labs/$labname/foldername, etc.): once your job terminates, your files in /scratch.local are deleted and cannot be recovered.
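The scratch workflow can be sketched in a job script like the one below. The program name (my_analysis) and file paths are hypothetical; the scratch directory name follows the job-ID convention described above, which TORQUE exposes via the $PBS_JOBID environment variable:

```shell
#!/bin/bash
#PBS -l nodes=1:ppn=2,mem=8gb,walltime=04:00:00

# TORQUE sets $PBS_JOBID to something like 6822.compute.cla.umn.edu,
# matching the per-job directory created in /scratch.local
SCRATCH=/scratch.local/$PBS_JOBID

# Work in local scratch for fast read/write access
cd "$SCRATCH"
my_analysis --input "$HOME/data/input.csv" --output results.csv

# Copy anything worth keeping back to a network share BEFORE the job
# ends -- files left in /scratch.local are deleted when it terminates
cp results.csv "$HOME/results/"
```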

 
