Monitoring jobs on compute.cla

Checking Job Status

To see the status of your job, enter the following at the command prompt:

    user@compute:~$ qstat -a

The result will look something like the following:

                                                                                                Req'd       Elap
    Job ID                  Username    Queue    Jobname          SessID NDS   TSK    Memory    Time      S Time
    ----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
    9876.compute.cla.umn.e  user        batch    myjob.pbs        60476  1     2      8gb       1:00:00   R 0:00:10

Most of the columns are self-explanatory. The main one to note is the State (“S”) column, where “R” indicates that the job is running. Other values you may see in that column are “Q” for “queued”, “E” for “exiting”, and “C” for “completed”.
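When many jobs are listed, it can help to pull out only those in one state. The following is a minimal sketch, assuming the column layout shown above (state in the tenth whitespace-separated field); jobs_in_state is a hypothetical helper name, not a PBS command.

```shell
# List the IDs of jobs in a given state, reading `qstat -a` output on stdin.
# Assumes the state code is the 10th field, as in the sample output above.
# jobs_in_state is a hypothetical helper, not part of PBS/Torque.
jobs_in_state() {
    # $1: one-letter state code, e.g. R, Q, E or C
    # Only rows whose first field starts with a digit are job rows;
    # this skips the header and separator lines.
    awk -v state="$1" '$1 ~ /^[0-9]/ && $10 == state { print $1 }'
}

# Typical use on the cluster:
#   qstat -a | jobs_in_state Q
```

If your site's qstat prints a different column layout, the field number in the awk expression would need to change.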

The “qstat -f” command gives more detailed information about the jobs you have in the queue, including, for example, the execution host(s), variable list, and walltime remaining. For more information on the qstat command, see its man page (“man qstat”).
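The full listing is long, so a small filter can pick out a single attribute. This is a sketch only: it assumes the listing prints one "name = value" pair per line, and the attribute names in the usage comment (job_state, exec_host) are assumptions about your scheduler's output. job_field is a hypothetical helper name.

```shell
# Print the value of one attribute from `qstat -f` output read on stdin.
# Assumes attributes appear as "    name = value", one per line.
# job_field is a hypothetical helper, not part of PBS/Torque.
job_field() {
    # $1: attribute name exactly as printed, e.g. job_state
    awk -v key="$1" '
        {
            pos = index($0, " = ")
            if (pos == 0) next                  # not a "name = value" line
            name = substr($0, 1, pos - 1)
            sub(/^[ \t]+/, "", name)            # drop leading indentation
            if (name == key) {
                print substr($0, pos + 3)       # everything after " = "
                exit
            }
        }
    '
}

# Typical use on the cluster (9876 is the example job number from above):
#   qstat -f 9876 | job_field job_state
#   qstat -f 9876 | job_field exec_host
```

Values that wrap onto continuation lines, such as a long variable list, would be cut off at the first line by this sketch.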

Checking Job Array Status

To check the status of an entire job array, run qstat with the -t option. Each array element appears as a separate job in the queue, and the normal scheduling rules apply to each element. The name of the array is the job number assigned by PBS followed by a set of brackets. For example, if the assigned job number is 9876, the entire job array is denoted as 9876[] and the individual jobs are 9876[1], 9876[2], 9876[3], and so on.

    user@compute:~$ qstat -t

    Job ID                    Name             User            Time Use S Queue
    ------------------------- ---------------- --------------- -------- - -----
    2868[1].compute           test.pbs-1       user            00:00:06 R batch
    2868[2].compute           test.pbs-2       user                   0 R batch
    2868[3].compute           test.pbs-3       user                   0 Q batch
    2868[4].compute           test.pbs-4       user                   0 Q batch
    2868[5].compute           test.pbs-5       user                   0 Q batch
    2868[6].compute           test.pbs-6       user                   0 Q batch
    2868[7].compute           test.pbs-7       user                   0 Q batch
    2868[8].compute           test.pbs-8       user                   0 Q batch
    2868[9].compute           test.pbs-9       user                   0 Q batch
    2868[10].compute          test.pbs-10      user                   0 Q batch
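For large arrays, a per-state count is often easier to read than the full listing. The sketch below assumes the column layout shown above (state in the fifth field); array_state_counts is a hypothetical helper name, not a PBS command.

```shell
# Count array elements per state, reading `qstat -t` output on stdin.
# Assumes the state code is the 5th field, as in the listing above.
# array_state_counts is a hypothetical helper, not part of PBS/Torque.
array_state_counts() {
    # $1: job number of the array, e.g. 2868
    awk -v id="$1" '
        # Keep only rows whose Job ID starts with "<number>[".
        index($1, id "[") == 1 { count[$5]++ }
        END { for (s in count) print s, count[s] }
    ' | sort
}

# Typical use on the cluster:
#   qstat -t | array_state_counts 2868
```

Each output line is a state code followed by the number of array elements in that state.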

Checking Job Logs


By default, PBS writes both the stdout and stderr logs to the job submission directory. (See the document on submitting jobs for information on how to have PBS log to another location.) If your job doesn’t run as expected, check the stderr log for errors. If you submit a large batch of jobs, an easy way to check for errors is to look for stderr files whose size is greater than 0.

    user@compute:~$ find . -type f -name "$JOBNAME.e*" -size +0c | xargs grep -iv loaded

Note: Module file loads and unloads are written to stderr, so Torque error logs record them even when nothing went wrong. The “| xargs grep -iv loaded” portion of the above command is a workaround that hides lines containing “loaded”. If you aren’t loading any modules when you run your job, you can omit that portion and just check for error files that have a size greater than 0.
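If you only want the names of the logs that need attention, rather than their contents, the same idea can be wrapped in a small function. This is a sketch under the same assumption as the command above, namely that module messages always contain the word “loaded”; failed_logs is a hypothetical helper name.

```shell
# List stderr logs that contain anything other than module load messages.
# The "loaded" pattern mirrors the grep filter used above and may need
# adjusting for your module system.
# failed_logs is a hypothetical helper, not part of PBS/Torque.
failed_logs() {
    # $1: job name (the part before ".e<jobid>")
    # $2: directory to search (defaults to the current directory)
    find "${2:-.}" -type f -name "$1.e*" -size +0c |
    while IFS= read -r log; do
        # grep -qiv succeeds if at least one line does NOT contain "loaded"
        if grep -qiv loaded "$log"; then
            printf '%s\n' "$log"
        fi
    done
}

# Typical use from the job submission directory:
#   failed_logs myjob.pbs
```

Logs that are empty, or that contain only module messages, are not listed.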