Job Monitoring and Management
You can use the "jobstat" command to monitor your jobs and storage usage on Artemis. For further information on jobstat, run "man jobstat" on Artemis.

    [abcd1234@login3 ~]$ jobstat
    Job Summary for user abcd1234
                                                        Requested -------------------------  Elapsed ------------------------------
    Job ID---- Queue--- Job Name---- Project--- State-- Chunks Cores GPU    RAM Walltime Start Time-- CPU Hours  CPU% Progress End Time
    2269135    large    job1         PANDORA    Running      1    24   -  1.0Gb  20d 20h 11-Jun 04:45    1297.2 99.9%    10.8% 02-Jul 00:45
    2281499    small    job2         PANDORA    Running      1    24   -  1.0Gb  20d 20h 01-Jun 20:46    6654.6 99.7%    55.6% 22-Jun 16:46
    2281502    small    job3         PANDORA    Running      1    24   -  1.0Gb  20d 20h 01-Jun 20:48    6649.7 99.7%    55.6% 22-Jun 16:48
     * Times with an asterix are estimates only
     * End time is start time + walltime so job may finish earlier
     * Progress is accumulated walltime vs specified walltime - so see above

    System Status --------------------------------------------------------------------------------------------------------
    CPU hours for jobs currently executing: 1033302.8
    CPU hours for jobs queued:               176983.7

    Storage Quota Usage ------------------------------------------------
    /home        abcd1234               9.018G     10G
    /project     RDS-TEST-PANDORA-RW    184.2G      1T

    Storage Usage (Filesystems totals) ---------------------------------
    Filesystem Used    Free
    /scratch   378.1Tb 6.6%

Alternatively, you can use standard PBS Professional commands such as "qstat" and "qdel" to monitor and manage jobs. A brief set of useful commands is shown below. For more commands, see the PBS Professional user manual.
Command | Description |
---|---|
qstat -u abcd1234 | show status of abcd1234’s jobs |
qdel 1234567 | delete job 1234567 from queue |
qstat | show status of all jobs |
qstat -f 1234567 | show detailed stats for job 1234567 |
qstat -xf 1234567 | show detailed stats for job 1234567, even after it has finished |
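
When scripting around these commands, it can be handy to pull a single attribute out of the detailed listing. The following is a minimal sketch, not part of the Artemis tooling: the function name `pbs_job_state` is hypothetical, and it assumes the usual PBS Professional layout for `qstat -f` output, where each attribute is printed as `name = value` on its own line.

```shell
# Hypothetical helper: read "qstat -f" (or "qstat -xf") output on stdin
# and print the value of the job_state attribute, for example R or Q.
pbs_job_state() {
    # Split each line on " = " and print the value of the first
    # line whose attribute name is exactly job_state.
    awk -F' = ' '$1 ~ /^[[:space:]]*job_state$/ { print $2; exit }'
}

# Typical use on a system where qstat is available:
#   qstat -xf 1234567 | pbs_job_state
```

Using the `-x` form means the helper still returns a state after the job has left the queue.
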
When a job finishes, it produces three output files: one for standard output, one for standard error, and a resource usage file. The files are named as follows:

    <JobName>.o<JobID> – Standard output file
    <JobName>.e<JobID> – Standard error file
    <JobName>.o<JobID>_usage – Resource usage file

If you don’t redirect standard output or standard error to a file, they are written to the .o and .e files respectively, which only appear after your job finishes. These files may contain useful information about why your job terminated before it finished.
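
As a small illustration of the naming scheme above, the sketch below builds the three file names from a job name and job ID and reports which of them exist in the current directory. The function name `pbs_output_files` and the example job name and ID are hypothetical, and the sketch assumes you run it from the directory where the job wrote its output.

```shell
# Hypothetical helper: print the three file names a finished job is
# expected to produce. $1 is the job name, $2 is the job ID.
pbs_output_files() {
    printf '%s\n' \
        "${1}.o${2}" \
        "${1}.e${2}" \
        "${1}.o${2}_usage"
}

# Example: check which files are present for job 1234567 named TestJob.
pbs_output_files TestJob 1234567 | while read -r f; do
    if [ -e "$f" ]; then
        echo "found:   $f"
    else
        echo "missing: $f"
    fi
done
```
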
The resource usage file contains details about how long your job ran for and also the memory used by your job. You can use the information in the resource usage file to optimise your walltime and memory requests for future jobs. An example resource usage file is shown below:

    Job Id: 1050977.pbsserver for user abcd1234 in queue small
    Job Name: TestJob
    Project: RDS-ICT-PANDORA-RW
    Exit Status: 0
    Walltime requested: 00:03:00 : Walltime used: 00:01:36
    Cpus requested: 48 : Cpu Time: 00:36:38 : Cpu percent: 3102
    Mem requested: 8gb : Mem used: 2342348kb
    VMem requested: None : VMem used: 2342348kb
    PMem requested: None : PMem used: None

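
To make the comparison between requested and used resources concrete, the sketch below reads a resource usage file and reports what fraction of the walltime and memory requests was actually used. It is an illustration only: the function name `usage_summary` is hypothetical, and it assumes each requested/used pair sits on its own line with the same labels and spacing as the example above, with memory given in `gb`, `mb` or `kb`.

```shell
# Hypothetical helper: summarise a resource usage file read from stdin.
usage_summary() {
    awk '
    # Convert an HH:MM:SS string to seconds.
    function to_seconds(t,    p) {
        split(t, p, ":")
        return p[1] * 3600 + p[2] * 60 + p[3]
    }
    # Convert a value such as "8gb" or "2342348kb" to kilobytes.
    function to_kb(v,    n) {
        n = v + 0                      # leading numeric part of the string
        v = tolower(v)
        if (v ~ /gb$/) return n * 1024 * 1024
        if (v ~ /mb$/) return n * 1024
        return n
    }
    # "Walltime requested: 00:03:00 : Walltime used: 00:01:36"
    # Splitting on whitespace puts the two values in fields 3 and 7.
    /^Walltime requested:/ {
        wall_req  = to_seconds($3)
        wall_used = to_seconds($7)
    }
    # "Mem requested: 8gb : Mem used: 2342348kb"
    /^Mem requested:/ {
        mem_req  = to_kb($3)
        mem_used = to_kb($7)
    }
    END {
        if (wall_req > 0)
            printf "Walltime used: %.1f%% of request\n", 100 * wall_used / wall_req
        if (mem_req > 0)
            printf "Memory used:   %.1f%% of request\n", 100 * mem_used / mem_req
    }'
}

# Typical use:
#   usage_summary < TestJob.o1050977_usage
```

For the example file above, this reports 53.3% of the requested walltime and 27.9% of the requested memory, which suggests both requests could be reduced for similar jobs while still leaving a safety margin.
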