
Why are my jobs not starting?

The job scheduler determines when and where jobs will run. It is a live system that regularly re-prioritises work based on the following considerations:


Fair Share

  • Fair share prioritises jobs based on each project's recent usage of Artemis. Jobs don't run on a "first come, first served" basis.
  • If you are part of a busy project, you may, as an individual, get less CPU time than someone in a project that is not using the system heavily.

Job Size

  • If you submit a job asking for more than 288 cores, it will never run.
  • If you already have jobs running, a queued job cannot start unless the combined total stays within the 288-core limit (e.g. your 90-core job cannot start while you are already using 200 cores).
  • If you submit a job asking for more of a resource than is available (e.g. memory), it will never get the opportunity to run.
  • Asking for a relatively large resource allocation (e.g. many CPUs rather than just a few, or all of the CPUs on a single node) means the scheduler must wait for current jobs to complete, and must schedule future jobs so as to leave a "hole" your job can run in. This can mean a wait even though resources appear to be free (see the sketch after this list).
    For example, asking for 240 cores across 10 nodes requires the scheduler to wait for all jobs to finish on 10 nodes (approximately 1/5th of the total capacity) before your job can run, even though there may already be 240 cores free across the entire cluster.
  • Freeing this much contiguous resource can take time, as there may be a mixture of long-running and short-running jobs already scheduled and running.
  • If you run small jobs, you are more likely to "fill up the gaps"; however, several people all wanting large resource allocations may end up competing with each other.
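
To illustrate, here is a minimal sketch of two alternative resource requests, assuming PBS Pro-style directives (as used on Artemis) and 24-core, 128GB nodes; the project name and values are placeholders, not recommendations:

    #!/bin/bash
    #PBS -P MyProject                    # hypothetical project name
    #PBS -l walltime=12:00:00

    # Large request: 10 whole nodes (240 cores). The scheduler must drain
    # 10 nodes completely before this job can start, even if 240 cores are
    # already free in scattered gaps across the cluster.
    #PBS -l select=10:ncpus=24:mem=120GB

    # Smaller alternative: 2 nodes (48 cores), which is far more likely to
    # "fill up the gaps" and start sooner.
    # #PBS -l select=2:ncpus=24:mem=120GB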

Node capacity limits

  • Be aware of the core and memory limits of each node; asking for more than is available may mean your job never runs.
  • Be aware of system overheads when requesting memory: 128GB nodes have closer to 123GB available to jobs, while 512GB nodes have just over 500GB available (see the example below).
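
For example, a single-node job aimed at a 128GB node should request no more than the usable memory rather than the nominal node size. A minimal sketch, again assuming PBS Pro-style directives; check your own nodes for the exact usable figure:

    # Requesting the full 128GB can never be satisfied once system
    # overheads are taken into account (~123GB usable), so the job would
    # sit in the queue forever. Request a little under the usable limit:
    #PBS -l select=1:ncpus=24:mem=120GB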

Batching jobs

  • If you "batch up" jobs, which individually can run, but collectively consume significant resources, the system will run them in a suitable manner (i.e., keeping you below the 288 core CPU limit) and possibly also lower later jobs priorities due to fair share. This means that whilst your jobs are running, some jobs may end up with high wait times.

Time

  • Some jobs may finish sooner than their requested wall time. This means that your estimated start time may move earlier (if other jobs finish early, are cancelled, fail, etc.) or later (if jobs with higher priority than yours are scheduled).
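
If you want to see where a queued job currently sits, PBS Pro can report an estimated start time (this sketch assumes the qstat -T option is available on your system; the estimate moves as other jobs finish, fail or are scheduled):

    # Show the estimated start time for a queued job (replace the job ID).
    qstat -T 1234567

    # A realistic, shorter walltime in your job script also makes it easier
    # for the scheduler to fit your job into gaps left by other work:
    #PBS -l walltime=04:00:00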