GPU Cluster



Sydney GPU Cluster - Early access available

Researchers can get early access to the Sydney GPU Cluster through the Sydney Informatics Hub.

To onboard your research group, please fill out this onboarding form.

For enquiries and support, please use this form: https://sydney.au1.qualtrics.com/jfe/form/SV_5mXyhFZsPIwZDBs

Standard Acknowledgement:

The authors acknowledge the use of the Sydney GPU Cluster, a service provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney.

Run:ai Interface

 

Early access is available now! To get onboarded, provide the Sydney Informatics Hub with your Dashr Portal research project code, plus PC/RC account codes for usage beyond any testing allocation, and we will manually add you to the system.

User onboarding guide for USyd researchers

University of Sydney researchers can use the cluster through NVIDIA Run:ai (documentation available here).

Coming soon: automated role-based access control via your research project on the University's Dashr Portal (ETA: May 2026).

This provides:

runai.jpeg
The Run:ai interface for interactively running workloads

Services

We will progressively roll out managed services backed by compute on the GPU cluster. These include:

Now available

Jupyter and Marimo Notebooks

These give researchers interactive access to the cluster through a web interface.
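As a quick sanity check once a notebook session starts, you can confirm that a GPU has been allocated. This is a minimal sketch, assuming PyTorch is installed in your notebook image; other frameworks have equivalent checks.

```python
# Minimal GPU sanity check for a Jupyter/Marimo notebook session.
# Assumes PyTorch is available in the notebook image.
import torch

print(torch.cuda.is_available())   # True if a GPU is visible to the session
print(torch.cuda.device_count())   # number of GPUs allocated to the workload
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA H200"
```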

Gradio Apps

Gradio apps (in Python), including those you might find on Hugging Face Spaces, can also run on the Sydney GPU Cluster; a minimal example is sketched below.
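For illustration, here is a minimal Gradio app of the kind that could be hosted on the cluster. The greet function is a hypothetical placeholder for your own model or inference code, and the port and server settings are assumptions; follow the onboarding guide for cluster-specific launch details.

```python
import gradio as gr

def greet(name: str) -> str:
    # Hypothetical placeholder; replace with your own model/inference code.
    return f"Hello, {name}!"

# One text input, one text output.
demo = gr.Interface(fn=greet, inputs="text", outputs="text")

if __name__ == "__main__":
    # server_name="0.0.0.0" makes the app reachable from outside the container;
    # the port is an assumption and may be assigned differently on the cluster.
    demo.launch(server_name="0.0.0.0", server_port=7860)
```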

Coming Soon

OpenWebUI

  • An interactive chat platform with OpenWebUI, enabling live analysis and AI code execution.

sih_gpu.jpg
Interactive data analysis and code execution in OpenWebUI

LabelStudio

  • A data labelling platform with Label Studio, for images, text, video, audio and other data types, backed by AI-assisted labelling models running on the cluster, e.g. Segment Anything 2.

labelling.jpg
Some examples of data labelling possible in Label Studio - text, photos, microscopy, aerial photography.

Globus Data Transfer

Triton Inference Server

Cluster Specifications

DGX H200 GPU nodes

3 x NVIDIA DGX H200 nodes with 8 x H200 per node, for a total of 24 H200 GPUs.

 

|                             | H200 GPUs | VRAM (GB) | CPU cores | RAM (GB) | Disk (TB) | FP64 dense TFLOPS (double precision) | Service Units (SU) per hour |
|-----------------------------|-----------|-----------|-----------|----------|-----------|--------------------------------------|-----------------------------|
| Per chunk                   | 1/14      | 10.07     | 1         | 19.6     | 0.3       | 2.4                                  | 36                          |
| Per H200 GPU                | 1         | 141       | 14        | 275      | 3.8       | 34                                   | 504                         |
| Per DGX node (8x H200 GPUs) | 8         | 1128      | 112       | 2200     | 30.7      | 272                                  | 4032                        |
| Whole cluster               | 24        | 3384      | 336       | 6600     | 92.2      | 816                                  | 12096                       |
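As a worked example of reading the table, the sketch below estimates the SU cost of a job; it assumes Service Units accrue per wall-clock hour of allocation, which is an assumption about the billing model.

```python
# Worked example: estimated SU cost of a job.
# Assumption: SUs accrue per wall-clock hour of allocated resources.
su_per_hour_per_gpu = 504   # one full H200 GPU, from the table above
gpus = 2
hours = 10
print(su_per_hour_per_gpu * gpus * hours)  # 10080 SU
```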

Storage

  • DDN EXAScaler Parallel File System - 1 Petabyte

    • Provides high-performance shared storage to DGX nodes

    • Connected via 8x 200GbE Active Optical Cables

    • 1 x Controller: DDN ES400NVX2-NDR200-SE

Networking

  • Cumulus Linux

    • Runs on NVIDIA Spectrum-2 200GbE switches

    • Supports MLAG, VLANs, and routing configurations

  • InfiniBand Networking for high-speed GPU-to-GPU communication

  • Out-of-Band (OOB) Management

Operating System

Job Scheduler & Cluster Management

  • Kubernetes (K8s)

    • Three master nodes for managing the cluster

    • DGX nodes function as Kubernetes worker nodes

  • Run:ai

    • Installed on Kubernetes for workload scheduling and AI workload orchestration

    • Includes namespaces, secrets, backend, and API server configurations