This section explains how to deploy a distributed inference workload using the NVIDIA Run:ai UI and CLI.
Distributed means a single workload that is accessed through a single endpoint URL but is spread across many GPUs.
Keep in mind that inference workloads count against the project's quota. They are assigned very-high priority
by default, which makes them non-preemptible workloads, and they are not allowed to exceed the project's quota.
In the Run:ai UI, go to the Workload manager tab > Workloads.
Click NEW WORKLOAD and select Inference from the dropdown menu.

Select the project in which you want to deploy the inference workload.
Select the inference type:
- Custom – to deploy a custom model or a model other than an LLM
- Hugging Face – to deploy a model from Hugging Face
- NVIDIA NIM – to deploy a model from NVIDIA NIM

For this tutorial we will deploy a model from Hugging Face:

Enter a name for your workload and click CONTINUE.
On the next page, in the Model section, enter the model name exactly as it is displayed on Hugging Face:

If the model requires an access token, you can provide it directly at this step or choose a previously added token from your credential assets:

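As an optional sanity check before deploying, you can verify that your token actually has access to the model. The sketch below uses the official huggingface_hub Python library; the model name and token value are placeholders to substitute with your own:

```python
from huggingface_hub import model_info

# Placeholder values: substitute your model's repo ID and your own token.
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
HF_TOKEN = "hf_..."  # your Hugging Face access token

# Raises an error (401/403) if the token cannot access the repository,
# e.g. for gated models whose license you have not yet accepted.
info = model_info(MODEL_ID, token=HF_TOKEN)
print(f"OK: {info.id} is accessible with this token")
```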
Select the GPU fraction. For example, to deploy distributed inference on 4 GPUs where 1 GPU = 1 model replica, select a one-GPU fraction and, in the Replica autoscaling section, set both the minimum and maximum number of replicas to 4.

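The relationship between the total GPU count, the GPUs assigned to each replica, and the replica count is worth keeping explicit, especially if you later change the fraction. A small illustrative calculation (all values are examples):

```python
# Illustrative only: how the replica count relates to the GPU allocation.
total_gpus = 4        # total GPUs the workload should span
gpus_per_replica = 1  # GPUs assigned to one model replica

replicas = total_gpus // gpus_per_replica  # -> 4
print(f"Set minimum and maximum replicas to {replicas}")
```

Setting the minimum and maximum to the same value pins the replica count, so the workload will not autoscale.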
In the Environment section, select the inference server and, if needed, set additional arguments:

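What the additional arguments mean depends on the server you selected. As a hedged illustration, if the selected inference server happens to be vLLM, the arguments are vLLM's own CLI flags; the flag names below are real vLLM options, but the values are purely illustrative and may not suit your model or GPUs:

```python
# Example extra arguments for a vLLM-based inference server (values illustrative).
extra_args = [
    "--max-model-len", "4096",           # cap the context length to fit GPU memory
    "--gpu-memory-utilization", "0.90",  # fraction of GPU memory the server may claim
    "--dtype", "bfloat16",               # weight/activation precision
]
print(" ".join(extra_args))  # joined form, if the UI field expects a single string
```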
In the Model store section, select your data source – for example a PVC. If you do not select a data source here, the model will be deployed on the local storage of a cluster node. For distributed inference it is recommended to use a PVC data source so that model replicas deploy faster; this is especially important if the workload is redeployed on a different node.

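If you would rather pre-create the underlying PVC yourself instead of through a Run:ai data source asset, the following is a minimal sketch using the official Kubernetes Python client. The namespace, storage class, and size are assumptions to adapt to your cluster (Run:ai projects typically map to namespaces named runai-&lt;project&gt;), and ReadWriteMany access is what lets replicas on different nodes share the same volume:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside a pod

# Assumptions: adapt the namespace, storage class, and size to your cluster.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "model-store"},
    "spec": {
        # ReadWriteMany lets replicas on different nodes mount the same volume.
        "accessModes": ["ReadWriteMany"],
        "storageClassName": "nfs-client",  # assumption: any RWX-capable class
        "resources": {"requests": {"storage": "100Gi"}},
    },
}

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="runai-my-project", body=pvc
)
```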
Click CREATE INFERENCE to start the deployment.
Depending on the size of the model, the deployment should take a few minutes. Once the workload is up and running, you can start using it.
You will find the workload endpoint in the Connection(s) column; if it is not visible, add it using the COLUMNS icon:



The endpoint in Run:ai will appear with http; change it to https to be able to access it from the internet.
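Once the endpoint is reachable, you can query it programmatically. The sketch below assumes the deployed server exposes an OpenAI-compatible chat completions API (vLLM-based Hugging Face deployments commonly do); the endpoint URL and model name are placeholders, and the bearer token is only needed if your endpoint enforces authentication:

```python
import requests

# Placeholders: substitute your workload's endpoint URL and model name.
ENDPOINT = "https://my-inference.example.com"  # use https, not the http shown in the UI
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

resp = requests.post(
    f"{ENDPOINT}/v1/chat/completions",
    # Uncomment if your endpoint requires authentication:
    # headers={"Authorization": "Bearer <token>"},
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Hello! Who are you?"}],
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```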