Designing Scheduler Plugins for NVLink-Connected RISC-V + GPU Nodes

2026-02-27

Build topology-aware Kubernetes and Slurm scheduler plugins that use NVLink Fusion on RISC‑V nodes to optimize multi‑GPU AI workloads and reduce cross‑node traffic.

You manage GPU-bound AI workloads on emerging RISC-V servers connected to NVIDIA GPUs via NVLink Fusion. Your users complain about unpredictable throughput, high inter-GPU traffic costs, and poor multi-GPU scaling. Standard Kubernetes and Slurm scheduling treats GPUs as fungible devices — it doesn’t know which GPUs share ultra-high-bandwidth NVLink links or which GPUs are separated by slower PCIe hops. The result: suboptimal placements, long training epochs, and frustrated developers.

In 2026, with SiFive and other vendors integrating NVLink Fusion into RISC-V platforms and broader NVLink adoption across clouds, building topology-aware scheduler plugins is no longer experimental — it’s required to get predictable performance and efficient resource usage from GPU farms. This guide walks you through designing production-grade scheduler plugins for both Kubernetes and Slurm that understand NVLink topologies and optimize GPU-bound workloads.

The problem space in 2026: why topology awareness is critical

Key 2025–2026 trends drive this need:

  • SiFive’s NVLink Fusion integration with RISC‑V IP (announced late‑2025/early‑2026) means RISC‑V servers can directly leverage GPU fabrics, changing server-level topology assumptions.
  • AI training and inference increasingly use multi‑GPU, GPU‑to‑GPU memory‑heavy patterns that are highly sensitive to link bandwidth and latency.
  • Multi-cloud and hybrid deployments mix machines with different NVLink topologies — fully connected islands, partial meshes, or single-link pairs — so a one-size-fits-all allocation policy fails.

Rule of thumb: for bandwidth‑sensitive workloads, placing GPUs that share NVLink links (or within the same NVLink island) can produce >2× speedups versus naive placements that span non‑NVLink hops.

Core concepts your plugin must expose and use

  • NVLink Topology Graph: per node graph of GPUs and NVLink edges with bandwidth/latency weights.
  • GPU Groups / Islands: connected components or cliques in the NVLink graph that should be scheduled together for P2P workloads.
  • MIG-awareness: support GPU partitions (MIG slices) and map NVLink edges to MIG instances where possible.
  • CPU affinity mapping: bind RISC‑V CPU threads to GPUs that have closest NVLink attachment if kernels rely on CPU↔GPU shared memory semantics.
  • Policy primitives: pack vs spread, minimize cross-island traffic, cost functions combining bandwidth and cloud egress cost.
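The island concept above reduces to a connected-components pass over the per-node NVLink graph. A minimal sketch with hypothetical types (adjacency list keyed by GPU UUID; union-find works equally well):

```go
package main

import (
	"fmt"
	"sort"
)

// islands returns the connected components of an NVLink adjacency list.
// Components and their members are sorted for deterministic output.
func islands(adj map[string][]string) [][]string {
	seen := map[string]bool{}
	ids := make([]string, 0, len(adj))
	for id := range adj {
		ids = append(ids, id)
	}
	sort.Strings(ids)

	var out [][]string
	for _, id := range ids {
		if seen[id] {
			continue
		}
		// BFS from this GPU across NVLink edges
		comp := []string{}
		queue := []string{id}
		seen[id] = true
		for len(queue) > 0 {
			cur := queue[0]
			queue = queue[1:]
			comp = append(comp, cur)
			for _, peer := range adj[cur] {
				if !seen[peer] {
					seen[peer] = true
					queue = append(queue, peer)
				}
			}
		}
		sort.Strings(comp)
		out = append(out, comp)
	}
	return out
}

func main() {
	adj := map[string][]string{
		"gpu0": {"gpu1"}, "gpu1": {"gpu0"},
		"gpu2": {"gpu3"}, "gpu3": {"gpu2"},
	}
	fmt.Println(islands(adj)) // two islands: [gpu0 gpu1] and [gpu2 gpu3]
}
```

For P2P-heavy workloads you may prefer cliques (every pair directly linked) over connected components; the traversal is the same, with an extra pairwise-edge check.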

Topology discovery

Start with node-level discovery. Typical methods in 2026 include:

  • NVML (NVIDIA Management Library): the authoritative API for querying NVLink peers and per-link bandwidth/attributes. Use the NVML bindings (C/Go/Python) on each node to build the topology graph.
  • nvidia-smi topo -m: quick CLI snapshot for human inspection; not ideal for automation but useful for validation.
  • DCGM (Data Center GPU Manager): continuous metrics (link utilization, ECC errors, GPU utilization) and a Prometheus exporter for cluster observability.
  • hwloc and PCI topology: can provide CPU/GPU NUMA and PCIe locality; combine with NVML for a complete picture.

Example: building a topology discovery agent (pseudo-Go)

package main

// Pseudocode: use NVML to enumerate GPUs and NVLink peers. Real bindings
// (e.g., NVIDIA's go-nvml) return (value, status) pairs and need error checks.
func discoverTopology() NodeTopology {
    nvml.Init()
    defer nvml.Shutdown()

    var topo NodeTopology
    for i := 0; i < nvml.DeviceCount(); i++ {
        dev := nvml.Device(i)
        id := dev.UUID()
        topo.AddNode(id)
        for link := 0; link < dev.NvlinkLinkCount(); link++ {
            if peer, ok := dev.NvlinkPeerDevice(link); ok {
                bw := dev.NvlinkBandwidth(link)
                topo.AddEdge(id, peer.UUID(), bw)
            }
        }
    }
    return topo
}

Persist the discovered graph as JSON and export it as a Node annotation or a small HTTP endpoint (e.g., /metrics/topology) for the scheduler to fetch.

Designing the Kubernetes plugin

For Kubernetes (2026), implement a combination of:

  • Device Plugin to advertise GPU groups as allocatable resources (e.g., nvidia.com/gpu-group-<id>), leveraging the Device Plugin API's Allocate and mount semantics.
  • Scheduler Framework Plugin implementing Filter and Score phases so the kube-scheduler can prefer nodes and GPUs that satisfy NVLink locality.
  • Admission/Mutating webhook or an Operator to translate high-level Pod requests (e.g., request for 2 GPUs with nvlink locality) into explicit resource requests/affinities.

Device plugin behavior

The device plugin should:

  1. Read the local NVLink graph and partition GPUs into groups/islands.
  2. Expose each island as a virtual device (resource name: nvidia.com/gpu-group-<id>) and expose MIG slices as subdevices when needed.
  3. On allocation, return environment variables and device nodes for the exact GPUs assigned (CUDA_VISIBLE_DEVICES style).
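Step 3 can be as simple as emitting a deterministic CUDA_VISIBLE_DEVICES value for the concrete GPUs backing the allocated island (a sketch; index-based visibility assumed, UUIDs also work):

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// allocateEnv builds the CUDA_VISIBLE_DEVICES value for the concrete GPU
// indices backing an allocated island, in stable ascending order so
// repeated allocations of the same island produce identical env vars.
func allocateEnv(gpuIndices []int) string {
	sorted := append([]int(nil), gpuIndices...)
	sort.Ints(sorted)
	parts := make([]string, len(sorted))
	for i, idx := range sorted {
		parts[i] = strconv.Itoa(idx)
	}
	return "CUDA_VISIBLE_DEVICES=" + strings.Join(parts, ",")
}

func main() {
	fmt.Println(allocateEnv([]int{2, 0, 1, 3})) // CUDA_VISIBLE_DEVICES=0,1,2,3
}
```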

Scheduler plugin design

Use the kube-scheduler's Scheduler Framework APIs and implement these extension points:

  • FilterPlugin: reject nodes that do not include the requested GPU group or where the requested count cannot be satisfied within a single NVLink island if the Pod specifies locality=true.
  • ScorePlugin: score nodes higher when the node offers GPUs with more NVLink bandwidth among selected GPUs, or when the node reduces expected cross-node traffic.
  • Permit/Reserve phases: optionally reserve GPU groups for longer-running batch jobs to avoid fragmentation.

Go snippet: Filter and Score (simplified)

// Filter: ensure the node has an island that can provide the requested GPU count
func (p *NVLinkFilter) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    requested, err := strconv.Atoi(pod.Annotations["gpu.request.count"])
    if err != nil || requested <= 0 {
        return framework.NewStatus(framework.Unschedulable, "missing or invalid gpu.request.count annotation")
    }
    if !nodeHasIslandWithCount(nodeInfo, requested) {
        return framework.NewStatus(framework.Unschedulable, "no suitable NVLink island")
    }
    return framework.NewStatus(framework.Success, "")
}

// Score: give more points to nodes where selected GPUs have higher summed link bandwidth
func (p *NVLinkScore) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
    bw := computeIslandBandwidth(nodeName, pod)
    score := normalizeBandwidthToScore(bw)
    return score, framework.NewStatus(framework.Success, "")
}
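One plausible implementation of the `normalizeBandwidthToScore` helper used above, assuming the kube-scheduler's 0..100 score range and a configured bandwidth ceiling (the ceiling is an assumption you tune per fleet):

```go
package main

import "fmt"

// normalizeBandwidthToScore maps an aggregate NVLink bandwidth (GB/s) onto
// the kube-scheduler's 0..100 score range, saturating at maxBW so one very
// large island cannot dominate all other scoring signals.
func normalizeBandwidthToScore(bwGBs, maxBW float64) int64 {
	if maxBW <= 0 || bwGBs < 0 {
		return 0
	}
	if bwGBs >= maxBW {
		return 100
	}
	return int64(bwGBs / maxBW * 100)
}

func main() {
	// e.g., 450 GB/s of summed link bandwidth against a 900 GB/s ceiling
	fmt.Println(normalizeBandwidthToScore(450, 900)) // 50
}
```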

Pod request model

Clients should request GPUs with a simple annotation model (the plugin’s mutating webhook can translate ergonomic labels to resources):

apiVersion: v1
kind: Pod
metadata:
  name: train
  annotations:
    gpu.request.count: "4"
    gpu.request.locality: "nvlink"    # nvlink|any
spec:
  containers:
  - name: trainer
    image: mymodel:latest
    resources:
      limits:
        nvidia.com/gpu-group: 1  # device plugin will map to actual GPUs
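The webhook's translation step can be modeled as a pure function over the annotations. A sketch (the resource names are whatever your device plugin actually registers; this is not the webhook API itself):

```go
package main

import (
	"fmt"
	"strconv"
)

// translateRequest turns the ergonomic annotations into the explicit
// device-plugin resource the Pod spec should carry.
func translateRequest(ann map[string]string) (resource string, qty int, err error) {
	count, err := strconv.Atoi(ann["gpu.request.count"])
	if err != nil || count <= 0 {
		return "", 0, fmt.Errorf("invalid gpu.request.count %q", ann["gpu.request.count"])
	}
	if ann["gpu.request.locality"] == "nvlink" {
		// request one island that internally provides `count` GPUs
		return "nvidia.com/gpu-group", 1, nil
	}
	// no locality constraint: fall back to plain per-GPU resources
	return "nvidia.com/gpu", count, nil
}

func main() {
	r, q, _ := translateRequest(map[string]string{
		"gpu.request.count": "4", "gpu.request.locality": "nvlink",
	})
	fmt.Println(r, q) // nvidia.com/gpu-group 1
}
```

Keeping this logic in a pure function makes it trivial to unit-test before wiring it into the admission webhook.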

Designing the Slurm plugin

Slurm is widely used in HPC and continues to be a first-class scheduler in 2026. For NVLink-aware scheduling, extend Slurm with one or both of these approaches:

  • GRES (Generic RESource) definitions: export GPU groups as GRES entries and let Slurm treat them as resources (e.g., gpu_group:g0).
  • Custom Select/Job Submit Plugins: implement a select plugin (or extend select/cons_tres) that understands NVLink islands and enforces placement constraints at allocation time.

Example: gres.conf and slurm.conf entries

# /etc/slurm/gres.conf
Name=gpu Type=tesla File=/dev/nvidia[0-3]
# NVLink island groups exposed as a custom GRES; the Type value names the
# island and is consumed by the custom select plugin described below
Name=gpu_group Type=g0 Count=1
Name=gpu_group Type=g1 Count=1

# /etc/slurm/slurm.conf (relevant lines)
GresTypes=gpu,gpu_group
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# Node definition example
NodeName=rv-node001 Gres=gpu:4,gpu_group:2 RealMemory=524288 Sockets=2 CoresPerSocket=64
PartitionName=gpu Nodes=rv-node001 Default=YES MaxTime=INFINITE State=UP

Then users can request an NVLink island explicitly:

# sbatch example
sbatch --gres=gpu_group:g0:1 --wrap="srun --gres=gpu:2 ./train.sh"

Custom select plugin behavior

Implement logic in the select plugin to:

  • Load the node's NVLink graph (from local agent output or a central metadata store).
  • When a job asks for multiple GPUs or a GPU group, search for a set of GPUs within a single island (or minimize inter-island links) and bind them.
  • Respect MIG slices and ensure allocations do not overlap.

Allocation algorithms and heuristics

Scheduling decisions require a cost model. Below are practical algorithms you can implement in both Kubernetes and Slurm plugins.

1) Island-first packing (fast, deterministic)

Find the smallest island that can satisfy the request. Good for maximizing P2P bandwidth and reducing cross-island fragmentation. Use when most jobs are GPU‑heavy and require low latency.
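A sketch of island-first packing over per-island free-GPU counts: best fit by island size, with a deterministic tie-break on island id (the data shapes are illustrative):

```go
package main

import "fmt"

// smallestFit returns the id of the smallest island with at least `need`
// free GPUs, or "" if none fits. Choosing the smallest viable island keeps
// larger islands intact for bigger jobs and limits fragmentation.
func smallestFit(free map[string]int, need int) string {
	best := ""
	for id, n := range free {
		if n < need {
			continue
		}
		if best == "" || n < free[best] || (n == free[best] && id < best) {
			best = id
		}
	}
	return best
}

func main() {
	free := map[string]int{"g0": 8, "g1": 4, "g2": 2}
	fmt.Println(smallestFit(free, 4)) // g1: smallest island that still fits
}
```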

2) Weighted bandwidth scoring (flexible)

Compute a score for candidate sets S of GPUs:

score(S) = alpha * sum(pairwise_bandwidths(S)) - beta * cross_node_penalty - gamma * preemption_cost

Tune alpha/beta/gamma based on workload priorities.
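The cost function translates directly into code. A sketch with illustrative inputs (pairwise bandwidths are assumed precomputed by the caller from the topology graph):

```go
package main

import "fmt"

// Weights for the placement cost model; tune per workload mix.
type Weights struct{ Alpha, Beta, Gamma float64 }

// scoreSet implements
//   score(S) = alpha*sum(pairwise_bandwidths(S)) - beta*cross_node_penalty - gamma*preemption_cost
// pairBW holds the NVLink bandwidth between each selected GPU pair (GB/s).
func scoreSet(pairBW []float64, crossNodePenalty, preemptionCost float64, w Weights) float64 {
	sum := 0.0
	for _, bw := range pairBW {
		sum += bw
	}
	return w.Alpha*sum - w.Beta*crossNodePenalty - w.Gamma*preemptionCost
}

func main() {
	w := Weights{Alpha: 1.0, Beta: 0.5, Gamma: 2.0}
	// two fully linked pairs at 900 GB/s each, modest penalties
	fmt.Println(scoreSet([]float64{900, 900}, 100, 10, w)) // 1800 - 50 - 20 = 1730
}
```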

3) Graph partitioning for gang scheduling

For multi‑task distributed training (Horovod, NCCL), treat the problem as partitioning the node’s NVLink graph to satisfy multiple simultaneous gang jobs. Use heuristics like greedy maximal clique extraction or METIS-style partitioning when islands are large.

4) MIG-aware fractional allocations

When MIG is in use, reason in terms of MIG instances. Your device agent must maintain the mapping from MIG slices to their parent GPU's NVLink links and ensure that MIG slices scheduled together still achieve the desired locality.
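A sketch of that locality check, assuming the agent maintains two hypothetical maps: MIG slice to parent GPU, and GPU to island:

```go
package main

import "fmt"

// sameIslandMIG reports whether all requested MIG slices sit on parent GPUs
// that belong to a single NVLink island. sliceParent maps slice id -> GPU
// UUID; gpuIsland maps GPU UUID -> island id (both agent-maintained).
func sameIslandMIG(slices []string, sliceParent, gpuIsland map[string]string) bool {
	island := ""
	for _, s := range slices {
		parent, ok := sliceParent[s]
		if !ok {
			return false // unknown slice: refuse rather than guess
		}
		isl := gpuIsland[parent]
		if island == "" {
			island = isl
		} else if isl != island {
			return false // slices span two islands
		}
	}
	return island != ""
}

func main() {
	sliceParent := map[string]string{"mig0": "u0", "mig1": "u1"}
	gpuIsland := map[string]string{"u0": "g0", "u1": "g0"}
	fmt.Println(sameIslandMIG([]string{"mig0", "mig1"}, sliceParent, gpuIsland)) // true
}
```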

Observability and metrics — what to collect

To tune and verify scheduling behavior collect these metrics (use DCGM exporters + custom NVML metrics):

  • Per-NVLink link utilization (bytes/sec), errors, and utilization %
  • Per-GPU memory utilization and PCIe traffic
  • MIG slice mapping and utilization
  • Pod/Job-to-GPU mapping events (allocations, releases)
  • Scheduling outcomes and placement latency

Sample PromQL queries

# avg NVLink TX throughput per node (metric names depend on your DCGM
# exporter field configuration; DCGM_FI_PROF_* profiling fields shown here)
avg by(node) (rate(DCGM_FI_PROF_NVLINK_TX_BYTES{node="rv-node001"}[1m]))

# heuristic: flag jobs whose PCIe traffic approaches their NVLink traffic,
# suggesting cross-island (non-NVLink) communication
sum by(job) (rate(DCGM_FI_PROF_PCIE_TX_BYTES[1m])) > 0.8 * sum by(job) (rate(DCGM_FI_PROF_NVLINK_TX_BYTES[1m]))

Operational considerations and edge cases

  • Dynamic topology changes: some platforms may add/remove GPUs or change NVLink topology at runtime. Reconcile dynamically and provide graceful eviction or rescheduling policies for affected jobs.
  • Cross-node multi-GPU training: if a job spans nodes, prefer nodes with per-node islands that reduce inter-node communication, and use network topology (Infiniband/NCCL) information to place complementary parts of the job.
  • Failover and preemption: preemption of island-bound jobs is expensive. Prefer preemption policies that respect NVLink locality to avoid thrashing.
  • Security and permissions: Device discovery requires elevated privileges to call NVML or access /dev. Run node agents as DaemonSets with restricted capabilities and use signed images and RBAC appropriately.

Case study: island-aware placement on an 8-GPU node

Situation: a 4-GPU job was initially scheduled naïvely across GPUs {0,2,3,5}, where only two pairs shared NVLink links. Epoch time: 420s.

Action: we updated the Device Plugin to expose island groups {0,1,2,3} and {4,5,6,7}; the NVLink-aware scheduler plugin then filtered nodes and placed the job onto GPUs {0,1,2,3} in a single NVLink island.

Result: epoch time dropped to 190s (~2.2× speedup). DCGM metrics showed a 60% decrease in cross-GPU PCIe traffic and 35% more consistent link utilization, and the job avoided falling back to cross-node NCCL paths.

Multi-cloud & hybrid deployments: federated topology awareness

In multi-cloud environments, not all instances provide NVLink Fusion capability — you must federate topology metadata:

  • Have each cloud region publish node topology metadata to a central control plane (securely).
  • Implement policy layers that include cloud egress/transfer cost into placement scoring (e.g., favor intra-node NVLink over inter-region high‑cost moves).
  • Expose topology capability labels (e.g., topology.k8s.io/nvlink.v1=true) so users can select clusters with NVLink Fusion.
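A sketch of capability-plus-cost selection across federated clusters (the `Candidate` type and egress figures are illustrative, not an existing API):

```go
package main

import "fmt"

// Candidate describes a cluster the job could be placed in (illustrative).
type Candidate struct {
	Cluster   string
	HasNVLink bool    // does the cluster expose NVLink Fusion capability?
	EgressUSD float64 // $/GB to move the job's data to this cluster
}

// pickCluster prefers NVLink-capable clusters and, among those, the one
// with the lowest data-movement cost. Returns "" when none qualifies.
func pickCluster(cands []Candidate) string {
	best := -1
	for i, c := range cands {
		if !c.HasNVLink {
			continue
		}
		if best == -1 || c.EgressUSD < cands[best].EgressUSD {
			best = i
		}
	}
	if best == -1 {
		return ""
	}
	return cands[best].Cluster
}

func main() {
	cands := []Candidate{
		{"cloud-east", true, 0.09},
		{"onprem", true, 0.0},
		{"cloud-west", false, 0.0}, // no NVLink Fusion: skipped
	}
	fmt.Println(pickCluster(cands)) // onprem
}
```

In a real control plane this sits behind the policy layer, with egress prices and capability labels pulled from the federated metadata store.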

Testing and validation checklist

  • Unit test topology parsing (NVML mocks).
  • Integration test device allocation with real MIG and CUDA workloads.
  • End-to-end load tests with NCCL all‑reduce patterns to measure effective bandwidth and epoch times.
  • Chaos tests: remove a GPU or island mid-job and validate reschedule/eviction behavior.

Security and compliance

Ensure:

  • Node agents run with least privilege and only expose necessary endpoints.
  • Topology metadata is integrity‑protected (signing) before the scheduler uses it.
  • Audit logs for allocations and device access are retained to satisfy compliance.

Future-proofing and 2026+ predictions

As of 2026, NVLink Fusion is accelerating adoption across cloud and on-prem RISC‑V server platforms. Expect:

  • Increasing standardization around topology reporting APIs from vendors (NVML/extended hwloc) — design your plugin to be modular so adapters can be swapped.
  • Scheduler frameworks (both cloud-native and HPC) adding native NVLink primitives — your plugin should be ready to migrate from webhook/Device Plugin hacks to native primitives.
  • More heterogeneity: GPUs from different generations, asymmetric NVLink links, and disaggregated memory fabrics — scheduler policy must evolve to support weighted link models and memory coherency constraints.
Implementation roadmap

  1. Deploy a topology discovery DaemonSet using NVML/DCGM on each node and export JSON topology to a local endpoint.
  2. Implement a device plugin that exposes NVLink islands as resource units and maps allocations to concrete GPUs (support MIG).
  3. Develop a scheduler framework plugin (Filter + Score) for Kubernetes. For Slurm, define GRES and/or implement a select plugin.
  4. Instrument with DCGM and Prometheus; track NVLink and PCIe metrics to validate placement benefits.
  5. Create job-level policies (pack vs spread) and provide easy-to-use Pod annotations or sbatch flags.
  6. Run end-to-end benchmark suites (NCCL all-reduce, model training) and tune scoring weights for your workload mix.

Quick reference templates

Minimal node annotation for Kubernetes (an annotation rather than a label, since label values cannot contain JSON characters)

kubectl annotate node rv-node001 topology.k8s.io/nvlink.islands='{"g0":[0,1],"g1":[2,3]}'

Slurm submission requesting an NVLink island

sbatch --gres=gpu_group:g0:1 --cpus-per-task=16 --wrap="srun --gres=gpu:4 ./train.sh"

Summary and closing recommendations

To extract predictable performance from RISC‑V + GPU nodes using NVLink Fusion, your schedulers must be topology-aware. Build a layered solution: node discovery agents + device plugins + scheduler plugins (Kubernetes Scheduler Framework or Slurm select/GRES), combined with robust telemetry. Use island-first heuristics for latency-sensitive jobs and weighted-bandwidth scoring for mixed workloads. Prioritize instrumented validation using DCGM and design for multi-cloud heterogeneity.

Call to action

If you’re planning a pilot or need a reference implementation, start with the discovery daemon and device-plugin prototype. We maintain a sample repo with discovery agents, a Kubernetes scheduler plugin skeleton, and Slurm select plugin examples to accelerate your implementation. Reach out for an architecture review and a hands‑on workshop to adapt these patterns to your fleet.

