Running Commands with GPU Reservation

The run command is the most common way to use canhazgpu. It reserves GPUs, runs your command with proper environment setup, and automatically cleans up when done.

Basic Usage

canhazgpu run [--gpus <count> | --gpu-ids <ids>] [--timeout <duration>] -- <command>

The -- separator is important: it tells canhazgpu where its own options end and your command begins.

Options

  • --gpus, -g: Number of GPUs to reserve (default: 1)
  • --gpu-ids: Specific GPU IDs to reserve (comma-separated, e.g., 1,3,5)
  • --timeout, -t: Maximum time to run command before killing it (optional)

GPU Selection

  • Use --gpus to let canhazgpu select GPUs using the LRU algorithm
  • Use --gpu-ids when you need specific GPUs (e.g., for hardware requirements)
  • You can combine both options only when --gpus equals the number of IDs in --gpu-ids, or is left at its default of 1

Timeout formats supported:

  • 30s (30 seconds)
  • 30m (30 minutes)
  • 2h (2 hours)
  • 1d (1 day)
  • 0.5h (30 minutes, using a decimal value)
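
These options can be combined as long as the GPU count rule above is satisfied. For illustration, a request for two specific GPUs with a four-hour limit might look like this (the GPU IDs and duration are example values):

# Reserve GPUs 0 and 1 for at most 4 hours
canhazgpu run --gpu-ids 0,1 --timeout 4h -- python train.py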

Bash Completion

For tab completion to work with commands after --, make sure you have installed the canhazgpu bash completion script. See the Installation Guide for details.

Common Examples

Single GPU Training

# Reserve 1 GPU for a PyTorch training script
canhazgpu run --gpus 1 -- python train.py

# With additional arguments
canhazgpu run --gpus 1 -- python train.py --batch-size 32 --epochs 100 --lr 0.001

# Reserve specific GPUs by ID
canhazgpu run --gpu-ids 2,3 -- python train.py

# Reserve specific GPUs with matching count (allowed)
canhazgpu run --gpus 2 --gpu-ids 2,3 -- python train.py

# Reserve specific GPUs with default --gpus (allowed)
canhazgpu run --gpu-ids 1,3,5 -- python train.py  # --gpus defaults to 1

Multi-GPU Training

# PyTorch Distributed Training
canhazgpu run --gpus 2 -- python -m torch.distributed.launch --nproc_per_node=2 train.py

# Horovod training
canhazgpu run --gpus 4 -- horovodrun -np 4 python train.py

# Custom multi-GPU script
canhazgpu run --gpus 2 -- python multi_gpu_train.py --world-size 2

Inference and Serving

# vLLM model serving
canhazgpu run --gpus 1 -- vllm serve microsoft/DialoGPT-medium --port 8000

# TensorRT inference
canhazgpu run --gpus 1 -- python inference.py --model model.trt

# Jupyter notebook server
canhazgpu run --gpus 1 -- jupyter notebook --ip=0.0.0.0 --port=8888

How It Works

When you run canhazgpu run --gpus 2 -- python train.py, here's what happens:

  1. GPU Validation: Uses nvidia-smi to check actual GPU usage
  2. Conflict Detection: Identifies GPUs in use without proper reservations
  3. Allocation: Reserves 2 GPUs using LRU (Least Recently Used) strategy
  4. Environment Setup: Sets CUDA_VISIBLE_DEVICES to the allocated GPU IDs (e.g., "0,3")
  5. Command Execution: Runs python train.py with the GPU environment
  6. Heartbeat: Maintains reservation with periodic heartbeats while running
  7. Cleanup: Automatically releases GPUs when the command exits
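
From your command's point of view, the effect of steps 4-5 is roughly equivalent to the sketch below; the validation, allocation, heartbeat, and release are all handled by canhazgpu itself (the GPU IDs shown are illustrative):

# What the allocated environment looks like to your process
export CUDA_VISIBLE_DEVICES=0,3   # IDs chosen by the LRU allocator
python train.py                   # sees only GPUs 0 and 3, renumbered as cuda:0 and cuda:1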

Environment Variables

The run command automatically sets:

  • CUDA_VISIBLE_DEVICES: Comma-separated list of allocated GPU IDs
  • Your command sees only the reserved GPUs as GPU 0, 1, 2, etc.

Example: If GPUs 1 and 3 are allocated, CUDA_VISIBLE_DEVICES=1,3 is set, and your PyTorch code will see them as cuda:0 and cuda:1.
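
You can confirm the remapping from inside a reservation; the checks below are illustrative (the printed values depend on which GPUs the LRU allocator picks):

# Show the environment variable the command actually receives
canhazgpu run --gpus 2 -- bash -c 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'

# Confirm that PyTorch sees the reserved GPUs renumbered from 0
canhazgpu run --gpus 2 -- python -c "import torch; print(torch.cuda.device_count(), torch.cuda.current_device())"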

Advanced Usage

Long-Running Commands

# Training that might take days
canhazgpu run --gpus 4 -- python long_training.py --checkpoint-every 1000

# Background processing
nohup canhazgpu run --gpus 1 -- python process_data.py > output.log 2>&1 &

Commands with Timeout

# Prevent runaway processes - kill after 2 hours
canhazgpu run --gpus 1 --timeout 2h -- python train.py

# Short timeout for testing
canhazgpu run --gpus 1 --timeout 30m -- python test_model.py

# Daily batch job with 12-hour limit
canhazgpu run --gpus 2 --timeout 12h -- python daily_processing.py

Default Timeout Configuration

You can set a default timeout in your configuration file to avoid specifying it every time:

# ~/.canhazgpu.yaml
run:
  timeout: "2h"  # Default 2-hour timeout for all run commands

Complex Commands

# Multiple commands in sequence
canhazgpu run --gpus 1 -- bash -c "python preprocess.py && python train.py && python evaluate.py"

# Commands with pipes and redirects
canhazgpu run --gpus 1 -- bash -c "python train.py 2>&1 | tee training.log"

Resource-Intensive Applications

# High-memory workloads
canhazgpu run --gpus 2 -- python large_model_training.py --model-size xl

# Distributed computing frameworks
canhazgpu run --gpus 4 -- dask-worker --nthreads 1 --memory-limit 8GB

Error Handling

Insufficient GPUs

$ canhazgpu run --gpus 3 -- python train.py
Error: Not enough GPUs available. Requested: 3, Available: 2 (1 GPUs in use without reservation - run 'canhazgpu status' for details)

When this happens:

  1. Check canhazgpu status to see current allocations
  2. Wait for other jobs to complete
  3. Contact users with unreserved GPU usage
  4. Reduce the number of requested GPUs
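
A typical recovery sequence combines the first and last of these steps:

# Inspect current allocations, then retry with a smaller request
canhazgpu status
canhazgpu run --gpus 2 -- python train.py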

Command Failures

$ canhazgpu run --gpus 1 -- python nonexistent.py
Error: Command failed with exit code 1

If your command fails:

  • GPUs are still automatically released
  • Check the command syntax and file paths
  • Verify your Python environment and dependencies
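
If a calling script needs to branch on the result, a minimal sketch (assuming the command's exit status is passed through to the shell) looks like this:

# Check the exit status once the GPUs have been released
canhazgpu run --gpus 1 -- python train.py
echo "training exited with status $?"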

Allocation Failures

$ canhazgpu run --gpus 1 -- python train.py
Error: Failed to acquire allocation lock after 5 attempts

This indicates high contention. Try again in a few seconds.
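
For unattended scripts, a small retry loop is usually sufficient. A minimal sketch (the retry count and delay are arbitrary, and it retries on any non-zero exit, not just lock contention):

# Retry up to 5 times, pausing 10 seconds between attempts
for attempt in 1 2 3 4 5; do
    canhazgpu run --gpus 1 -- python train.py && break
    echo "attempt $attempt failed; retrying in 10s" >&2
    sleep 10
done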

Best Practices

Resource Planning

  • Estimate GPU needs: Start with fewer GPUs and scale up if needed
  • Monitor memory usage: Use nvidia-smi during training to optimize allocation (see the query example after this list)
  • Test with small datasets: Verify your code works before requesting many GPUs
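
For the memory-usage check, a periodic nvidia-smi query is often easier to read than the full dashboard:

# Report per-GPU memory usage every 5 seconds while your job runs
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5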

Command Structure

  • Use absolute paths: Avoid relative paths that might not work in different contexts
  • Handle signals properly: Ensure your code can be interrupted gracefully
  • Save checkpoints frequently: In case of unexpected termination

Error Recovery

# Save intermediate results
canhazgpu run --gpus 2 -- python train.py --save-every 100 --resume-from checkpoint.pth

# Prefer the built-in --timeout over wrapping the command with the system timeout utility
canhazgpu run --gpus 1 --timeout 1h -- python train.py

Integration with Job Schedulers

SLURM Integration

#!/bin/bash
#SBATCH --job-name=gpu_training
#SBATCH --nodes=1
#SBATCH --ntasks=1

# Use canhazgpu within SLURM job
canhazgpu run --gpus 2 -- python train.py

Systemd Services

[Unit]
Description=GPU Training Service
After=network.target

[Service]
Type=simple
User=researcher
WorkingDirectory=/home/researcher/project
ExecStart=/usr/local/bin/canhazgpu run --gpus 1 -- python service.py
Restart=always

[Install]
WantedBy=multi-user.target
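
To activate a unit like this, the standard systemd workflow applies (the file name gpu-training.service is just an example):

# Install the unit, reload systemd, and start it now and at boot
sudo cp gpu-training.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now gpu-training.service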

Monitoring and Debugging

Resource Usage

# Monitor GPU usage while job runs
watch -n 5 nvidia-smi

# Check heartbeat status
canhazgpu status  # Look for "last heartbeat" info

Log Analysis

# Capture all output
canhazgpu run --gpus 1 -- python train.py 2>&1 | tee full_log.txt

# Monitor log in real-time
canhazgpu run --gpus 1 -- python train.py 2>&1 | tee training.log &
tail -f training.log

The run command provides a robust, automatic way to manage GPU reservations for your workloads while ensuring fair resource sharing across your team.