# Running Commands with GPU Reservation
The `run` command is the most common way to use canhazgpu. It reserves GPUs, runs your command with proper environment setup, and automatically cleans up when done.
## Basic Usage
The `--` separator is important: it tells canhazgpu where its options end and your command begins.
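For example, a minimal invocation looks like this (`train.py` is just a placeholder script):

```bash
# Reserve one GPU (the default) and run a command on it
canhazgpu run -- python train.py

# Everything after -- is passed to your command unchanged
canhazgpu run --gpus 2 -- python train.py --epochs 10
```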
## Options
- `--gpus, -g`: Number of GPUs to reserve (default: 1)
- `--gpu-ids`: Specific GPU IDs to reserve (comma-separated, e.g., `1,3,5`)
- `--timeout, -t`: Maximum time to run command before killing it (optional)
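Each option can also be used on its own; for example (script names are placeholders):

```bash
canhazgpu run --gpus 2 -- python train.py               # let canhazgpu pick any 2 free GPUs
canhazgpu run --gpu-ids 1,3 -- python train.py          # reserve GPUs 1 and 3 specifically
canhazgpu run --gpus 1 --timeout 4h -- python train.py  # kill the command after 4 hours
```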
### GPU Selection
- Use `--gpus` to let canhazgpu select GPUs using the LRU algorithm
- Use `--gpu-ids` when you need specific GPUs (e.g., for hardware requirements)
- You can use both options together if `--gpus` matches the GPU ID count or is 1 (default)
Timeout formats supported:

- `30s` (30 seconds)
- `30m` (30 minutes)
- `2h` (2 hours)
- `1d` (1 day)
- `0.5h` (30 minutes, written as a decimal value)
## Bash Completion
For tab completion to work with commands after `--`, make sure you have installed the canhazgpu bash completion script. See the Installation Guide for details.
## Common Examples

### Single GPU Training
```bash
# Reserve 1 GPU for a PyTorch training script
canhazgpu run --gpus 1 -- python train.py

# With additional arguments
canhazgpu run --gpus 1 -- python train.py --batch-size 32 --epochs 100 --lr 0.001

# Reserve specific GPUs by ID
canhazgpu run --gpu-ids 2,3 -- python train.py

# Reserve specific GPUs with matching count (allowed)
canhazgpu run --gpus 2 --gpu-ids 2,3 -- python train.py

# Reserve specific GPUs with default --gpus (allowed)
canhazgpu run --gpu-ids 1,3,5 -- python train.py  # --gpus defaults to 1
```
### Multi-GPU Training
```bash
# PyTorch Distributed Training
canhazgpu run --gpus 2 -- python -m torch.distributed.launch --nproc_per_node=2 train.py

# Horovod training
canhazgpu run --gpus 4 -- horovodrun -np 4 python train.py

# Custom multi-GPU script
canhazgpu run --gpus 2 -- python multi_gpu_train.py --world-size 2
```
### Inference and Serving
```bash
# vLLM model serving
canhazgpu run --gpus 1 -- vllm serve microsoft/DialoGPT-medium --port 8000

# TensorRT inference
canhazgpu run --gpus 1 -- python inference.py --model model.trt

# Jupyter notebook server
canhazgpu run --gpus 1 -- jupyter notebook --ip=0.0.0.0 --port=8888
```
## How It Works

When you run `canhazgpu run --gpus 2 -- python train.py`, here's what happens:
- GPU Validation: Uses `nvidia-smi` to check actual GPU usage
- Conflict Detection: Identifies GPUs in use without proper reservations
- Allocation: Reserves 2 GPUs using the LRU (Least Recently Used) strategy
- Environment Setup: Sets `CUDA_VISIBLE_DEVICES` to the allocated GPU IDs (e.g., `0,3`)
- Command Execution: Runs `python train.py` with the GPU environment
- Heartbeat: Maintains the reservation with periodic heartbeats while running
- Cleanup: Automatically releases GPUs when the command exits
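A quick way to see the environment setup in action is to run a command that only prints the variable; the IDs you get back depend on which GPUs happened to be free:

```bash
# Prints something like "0,3": the physical GPU IDs reserved for this run
canhazgpu run --gpus 2 -- bash -c 'echo "$CUDA_VISIBLE_DEVICES"'
```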
## Environment Variables

The `run` command automatically sets:

- `CUDA_VISIBLE_DEVICES`: Comma-separated list of allocated GPU IDs
- Your command sees only the reserved GPUs as GPU 0, 1, 2, etc.
Example: If GPUs 1 and 3 are allocated, `CUDA_VISIBLE_DEVICES=1,3` is set, and your PyTorch code will see them as `cuda:0` and `cuda:1`.
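You can confirm the remapping from inside a framework as well; assuming a working PyTorch/CUDA install, this one-liner reports two visible devices no matter which physical GPUs were allocated:

```bash
# PyTorch sees only the reserved GPUs, renumbered starting from 0
canhazgpu run --gpus 2 -- python -c "import torch; print(torch.cuda.device_count())"
```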
## Advanced Usage

### Long-Running Commands
```bash
# Training that might take days
canhazgpu run --gpus 4 -- python long_training.py --checkpoint-every 1000

# Background processing
nohup canhazgpu run --gpus 1 -- python process_data.py > output.log 2>&1 &
```
### Commands with Timeout
```bash
# Prevent runaway processes - kill after 2 hours
canhazgpu run --gpus 1 --timeout 2h -- python train.py

# Short timeout for testing
canhazgpu run --gpus 1 --timeout 30m -- python test_model.py

# Daily batch job with 12-hour limit
canhazgpu run --gpus 2 --timeout 12h -- python daily_processing.py
```
### Default Timeout Configuration

You can set a default timeout in your configuration file so you don't have to pass `--timeout` on every invocation.
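The exact file location and key name depend on your canhazgpu setup; the snippet below is only a hypothetical sketch (both the `~/.canhazgpu.yaml` path and the `run: timeout:` key are assumptions), so check the configuration documentation for the actual names:

```bash
# Hypothetical sketch only - the config path and key below are assumptions,
# not confirmed options; consult the canhazgpu configuration docs.
cat > ~/.canhazgpu.yaml << 'EOF'
run:
  timeout: 4h
EOF
```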
### Complex Commands
```bash
# Multiple commands in sequence
canhazgpu run --gpus 1 -- bash -c "python preprocess.py && python train.py && python evaluate.py"

# Commands with pipes and redirects
canhazgpu run --gpus 1 -- bash -c "python train.py 2>&1 | tee training.log"
```
### Resource-Intensive Applications
```bash
# High-memory workloads
canhazgpu run --gpus 2 -- python large_model_training.py --model-size xl

# Distributed computing frameworks
canhazgpu run --gpus 4 -- dask-worker --nthreads 1 --memory-limit 8GB
```
## Error Handling

### Insufficient GPUs
```bash
❯ canhazgpu run --gpus 3 -- python train.py
Error: Not enough GPUs available. Requested: 3, Available: 2 (1 GPUs in use without reservation - run 'canhazgpu status' for details)
```
When this happens:
1. Check `canhazgpu status` to see current allocations
2. Wait for other jobs to complete
3. Contact users with unreserved GPU usage
4. Reduce the number of requested GPUs
### Command Failures

If your command fails:

- GPUs are still automatically released
- Check the command syntax and file paths
- Verify your Python environment and dependencies
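If you script around canhazgpu, you can branch on the outcome; this sketch assumes the wrapped command's exit status is passed through to the caller (verify that behaviour for your version):

```bash
# Exit-status check around a wrapped command (assumes the status is propagated)
if ! canhazgpu run --gpus 1 -- python train.py; then
    echo "Training failed; the reserved GPU has already been released" >&2
fi
```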
### Allocation Failures
```bash
❯ canhazgpu run --gpus 1 -- python train.py
Error: Failed to acquire allocation lock after 5 attempts
```
This indicates high contention. Try again in a few seconds.
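If the lock error keeps appearing in scripts, a short retry loop is usually enough. Note that this naive sketch also retries when the wrapped command itself fails, so use it only where that is acceptable:

```bash
# Retry a few times with a short pause to ride out transient lock contention
for attempt in 1 2 3; do
    canhazgpu run --gpus 1 -- python train.py && break
    echo "Attempt ${attempt} failed, retrying in 10s..." >&2
    sleep 10
done
```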
## Best Practices

### Resource Planning
- Estimate GPU needs: Start with fewer GPUs and scale up if needed
- Monitor memory usage: Use `nvidia-smi` during training to optimize allocation (see the example after this list)
- Test with small datasets: Verify your code works before requesting many GPUs
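For the memory check, a focused `nvidia-smi` query is often easier to read than the full table:

```bash
# Per-GPU memory usage, refreshed every 5 seconds
watch -n 5 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
```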
### Command Structure
- Use absolute paths: Avoid relative paths that might not work in different contexts
- Handle signals properly: Ensure your code can be interrupted gracefully
- Save checkpoints frequently: In case of unexpected termination
### Error Recovery
```bash
# Save intermediate results
canhazgpu run --gpus 2 -- python train.py --save-every 100 --resume-from checkpoint.pth

# Use built-in timeout instead of system timeout
canhazgpu run --gpus 1 --timeout 1h -- python train.py
```
## Integration with Job Schedulers

### SLURM Integration
```bash
#!/bin/bash
#SBATCH --job-name=gpu_training
#SBATCH --nodes=1
#SBATCH --ntasks=1

# Use canhazgpu within SLURM job
canhazgpu run --gpus 2 -- python train.py
```
### Systemd Services
```ini
[Unit]
Description=GPU Training Service
After=network.target

[Service]
Type=simple
User=researcher
WorkingDirectory=/home/researcher/project
ExecStart=/usr/local/bin/canhazgpu run --gpus 1 -- python service.py
Restart=always

[Install]
WantedBy=multi-user.target
```
## Monitoring and Debugging

### Resource Usage
```bash
# Monitor GPU usage while job runs
watch -n 5 nvidia-smi

# Check heartbeat status
canhazgpu status  # Look for "last heartbeat" info
```
### Log Analysis
```bash
# Capture all output
canhazgpu run --gpus 1 -- python train.py 2>&1 | tee full_log.txt

# Monitor log in real-time
canhazgpu run --gpus 1 -- python train.py 2>&1 | tee training.log &
tail -f training.log
```
The `run` command provides a robust, automatic way to manage GPU reservations for your workloads while ensuring fair resource sharing across your team.