canhazgpu
A GPU reservation tool for single-host shared development systems
In shared development environments with multiple GPUs, researchers and developers often collide when using GPUs simultaneously, causing out-of-memory errors, failed training runs, and time wasted debugging resource contention. canhazgpu is a simple reservation system that coordinates GPU access across multiple users and processes on a single machine. It gives each job exclusive access to the GPUs it requests and automatically cleans up when jobs complete or crash, eliminating "GPU already in use" errors and making collaborative development efficient.
Quick Example
```bash
# Initialize the GPU pool
canhazgpu admin --gpus 8

# Check status
canhazgpu status

# Run a training job on 2 GPUs
canhazgpu run --gpus 2 -- python train.py

# Reserve a GPU manually for 4 hours
canhazgpu reserve --gpus 1 --duration 4h

# Release your manual reservations
canhazgpu release

# Release specific GPUs
canhazgpu release --gpu-ids 1,3

# View usage reports for the last 7 days
canhazgpu report --days 7

# Start the web dashboard
canhazgpu web --port 8080
```
Key Features
- ✅ Race condition protection: Uses Redis-based distributed locking (see the sketch after this list)
- ✅ Manual reservations: Reserve GPUs for specific durations
- ✅ Automatic cleanup: GPUs auto-released when processes end or reservations expire
- ✅ LRU allocation: Fair distribution using least recently used strategy
- ✅ Heartbeat monitoring: Detects crashed processes and reclaims GPUs
- ✅ Unreserved usage detection: Identifies GPUs in use without proper reservations
- ✅ User accountability: Shows which users are running unreserved processes
- ✅ Real-time validation: Uses nvidia-smi to verify actual GPU usage
- ✅ Smart allocation: Automatically excludes GPUs with unreserved usage from new allocations
- ✅ Usage reporting: Track and analyze GPU usage patterns over time
- ✅ Web dashboard: Real-time monitoring interface with status and reports
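The locking and heartbeat features follow a well-known Redis pattern: claim a GPU with an atomic `SET ... NX` that carries a TTL, then refresh the TTL while the owning process is alive. If the process crashes, the heartbeat stops, the key expires, and the GPU becomes reclaimable. The sketch below shows the general pattern with `redis-cli`; the key name and TTL are illustrative, not canhazgpu's actual schema.

```bash
# Claim GPU 0 atomically (illustrative key name and TTL, not canhazgpu's
# actual schema). SET ... NX succeeds only if the key does not already
# exist, so two users can never hold the same GPU at once.
redis-cli SET gpu:0:lock alice NX EX 30

# Heartbeat: the holder periodically refreshes the TTL. If the holding
# process crashes, the refresh stops and the key expires on its own.
redis-cli EXPIRE gpu:0:lock 30
```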
Status Display
```
❯ canhazgpu status
GPU  STATUS     USER     DURATION   TYPE    MODEL            DETAILS                 VALIDATION
---  ---------  -------  ---------  ------  ---------------  ----------------------  ----------------------
0    available                                               free for 30m            45MB used
1    in use     alice    15m 30s    run     llama-2-7b-chat  heartbeat 5s ago        8452MB, 1 processes
2    in use     bob                                          WITHOUT RESERVATION     1024MB used by PID 12345 (python3), PID 67890 (jupyter)
3    in use     charlie  1h 2m 15s  manual                   expires in 3h 15m 45s   no usage detected
```
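The VALIDATION column reflects what the GPUs are actually doing, not just what the reservation database says. You can reproduce the underlying check by hand with a standard `nvidia-smi` query (the exact fields canhazgpu inspects may differ):

```bash
# List every compute process per GPU: GPU UUID, owning PID, process name,
# and GPU memory in use. An equivalent query is how unreserved usage on
# GPU 2 above gets attributed to PIDs 12345 and 67890.
nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_memory --format=csv

# Map an offending PID back to a username for accountability.
ps -o user= -p 12345
```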
Getting Started
- Install dependencies - Redis server and Go (setup sketch after this list)
- Quick start guide - Get up and running in minutes
- Configuration - Set defaults and customize behavior
- Commands overview - Learn all available commands
- Administration setup - Configure for your environment
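For a rough idea of what the dependency setup looks like on a Debian/Ubuntu host (package names vary by distribution, and the build step assumes a checkout of the canhazgpu source):

```bash
# Install and start Redis plus the Go toolchain (Debian/Ubuntu package
# names shown; adjust for your distribution).
sudo apt-get install redis-server golang
sudo systemctl enable --now redis-server

# Build canhazgpu from a source checkout and put it on the PATH
# (paths are illustrative).
cd canhazgpu
go build -o canhazgpu .
sudo install -m 0755 canhazgpu /usr/local/bin/
```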
Use Cases
- ML/AI Research Teams: Coordinate GPU access across multiple researchers
- Shared Workstations: Prevent conflicts on multi-GPU development machines
- Training Pipelines: Ensure exclusive GPU access for long-running jobs
- Resource Monitoring: Track unreserved GPU usage and enforce policies
Documentation
User Guides
- Installation - Install dependencies
- Quick Start - Get started in minutes
- Configuration - Configure defaults and customize behavior
- Commands Overview - All available commands
Detailed Usage
- Running Jobs - GPU reservation with the run command
- Manual Reservations - Reserve GPUs manually
- Releasing GPUs - Release GPU reservations
- Status Monitoring - Monitor GPU usage and reservations
Key Features
- GPU Validation - Real-time usage validation
- Unreserved Detection - Find unauthorized GPU usage
- LRU Allocation - Fair GPU distribution strategy (see the sketch below)
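Least-recently-used allocation means that among free GPUs, the one idle longest is handed out first, so no card is starved or hammered. One natural way to implement this on top of Redis is a sorted set scored by release time; this is only a sketch of the pattern, with an illustrative key name rather than canhazgpu's actual schema.

```bash
# On release, score each GPU by the unix timestamp of when it was freed
# (key name illustrative, not canhazgpu's actual schema).
redis-cli ZADD gpu:last_released 1717000000 0
redis-cli ZADD gpu:last_released 1717003600 3

# On allocation, the lowest score is the GPU that has been idle longest.
redis-cli ZRANGE gpu:last_released 0 0
```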
Administration
- Installation Guide - Dependencies and installation
- Troubleshooting - Common issues and solutions
Development
- Architecture - System design overview
- Contributing - Contribution guidelines
- Testing - Testing procedures
- Release Process - Release management