canhazgpu
A GPU reservation tool for single-host shared development systems
In shared development environments with multiple GPUs, researchers and developers often collide when using GPUs simultaneously, causing out-of-memory errors, failed training runs, and time wasted debugging resource contention. canhazgpu is a simple reservation system that coordinates GPU access across multiple users and processes on a single machine. It gives each job exclusive access to the GPUs it requests and automatically cleans up when jobs complete or crash, eliminating "GPU already in use" errors and making collaborative development efficient.
Quick Example
```bash
# Initialize the GPU pool
canhazgpu admin --gpus 8

# Check status
canhazgpu status

# Run a training job on 2 GPUs
canhazgpu run --gpus 2 -- python train.py

# Reserve a GPU manually for 4 hours
canhazgpu reserve --gpus 1 --duration 4h

# Release your manual reservations
canhazgpu release

# Release specific GPUs
canhazgpu release --gpu-ids 1,3

# View usage reports for the last 7 days
canhazgpu report --days 7

# Start the web dashboard
canhazgpu web --port 8080
```
Key Features
- ✅ Race condition protection: Uses Redis-based distributed locking (see the sketch after this list)
- ✅ Manual reservations: Reserve GPUs for specific durations
- ✅ Automatic cleanup: GPUs auto-released when processes end or reservations expire
- ✅ LRU allocation: Fair distribution using least recently used strategy
- ✅ Heartbeat monitoring: Detects crashed processes and reclaims GPUs
- ✅ Unreserved usage detection: Identifies GPUs in use without proper reservations
- ✅ User accountability: Shows which users are running unreserved processes
- ✅ Real-time validation: Uses nvidia-smi to verify actual GPU usage
- ✅ Smart allocation: Automatically excludes GPUs with unreserved usage from new allocations
- ✅ Usage reporting: Track and analyze GPU usage patterns over time
- ✅ Web dashboard: Real-time monitoring interface with status and reports
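The locking and heartbeat features follow a well-known Redis pattern: claim a GPU with an atomic `SET ... NX` that carries a TTL, then refresh the TTL while the owning process is alive. If the process crashes, the heartbeat stops, the key expires, and the GPU becomes reclaimable. The sketch below shows the general pattern with `redis-cli`; the key name and TTL are illustrative, not canhazgpu's actual schema.

```bash
# Claim GPU 0 atomically (illustrative key name and TTL, not canhazgpu's
# actual schema). SET ... NX succeeds only if the key does not already
# exist, so two users can never hold the same GPU at once.
redis-cli SET gpu:0:lock alice NX EX 30

# Heartbeat: the holder periodically refreshes the TTL. If the holding
# process crashes, the refresh stops and the key expires on its own.
redis-cli EXPIRE gpu:0:lock 30
```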
Status Display
```
❯ canhazgpu status
GPU  STATUS     USER     DURATION   TYPE    MODEL            DETAILS                 VALIDATION
---  ---------  -------  ---------  ------  ---------------  ----------------------  ----------------------
0    available                                               free for 30m            45MB used
1    in use     alice    15m 30s    run     llama-2-7b-chat  heartbeat 5s ago        8452MB, 1 processes
2    in use     bob                                          WITHOUT RESERVATION     1024MB used by PID 12345 (python3), PID 67890 (jupyter)
3    in use     charlie  1h 2m 15s  manual                   expires in 3h 15m 45s   no usage detected
```
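The VALIDATION column reflects what the GPUs are actually doing, not just what the reservation database says. You can reproduce the underlying check by hand with a standard `nvidia-smi` query (the exact fields canhazgpu inspects may differ):

```bash
# List every compute process per GPU: GPU UUID, owning PID, process name,
# and GPU memory in use. An equivalent query is how unreserved usage on
# GPU 2 above gets attributed to PIDs 12345 and 67890.
nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_memory --format=csv

# Map an offending PID back to a username for accountability.
ps -o user= -p 12345
```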
Getting Started
- Install dependencies - Redis server and Go (setup sketch after this list)
- Quick start guide - Get up and running in minutes
- Configuration - Set defaults and customize behavior
- Commands overview - Learn all available commands
- Administration setup - Configure for your environment
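For a rough idea of what the dependency setup looks like on a Debian/Ubuntu host (package names vary by distribution, and the build step assumes a checkout of the canhazgpu source):

```bash
# Install and start Redis plus the Go toolchain (Debian/Ubuntu package
# names shown; adjust for your distribution).
sudo apt-get install redis-server golang
sudo systemctl enable --now redis-server

# Build canhazgpu from a source checkout and put it on the PATH
# (paths are illustrative).
cd canhazgpu
go build -o canhazgpu .
sudo install -m 0755 canhazgpu /usr/local/bin/
```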
Use Cases
- ML/AI Research Teams: Coordinate GPU access across multiple researchers
- Shared Workstations: Prevent conflicts on multi-GPU development machines
- Training Pipelines: Ensure exclusive GPU access for long-running jobs
- Resource Monitoring: Track unreserved GPU usage and enforce policies
Documentation
User Guides
- Installation - Install dependencies
- Quick Start - Get started in minutes
- Configuration - Configure defaults and customize behavior
- Commands Overview - All available commands
Detailed Usage
- Running Jobs - GPU reservation with the run command
- Manual Reservations - Reserve GPUs manually
- Releasing GPUs - Release GPU reservations
- Status Monitoring - Monitor GPU usage and reservations
Key Features
- GPU Validation - Real-time usage validation
- Unreserved Detection - Find unauthorized GPU usage
- LRU Allocation - Fair GPU distribution strategy (see the sketch below)
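Least-recently-used allocation means that among free GPUs, the one idle longest is handed out first, so no card is starved or hammered. One natural way to implement this on top of Redis is a sorted set scored by release time; this is only a sketch of the pattern, with an illustrative key name rather than canhazgpu's actual schema.

```bash
# On release, score each GPU by the unix timestamp of when it was freed
# (key name illustrative, not canhazgpu's actual schema).
redis-cli ZADD gpu:last_released 1717000000 0
redis-cli ZADD gpu:last_released 1717003600 3

# On allocation, the lowest score is the GPU that has been idle longest.
redis-cli ZRANGE gpu:last_released 0 0
```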
Administration
- Installation Guide - Dependencies and installation
- Troubleshooting - Common issues and solutions
Development
- Architecture - System design overview
- Contributing - Contribution guidelines
- Testing - Testing procedures
- Release Process - Release management