Commands Overview
canhazgpu provides seven main commands for GPU management:
❯ canhazgpu --help
Usage: canhazgpu [OPTIONS] COMMAND [ARGS]...
Commands:
  admin    Initialize GPU pool for this machine
  release  Release manually reserved GPUs held by the current user
  report   Generate GPU usage reports
  reserve  Reserve GPUs manually for a specified duration
  run      Reserve GPUs and run a command with CUDA_VISIBLE_DEVICES set
  status   Show current GPU allocation status
  web      Start a web server for GPU status monitoring
Global Flags
All commands support these global configuration flags:
- --config: Path to configuration file (default: $HOME/.canhazgpu.yaml)
- --redis-host: Redis server hostname (default: localhost)
- --redis-port: Redis server port (default: 6379)
- --redis-db: Redis database number (default: 0)
- --memory-threshold: Memory threshold in MB to consider a GPU as "in use" (default: 1024)
Configuration Methods:
- Command-line flags (highest priority)
- Environment variables with the CANHAZGPU_ prefix
- Configuration files in YAML, JSON, or TOML format
- Built-in defaults (lowest priority)
Examples:
# Use command-line flags
canhazgpu status --redis-host redis.example.com --redis-port 6380
# Use environment variables
export CANHAZGPU_MEMORY_THRESHOLD=512
export CANHAZGPU_REDIS_HOST=redis.example.com
canhazgpu status
# Use a configuration file
canhazgpu --config /path/to/config.yaml status
canhazgpu --config config.json run --gpus 2 -- python train.py
Configuration File Examples:
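The snippets below are sketches that assume each config key mirrors its flag name (redis-host, redis-port, redis-db, memory-threshold); verify against your installed version if a key is not recognized.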
YAML format (~/.canhazgpu.yaml):
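# Sketch; keys assumed to mirror the flag names
redis-host: localhost
redis-port: 6379
redis-db: 0
memory-threshold: 1024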
JSON format:
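{
  "redis-host": "localhost",
  "redis-port": 6379,
  "redis-db": 0,
  "memory-threshold": 1024
}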
TOML format:
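# Sketch; keys assumed to mirror the flag names
redis-host = "localhost"
redis-port = 6379
redis-db = 0
memory-threshold = 1024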
The --memory-threshold flag controls when a GPU is considered "in use without reservation". GPUs using more than this amount of memory will be excluded from allocation and flagged as unreserved usage.
admin
Initialize and configure the GPU pool.
Options:
- --gpus: Number of GPUs available on this machine (required)
- --force: Force reinitialization even if already initialized
Examples:
# Initial setup
canhazgpu admin --gpus 8
# Change GPU count (requires --force)
canhazgpu admin --gpus 4 --force
Destructive Operation
Using --force will clear all existing reservations. Use with caution in production.
status
Show current GPU allocation status with automatic validation.
# Table output (default)
canhazgpu status
# JSON output for programmatic use
canhazgpu status --json
canhazgpu status -j
Options:
- -j, --json: Output status as JSON array instead of table format
Examples:
# Standard status check
canhazgpu status
# JSON output for scripts and APIs
canhazgpu status --json
# Use a lower threshold to detect lighter GPU usage
canhazgpu status --memory-threshold 512
# Use a higher threshold to ignore small allocations
canhazgpu status --memory-threshold 2048
# Combine JSON with memory threshold
canhazgpu status --json --memory-threshold 512
Global Memory Threshold
The --memory-threshold flag is a global option that affects GPU usage detection across all commands. It can be set in your configuration file or used with any command that performs GPU validation.
Table Output Example:
GPU  STATUS      USER     DURATION    TYPE    MODEL                               DETAILS                                                   VALIDATION
---  ------      ----     --------    ----    -----                               -------                                                   ----------
0    AVAILABLE   -        -           -       -                                   free for 0h 30m 15s                                       45MB used
1    IN_USE      alice    0h 15m 30s  RUN     meta-llama/Llama-2-7b-chat-hf       heartbeat 0h 0m 5s ago                                    8452MB, 1 processes
2    UNRESERVED  bob      -           -       codellama/CodeLlama-7b-Instruct-hf  1024MB used by PID 12345 (python3), PID 67890 (jupyter)   -
3    IN_USE      charlie  1h 2m 15s   MANUAL  -                                   expires in 3h 15m 45s                                     no usage detected
JSON Output Example:
[
  {
    "gpu_id": 0,
    "status": "AVAILABLE",
    "details": "free for 0h 30m 15s",
    "validation": "45MB used"
  },
  {
    "gpu_id": 1,
    "status": "IN_USE",
    "user": "alice",
    "duration": "0h 15m 30s",
    "type": "RUN",
    "details": "heartbeat 0h 0m 5s ago",
    "validation": "8452MB, 1 processes",
    "model": {
      "provider": "meta-llama",
      "model": "meta-llama/Llama-2-7b-chat-hf"
    }
  },
  {
    "gpu_id": 2,
    "status": "UNRESERVED",
    "details": "WITHOUT RESERVATION",
    "unreserved_users": ["bob"],
    "process_info": "1024MB used by PID 12345 (python3), PID 67890 (jupyter)",
    "model": {
      "provider": "codellama",
      "model": "codellama/CodeLlama-7b-Instruct-hf"
    }
  },
  {
    "gpu_id": 3,
    "status": "IN_USE",
    "user": "charlie",
    "duration": "1h 2m 15s",
    "type": "MANUAL",
    "details": "expires in 3h 15m 45s",
    "validation": "no usage detected"
  }
]
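The JSON output is convenient for scripting. For example, assuming jq is installed, the IDs of available GPUs can be extracted like this (with the sample data above, the result is [0]):
❯ canhazgpu status --json | jq '[.[] | select(.status == "AVAILABLE") | .gpu_id]'
[0]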
Status Types:
- AVAILABLE: GPU is free to use
- IN_USE: GPU is properly reserved
- UNRESERVED: GPU is being used without proper reservation
Table Columns:
- GPU: GPU ID number
- STATUS: Current state (AVAILABLE, IN_USE, UNRESERVED, ERROR)
- USER: Username who reserved the GPU (or who is using it unreserved)
- DURATION: How long the GPU has been reserved
- TYPE: Reservation type (RUN, MANUAL)
- MODEL: Detected AI model (if any)
- DETAILS: Additional information (heartbeat, expiry, process info)
- VALIDATION: Actual GPU usage validation (memory, process count)
run
Reserve GPUs and run a command with automatic cleanup.
Options:
- --gpus: Number of GPUs to reserve (default: 1)
- --gpu-ids: Specific GPU IDs to reserve (comma-separated, e.g., 1,3,5)
- --timeout: Maximum time the command may run before it is killed (default: none)
GPU Selection Options
You can use --gpus alone, --gpu-ids alone, or both together if:
- --gpus matches the number of GPU IDs specified, or
- --gpus is 1 (the default value)
If specific GPU IDs are requested and any are not available, the entire reservation will fail.
Timeout formats:
- 30s (30 seconds)
- 30m (30 minutes)
- 2h (2 hours)
- 1d (1 day)
- 0.5h (30 minutes; decimal values supported)
Examples:
# Single GPU training
canhazgpu run --gpus 1 -- python train.py
# Multi-GPU distributed training
canhazgpu run --gpus 2 -- python -m torch.distributed.launch train.py
# Reserve specific GPU IDs
canhazgpu run --gpu-ids 1,3 -- python train.py
# Complex command with arguments
canhazgpu run --gpus 1 -- python train.py --batch-size 32 --epochs 100
# Training with timeout to prevent runaway processes
canhazgpu run --gpus 1 --timeout 2h -- python train.py
# Short timeout for testing
canhazgpu run --gpus 1 --timeout 30m -- python test_model.py
Behavior:
1. Validates actual GPU availability using nvidia-smi
2. Excludes GPUs that are in use without reservation
3. Reserves the requested number of GPUs using MRU-per-user allocation (with LRU fallback)
4. Sets CUDA_VISIBLE_DEVICES to the allocated GPU IDs
5. Runs your command
6. Automatically releases GPUs when the command finishes
7. Maintains a heartbeat while running to keep the reservation active
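To see step 4 in action, print the variable from inside a managed command; the IDs shown below are illustrative and depend on which GPUs were allocated:
❯ canhazgpu run --gpus 2 -- python -c "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"
1,3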
Error Handling:
❯ canhazgpu run --gpus 2 -- python train.py
Error: Not enough GPUs available. Requested: 2, Available: 1 (1 GPUs in use without reservation - run 'canhazgpu status' for details)
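In shell scripts, allocation failures can be handled through the exit status. This is a sketch that assumes run exits non-zero when GPUs cannot be allocated (or when the wrapped command itself fails):
# Fall back gracefully if GPUs cannot be reserved
if ! canhazgpu run --gpus 2 -- python train.py; then
    echo "GPU allocation or training failed; see 'canhazgpu status'" >&2
fi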
reserve
Manually reserve GPUs for a specified duration.
Options:
- --gpus: Number of GPUs to reserve (default: 1)
- --gpu-ids: Specific GPU IDs to reserve (comma-separated, e.g., 1,3,5)
- --duration: Duration to reserve GPUs (default: 8h)
GPU Selection Options
You can use --gpus alone, --gpu-ids alone, or both together if:
- --gpus matches the number of GPU IDs specified, or
- --gpus is 1 (the default value)
If specific GPU IDs are requested and any are not available, the entire reservation will fail.
Duration Formats:
- 30m: 30 minutes
- 2h: 2 hours
- 1d: 1 day
- 0.5h: 30 minutes (decimal values supported)
Examples:
# Reserve 1 GPU for 8 hours (default)
canhazgpu reserve
# Reserve 2 GPUs for 4 hours
canhazgpu reserve --gpus 2 --duration 4h
# Reserve specific GPU IDs
canhazgpu reserve --gpu-ids 0,2 --duration 2h
# Reserve 1 GPU for 30 minutes
canhazgpu reserve --duration 30m
# Reserve 1 GPU for 2 days
canhazgpu reserve --gpus 1 --duration 2d
Important Note:
Unlike the run command, reserve does NOT automatically set CUDA_VISIBLE_DEVICES. You must manually set it based on the GPU IDs shown in the output.
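For example, a typical manual workflow might look like this; the exported IDs are illustrative and must match whatever your reserve output reports:
❯ canhazgpu reserve --gpus 2 --duration 4h
# Note the GPU IDs printed by reserve (suppose it reported 1 and 3)
❯ export CUDA_VISIBLE_DEVICES=1,3
❯ python train.py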
Use Cases:
- Interactive development sessions
- Jupyter notebook workflows
- Preparing for batch jobs
- Blocking GPUs for maintenance
release
Release manually reserved GPUs held by the current user.
Options:
- -G, --gpu-ids: Specific GPU IDs to release (comma-separated, e.g., 1,3,5)
Examples:
# Release all manually reserved GPUs
❯ canhazgpu release
Released 2 GPU(s): [1, 3]
# Release specific GPUs
❯ canhazgpu release --gpu-ids 1,3
Released 2 GPU(s): [1, 3]
# Release again when nothing is held
❯ canhazgpu release
No manually reserved GPUs found for current user
Scope
By default, release frees all manually reserved GPUs held by the current user. With --gpu-ids, it can release specific GPUs, including both manual reservations (from the reserve command) and run-type reservations (from the run command).
report
Generate GPU reservation reports showing historical reservation patterns by user.
Options:
- --days: Number of days to include in the report (default: 30)
Examples:
# Show reservations for the last 30 days (default)
canhazgpu report
# Show reservations for the last 7 days
canhazgpu report --days 7
# Show reservations for the last 24 hours
canhazgpu report --days 1
Example Output:
=== GPU Reservation Report ===
Period: 2025-05-31 to 2025-06-30 (30 days)
User                       GPU Hours      Percentage        Run     Manual
---------------------------------------------------------------------------
alice                          24.50          55.2%         12          8
bob                            15.25          34.4%          6         15
charlie                         4.60          10.4%          3          2
---------------------------------------------------------------------------
TOTAL                          44.35         100.0%         21         25
Total reservations: 46
Unique users: 3
Report Features:
- Shows GPU hours consumed by each user
- Percentage of total usage
- Breakdown by reservation type (run vs manual)
- Total statistics for the period
- Includes both completed and in-progress reservations
web
Start a web server providing a dashboard for real-time monitoring and reports.
Options:
- --port: Port to run the web server on (default: 8080)
- --host: Host to bind the web server to (default: 0.0.0.0)
Examples:
# Start web server on default port 8080
canhazgpu web
# Start on a custom port
canhazgpu web --port 3000
# Bind to localhost only
canhazgpu web --host 127.0.0.1 --port 8080
# Run on a specific interface
canhazgpu web --host 192.168.1.100 --port 8888

The dashboard displays:
- System hostname in the header for easy identification
- GPU cards showing status, user, duration, and validation info
- Color-coded status badges (green=available, blue=in use, red=unreserved)
- Reservation report with usage statistics and visual bars
- Quick links to documentation and GitHub repository
Dashboard Features:
- Real-time GPU Status: Automatically refreshes every 30 seconds
- Interactive Reservation Reports: Customizable time periods (1-90 days)
- Visual Design: Dark theme with color-coded status indicators
- Mobile Responsive: Works on desktop and mobile devices
- API Endpoints: 
  - /api/status - Current GPU status as JSON
  - /api/report?days=N - Usage report as JSON
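Both endpoints return JSON, so they can be queried directly from scripts; adjust the host and port to match your deployment:
❯ curl http://localhost:8080/api/status
❯ curl "http://localhost:8080/api/report?days=7"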
Use Cases:
- Team dashboards on shared displays
- Remote monitoring without SSH access
- Integration with monitoring systems via API
- Mobile access for on-the-go checks
Command Interactions
Validation and Conflicts
All allocation commands (run and reserve) automatically:
- Scan for unreserved usage using nvidia-smi
- Exclude unreserved GPUs from the available pool
- Provide detailed error messages if insufficient GPUs remain
MRU-per-User Allocation
When multiple GPUs are available, the system uses Most Recently Used per User allocation:
- Prioritizes GPUs that you have used most recently (based on your usage history)
- Falls back to global LRU (Least Recently Used) for GPUs you haven't used
- Provides GPU affinity for better cache locality and workflow continuity
- Ensures fair distribution across all users while respecting individual preferences
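For example, if your last run used GPU 3 and it is currently free, a new single-GPU request will be placed on GPU 3 again; if it is busy, the allocator falls back to the free GPU that has been idle the longest.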
Reservation Types
- Run-type reservations: Maintained by heartbeat, auto-released when process ends
- Manual reservations: Time-based expiry, require explicit release or timeout
Status Integration
The status command shows comprehensive information about all reservation types and validates actual usage against reservations, making it easy to identify and resolve conflicts.