Commands Overview
canhazgpu provides seven main commands for GPU management:
❯ canhazgpu --help
Usage: canhazgpu [OPTIONS] COMMAND [ARGS]...
Commands:
admin Initialize GPU pool for this machine
release Release manually reserved GPUs held by the current user
report Generate GPU usage reports
reserve Reserve GPUs manually for a specified duration
run Reserve GPUs and run a command with CUDA_VISIBLE_DEVICES set
status Show current GPU allocation status
web Start a web server for GPU status monitoring
Global Flags
All commands support these global configuration flags:
- --config: Path to configuration file (default: $HOME/.canhazgpu.yaml)
- --redis-host: Redis server hostname (default: localhost)
- --redis-port: Redis server port (default: 6379)
- --redis-db: Redis database number (default: 0)
- --memory-threshold: Memory threshold in MB to consider a GPU as "in use" (default: 1024)
Configuration Methods:
- Command-line flags (highest priority)
- Environment variables with CANHAZGPU_ prefix
- Configuration files in YAML, JSON, or TOML format
- Built-in defaults (lowest priority)
Examples:
# Use command-line flags
canhazgpu status --redis-host redis.example.com --redis-port 6380
# Use environment variables
export CANHAZGPU_MEMORY_THRESHOLD=512
export CANHAZGPU_REDIS_HOST=redis.example.com
canhazgpu status
# Use a configuration file
canhazgpu --config /path/to/config.yaml status
canhazgpu --config config.json run --gpus 2 -- python train.py
Configuration File Examples:
The same settings can be written in YAML (the default, ~/.canhazgpu.yaml), JSON, or TOML.
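As a minimal sketch (the key names below are assumed to mirror the global flag names; adjust them to your setup), a YAML configuration file could be created like this:

# Create a minimal YAML config (keys assumed to match the flag names)
cat > ~/.canhazgpu.yaml <<'EOF'
redis-host: redis.example.com
redis-port: 6379
redis-db: 0
memory-threshold: 512
EOF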
The --memory-threshold flag controls when a GPU is considered "in use without reservation". GPUs using more than this amount of memory will be excluded from allocation and flagged as unreserved usage.
admin
Initialize and configure the GPU pool.
Options:
- --gpus: Number of GPUs available on this machine (required)
- --force: Force reinitialization even if already initialized
Examples:
# Initial setup
canhazgpu admin --gpus 8
# Change GPU count (requires --force)
canhazgpu admin --gpus 4 --force
Destructive Operation
Using --force will clear all existing reservations. Use with caution in production.
status
Show current GPU allocation status with automatic validation.
# Table output (default)
canhazgpu status
# JSON output for programmatic use
canhazgpu status --json
canhazgpu status -j
Options:
- -j, --json: Output status as JSON array instead of table format
Examples:
# Standard status check
canhazgpu status
# JSON output for scripts and APIs
canhazgpu status --json
# Use a lower threshold to detect lighter GPU usage
canhazgpu status --memory-threshold 512
# Use a higher threshold to ignore small allocations
canhazgpu status --memory-threshold 2048
# Combine JSON with memory threshold
canhazgpu status --json --memory-threshold 512
Global Memory Threshold
The --memory-threshold flag is a global option that affects GPU usage detection across all commands. It can be set in your configuration file or used with any command that performs GPU validation.
Table Output Example:
GPU STATUS USER DURATION TYPE MODEL DETAILS VALIDATION
--- ------ ---- -------- ---- ----- ------- ----------
0 AVAILABLE - - - - free for 0h 30m 15s 45MB used
1 IN_USE alice 0h 15m 30s RUN meta-llama/Llama-2-7b-chat-hf heartbeat 0h 0m 5s ago 8452MB, 1 processes
2 UNRESERVED user bob - - codellama/CodeLlama-7b-Instruct-hf 1024MB used by PID 12345 (python3), PID 67890 (jupyter) -
3 IN_USE charlie 1h 2m 15s MANUAL - expires in 3h 15m 45s no usage detected
JSON Output Example:
[
{
"gpu_id": 0,
"status": "AVAILABLE",
"details": "free for 0h 30m 15s",
"validation": "45MB used"
},
{
"gpu_id": 1,
"status": "IN_USE",
"user": "alice",
"duration": "0h 15m 30s",
"type": "RUN",
"details": "heartbeat 0h 0m 5s ago",
"validation": "8452MB, 1 processes",
"model": {
"provider": "meta-llama",
"model": "meta-llama/Llama-2-7b-chat-hf"
}
},
{
"gpu_id": 2,
"status": "UNRESERVED",
"details": "WITHOUT RESERVATION",
"unreserved_users": ["bob"],
"process_info": "1024MB used by PID 12345 (python3), PID 67890 (jupyter)",
"model": {
"provider": "codellama",
"model": "codellama/CodeLlama-7b-Instruct-hf"
}
},
{
"gpu_id": 3,
"status": "IN_USE",
"user": "charlie",
"duration": "1h 2m 15s",
"type": "MANUAL",
"details": "expires in 3h 15m 45s",
"validation": "no usage detected"
}
]
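The JSON output is easy to consume from scripts. As a sketch (assuming jq is installed), you could list the IDs of the GPUs that are currently free:

# Print the IDs of all AVAILABLE GPUs (requires jq)
canhazgpu status --json | jq '[.[] | select(.status == "AVAILABLE") | .gpu_id]'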
Status Types:
- AVAILABLE: GPU is free to use
- IN_USE: GPU is properly reserved
- UNRESERVED: GPU is being used without proper reservation
Table Columns:
- GPU: GPU ID number
- STATUS: Current state (AVAILABLE, IN_USE, UNRESERVED, ERROR)
- USER: Username who reserved the GPU (or who is using it unreserved)
- DURATION: How long the GPU has been reserved
- TYPE: Reservation type (RUN, MANUAL)
- MODEL: Detected AI model (if any)
- DETAILS: Additional information (heartbeat, expiry, process info)
- VALIDATION: Actual GPU usage validation (memory, process count)
run
Reserve GPUs and run a command with automatic cleanup.
Options:
- --gpus: Number of GPUs to reserve (default: 1)
- --gpu-ids: Specific GPU IDs to reserve (comma-separated, e.g., 1,3,5)
- --timeout: Maximum time to run command before killing it (default: none)
GPU Selection Options
You can use --gpus alone, --gpu-ids alone, or both together if:
- --gpus matches the number of GPU IDs specified, or
- --gpus is 1 (the default value)
If specific GPU IDs are requested and any are not available, the entire reservation will fail.
Timeout formats:
- 30s (30 seconds)
- 30m (30 minutes)
- 2h (2 hours)
- 1d (1 day)
- 0.5h (30 minutes with decimal)
Examples:
# Single GPU training
canhazgpu run --gpus 1 -- python train.py
# Multi-GPU distributed training
canhazgpu run --gpus 2 -- python -m torch.distributed.launch train.py
# Reserve specific GPU IDs
canhazgpu run --gpu-ids 1,3 -- python train.py
# Complex command with arguments
canhazgpu run --gpus 1 -- python train.py --batch-size 32 --epochs 100
# Training with timeout to prevent runaway processes
canhazgpu run --gpus 1 --timeout 2h -- python train.py
# Short timeout for testing
canhazgpu run --gpus 1 --timeout 30m -- python test_model.py
Behavior:
1. Validates actual GPU availability using nvidia-smi
2. Excludes GPUs that are in use without reservation
3. Reserves the requested number of GPUs using LRU allocation
4. Sets CUDA_VISIBLE_DEVICES to the allocated GPU IDs
5. Runs your command
6. Automatically releases GPUs when the command finishes
7. Maintains a heartbeat while running to keep the reservation active
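A quick way to see this behavior (a sketch, not a required workflow) is to echo the variable from inside the managed command:

# The child process only sees the GPUs that were allocated for it
canhazgpu run --gpus 2 -- bash -c 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'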
Error Handling:
❯ canhazgpu run --gpus 2 -- python train.py
Error: Not enough GPUs available. Requested: 2, Available: 1 (1 GPUs in use without reservation - run 'canhazgpu status' for details)
reserve
Manually reserve GPUs for a specified duration.
Options:
- --gpus: Number of GPUs to reserve (default: 1)
- --gpu-ids: Specific GPU IDs to reserve (comma-separated, e.g., 1,3,5)
- --duration: Duration to reserve GPUs (default: 8h)
GPU Selection Options
You can use --gpus alone, --gpu-ids alone, or both together if:
- --gpus matches the number of GPU IDs specified, or
- --gpus is 1 (the default value)
If specific GPU IDs are requested and any are not available, the entire reservation will fail.
Duration Formats:
- 30m: 30 minutes
- 2h: 2 hours
- 1d: 1 day
- 0.5h: 30 minutes (decimal values supported)
Examples:
# Reserve 1 GPU for 8 hours (default)
canhazgpu reserve
# Reserve 2 GPUs for 4 hours
canhazgpu reserve --gpus 2 --duration 4h
# Reserve specific GPU IDs
canhazgpu reserve --gpu-ids 0,2 --duration 2h
# Reserve 1 GPU for 30 minutes
canhazgpu reserve --duration 30m
# Reserve 1 GPU for 2 days
canhazgpu reserve --gpus 1 --duration 2d
Important Note:
Unlike the run command, reserve does NOT automatically set CUDA_VISIBLE_DEVICES. You must manually set it based on the GPU IDs shown in the output.
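For example, if reserve reports that GPUs 0 and 2 were allocated, you would set the variable yourself before launching your workload (the GPU IDs below are illustrative):

# Set CUDA_VISIBLE_DEVICES manually using the IDs printed by canhazgpu reserve
export CUDA_VISIBLE_DEVICES=0,2
python train.py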
Use Cases:
- Interactive development sessions
- Jupyter notebook workflows
- Preparing for batch jobs
- Blocking GPUs for maintenance
release
Release manually reserved GPUs held by the current user.
Options:
- -G, --gpu-ids: Specific GPU IDs to release (comma-separated, e.g., 1,3,5)
Examples:
# Release all manually reserved GPUs
❯ canhazgpu release
Released 2 GPU(s): [1, 3]
# Release specific GPUs
❯ canhazgpu release --gpu-ids 1,3
Released 2 GPU(s): [1, 3]
❯ canhazgpu release
No manually reserved GPUs found for current user
Scope
By default, release frees all of the current user's manually reserved GPUs. With --gpu-ids, it can release specific GPUs, including both manual reservations (from the reserve command) and run-type reservations (from the run command).
report
Generate GPU reservation reports showing historical reservation patterns by user.
Options:
- --days: Number of days to include in the report (default: 30)
Examples:
# Show reservations for the last 30 days (default)
canhazgpu report
# Show reservations for the last 7 days
canhazgpu report --days 7
# Show reservations for the last 24 hours
canhazgpu report --days 1
Example Output:
=== GPU Reservation Report ===
Period: 2025-05-31 to 2025-06-30 (30 days)
User GPU Hours Percentage Run Manual
---------------------------------------------------------------------------
alice 24.50 55.2% 12 8
bob 15.25 34.4% 6 15
charlie 4.60 10.4% 3 2
---------------------------------------------------------------------------
TOTAL 44.35 100.0% 21 25
Total reservations: 46
Unique users: 3
Report Features:
- Shows GPU hours consumed by each user
- Percentage of total usage
- Breakdown by reservation type (run vs manual)
- Total statistics for the period
- Includes both completed and in-progress reservations
web
Start a web server providing a dashboard for real-time monitoring and reports.
Options:
- --port: Port to run the web server on (default: 8080)
- --host: Host to bind the web server to (default: 0.0.0.0)
Examples:
# Start web server on default port 8080
canhazgpu web
# Start on a custom port
canhazgpu web --port 3000
# Bind to localhost only
canhazgpu web --host 127.0.0.1 --port 8080
# Run on a specific interface
canhazgpu web --host 192.168.1.100 --port 8888
The dashboard displays:
- System hostname in the header for easy identification
- GPU cards showing status, user, duration, and validation info
- Color-coded status badges (green=available, blue=in use, red=unreserved)
- Reservation report with usage statistics and visual bars
- Quick links to documentation and GitHub repository
Dashboard Features:
- Real-time GPU Status: Automatically refreshes every 30 seconds
- Interactive Reservation Reports: Customizable time periods (1-90 days)
- Visual Design: Dark theme with color-coded status indicators
- Mobile Responsive: Works on desktop and mobile devices
- API Endpoints:
  - /api/status - Current GPU status as JSON
  - /api/report?days=N - Usage report as JSON
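These endpoints can be polled from scripts or monitoring systems. A sketch, assuming the server is running on the default host and port:

# Fetch current GPU status and a 7-day usage report as JSON
curl http://localhost:8080/api/status
curl "http://localhost:8080/api/report?days=7"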
Use Cases:
- Team dashboards on shared displays
- Remote monitoring without SSH access
- Integration with monitoring systems via API
- Mobile access for on-the-go checks
Command Interactions
Validation and Conflicts
All allocation commands (run and reserve) automatically:
- Scan for unreserved usage using nvidia-smi
- Exclude unreserved GPUs from the available pool
- Provide detailed error messages if insufficient GPUs remain
LRU Allocation
When multiple GPUs are available, the system uses Least Recently Used allocation:
- GPUs that were released longest ago are allocated first
- Ensures fair distribution of GPU usage over time
- Helps with thermal management and hardware wear leveling
Reservation Types
- Run-type reservations: Maintained by heartbeat, auto-released when process ends
- Manual reservations: Time-based expiry, require explicit release or timeout
Status Integration
The status command shows comprehensive information about all reservation types and validates actual usage against reservations, making it easy to identify and resolve conflicts.