Status Monitoring
The status command provides real-time visibility into GPU allocation and usage across your system. It combines reservation tracking with actual GPU usage validation to give you a complete picture of resource utilization.
Basic Usage
# Table output (default)
canhazgpu status
# JSON output for programmatic use
canhazgpu status --json
canhazgpu status -j
No options are required for basic usage - the command automatically validates all GPUs and shows comprehensive status information in either table or JSON format.
Output Formats
Table Output (Default)
The default output format provides a human-readable table with aligned columns:
❯ canhazgpu status
GPU  STATUS      USER     DURATION     TYPE    MODEL                    DETAILS                  VALIDATION
---  ------      ----     --------     ----    -----                    -------                  ----------
0    AVAILABLE   -        -            -       -                        free for 0h 30m 15s     45MB used
1    IN_USE      alice    0h 15m 30s   RUN     meta-llama/Llama-2-7b-chat-hf  heartbeat 0h 0m 5s ago   8452MB, 1 processes
2    UNRESERVED  user bob -            -       mistralai/Mistral-7B-Instruct-v0.1    1024MB used by PID 12345 (python3), PID 67890 (jupyter)
3    IN_USE      charlie  1h 2m 15s    MANUAL  -                        expires in 3h 15m 45s   no usage detected
4    UNRESERVED  users alice, bob and charlie  -  -  meta-llama/Meta-Llama-3-8B-Instruct  2048MB used by PID 12345 (python3), PID 23456 (pytorch) and 2 more
JSON Output
For programmatic integration, use the --json or -j flag to get structured JSON output:
❯ canhazgpu status --json
[
  {
    "gpu_id": 0,
    "status": "AVAILABLE",
    "details": "free for 0h 30m 15s",
    "validation": "45MB used",
    "last_released": "2025-07-07T18:24:56.100193782Z"
  },
  {
    "gpu_id": 1,
    "status": "IN_USE",
    "user": "alice",
    "duration": "0h 15m 30s",
    "type": "RUN",
    "details": "heartbeat 0h 0m 5s ago",
    "validation": "8452MB, 1 processes",
    "model": {
      "provider": "meta-llama",
      "model": "meta-llama/Llama-2-7b-chat-hf"
    },
    "last_heartbeat": "2025-07-07T18:26:27.627148565Z"
  },
  {
    "gpu_id": 2,
    "status": "UNRESERVED",
    "details": "WITHOUT RESERVATION",
    "validation": "1024MB used",
    "unreserved_users": ["bob"],
    "process_info": "1024MB used by PID 12345 (python3), PID 67890 (jupyter)",
    "model": {
      "provider": "mistralai",
      "model": "mistralai/Mistral-7B-Instruct-v0.1"
    }
  },
  {
    "gpu_id": 3,
    "status": "IN_USE",
    "user": "charlie",
    "duration": "1h 2m 15s",
    "type": "MANUAL",
    "details": "expires in 3h 15m 45s",
    "validation": "no usage detected",
    "expiry_time": "2025-07-08T01:48:44Z"
  }
]
JSON Field Reference
| Field | Type | Description | 
|---|---|---|
| gpu_id | integer | GPU identifier (0, 1, 2, etc.) | 
| status | string | Current status: AVAILABLE, IN_USE, UNRESERVED, or ERROR | 
| user | string | Username (if GPU is reserved) | 
| duration | string | How long the GPU has been reserved | 
| type | string | Reservation type: RUN or MANUAL | 
| details | string | Context-specific information | 
| validation | string | Memory usage and process information | 
| model | object | Detected AI model information | 
| model.provider | string | Model provider (e.g., "meta-llama", "openai") | 
| model.model | string | Full model identifier | 
| last_released | string | ISO timestamp when GPU was last released | 
| last_heartbeat | string | ISO timestamp of last heartbeat | 
| expiry_time | string | ISO timestamp when manual reservation expires | 
| unreserved_users | array | List of users with unreserved processes | 
| process_info | string | Process details for unreserved usage | 
| error | string | Error message (for ERROR status) | 
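As in the example output above, fields that do not apply to a GPU are simply omitted (an AVAILABLE GPU has no user or type), so use defensive lookups when consuming the JSON. A minimal illustrative sketch in Python:
import json
import subprocess

result = subprocess.run(["canhazgpu", "status", "--json"],
                        capture_output=True, text=True, check=True)
for gpu in json.loads(result.stdout):
    # Optional fields may be absent, so fall back to placeholders
    user = gpu.get("user", "-")
    model = gpu.get("model", {}).get("model", "-")
    print(f"GPU {gpu['gpu_id']}: {gpu['status']:<10} user={user:<10} model={model}")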
Status Information Explained
Status Types
AVAILABLE
- Meaning: GPU is free and can be allocated
- Time info: Shows how long it has been free (for LRU allocation)
- Validation: Shows current memory usage (usually low baseline usage)
IN_USE (Proper Reservations)
Components:
- alice: Username who reserved the GPU
- 0h 15m 30s: How long it's been reserved
- RUN: Reservation type (RUN or MANUAL)
- meta-llama/Llama-2-7b-chat-hf: Detected AI model (if any)
- heartbeat 0h 0m 5s ago: Time since the last heartbeat (shown in the DETAILS column)
- 8452MB, 1 processes: Actual usage validation
Reservation Types:
Run-type reservations:
1    IN_USE      alice    0h 15m 30s   RUN     meta-llama/Llama-2-7b-chat-hf  heartbeat 0h 0m 5s ago   8452MB, 1 processes
- Created by the canhazgpu run command
- Maintained by periodic heartbeats
- Auto-released when process ends or heartbeat stops
- Shows heartbeat timing in DETAILS column
Manual reservations:
- Created by the canhazgpu reserve command
- Time-based expiry
- Must be manually released or will expire
- Shows expiry timing in DETAILS column
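With the JSON output, the type field makes it easy to list manual reservations and when they expire; a minimal sketch (field names as in the JSON reference above):
import json
import subprocess

# List manual reservations and their expiry information
result = subprocess.run(["canhazgpu", "status", "--json"],
                        capture_output=True, text=True, check=True)
for gpu in json.loads(result.stdout):
    if gpu.get("type") == "MANUAL":
        print(f"GPU {gpu['gpu_id']}: {gpu.get('user', '?')}, "
              f"{gpu.get('details', '')} (expiry_time={gpu.get('expiry_time', 'n/a')})")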
UNRESERVED
2    UNRESERVED  user bob -            -       mistralai/Mistral-7B-Instruct-v0.1    1024MB used by PID 12345 (python3), PID 67890 (jupyter)  -
- Meaning: Someone is using the GPU without proper reservation
- User identification: Shows which user(s) are running unreserved processes
- Model detection: Shows detected AI model in MODEL column
- Process details: Lists PIDs and process names using the GPU in DETAILS column
- Impact: This GPU will be excluded from allocation until usage stops
Multiple unreserved users:
4    UNRESERVED  users alice, bob and charlie  -  -  meta-llama/Meta-Llama-3-8B-Instruct  2048MB used by PID 12345 (python3), PID 23456 (pytorch) and 2 more  -
Validation Information
The VALIDATION column shows actual GPU usage detected via nvidia-smi:
Confirms Proper Usage
- GPU is reserved and actually being used
- Shows memory usage and process count
- Indicates healthy, proper resource utilization
No Usage Detected
- GPU is reserved but no processes are running
- Might indicate:
  - Preparation phase before starting work
  - Finished work but reservation not yet released
  - Stale reservation that should be cleaned up
Baseline Usage Only
- Available GPU with minimal background usage
- Normal baseline memory usage from GPU drivers
- Safe to allocate
Monitoring Patterns
Regular Health Checks
# Quick status check
canhazgpu status
# Monitor changes over time
watch -n 30 canhazgpu status
# Log status for analysis
canhazgpu status >> gpu_usage_log.txt
Identifying Problems
Stale Reservations
- Long reservation with no actual usage
- User likely forgot to release
- Will auto-expire soon
Heartbeat Issues
1    IN_USE      bob      2h 30m 0s    RUN     codellama/CodeLlama-7b-Instruct-hf        heartbeat 0h 5m 30s ago 8452MB, 1 processes
A heartbeat that is several minutes old on a RUN reservation usually means the controlling canhazgpu run process has stopped reporting; the reservation will be released automatically once the heartbeat stops.
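Both kinds of problems can be flagged programmatically from the JSON output: an IN_USE entry whose validation reads "no usage detected", or a RUN reservation whose last_heartbeat is old. A rough sketch (the five-minute threshold here is an arbitrary choice, not a canhazgpu setting):
import json
import subprocess
from datetime import datetime, timezone

def parse_ts(ts):
    # Timestamps look like 2025-07-07T18:26:27.627148565Z; trim sub-microsecond
    # digits and the trailing Z so datetime.fromisoformat accepts them
    ts = ts.rstrip("Z")
    if "." in ts:
        head, frac = ts.split(".")
        ts = f"{head}.{frac[:6]}"
    return datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)

result = subprocess.run(["canhazgpu", "status", "--json"],
                        capture_output=True, text=True, check=True)
now = datetime.now(timezone.utc)
for gpu in json.loads(result.stdout):
    if gpu.get("status") != "IN_USE":
        continue
    if gpu.get("validation") == "no usage detected":
        print(f"GPU {gpu['gpu_id']}: reserved by {gpu.get('user')} but idle")
    heartbeat = gpu.get("last_heartbeat")
    if gpu.get("type") == "RUN" and heartbeat:
        age = (now - parse_ts(heartbeat)).total_seconds()
        if age > 300:  # arbitrary 5-minute threshold
            print(f"GPU {gpu['gpu_id']}: heartbeat {age:.0f}s old for {gpu.get('user')}")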
Unreserved Usage Patterns
2    UNRESERVED  user charlie  -       -       microsoft/DialoGPT-large     12GB used by PID 12345 (python3)      -
5    UNRESERVED  user charlie  -       -       NousResearch/Nous-Hermes-2-Yi-34B   8GB used by PID 23456 (jupyter)       -
The same user showing up with unreserved processes on multiple GPUs usually means jobs are being launched without canhazgpu run or canhazgpu reserve; those GPUs stay excluded from allocation until the processes stop.
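To see who is behind unreserved usage across the whole machine, the unreserved_users field in the JSON output can be grouped by user; a small sketch:
import json
import subprocess
from collections import defaultdict

result = subprocess.run(["canhazgpu", "status", "--json"],
                        capture_output=True, text=True, check=True)
gpus_by_user = defaultdict(list)
for gpu in json.loads(result.stdout):
    if gpu.get("status") == "UNRESERVED":
        for user in gpu.get("unreserved_users", []):
            gpus_by_user[user].append(gpu["gpu_id"])

for user, gpu_ids in sorted(gpus_by_user.items()):
    print(f"{user}: unreserved usage on GPU(s) {gpu_ids}")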
Team Coordination
Planning Allocations
❯ canhazgpu status
GPU  STATUS     USER   DURATION   TYPE    MODEL            DETAILS                  VALIDATION
---  ------     ----   --------   ----    -----            -------                  ----------
0    AVAILABLE  -      -          -       -                free for 2h 0m 0s       1MB used     # Good candidate
1    AVAILABLE  -      -          -       -                free for 0h 30m 0s      1MB used     # Recently used
2    IN_USE     alice  0h 5m 0s   RUN     teknium/OpenHermes-2.5-Mistral-7B  heartbeat 0h 0m 3s ago   2048MB, 1 processes  # Just started
3    IN_USE     bob    3h 45m 0s  MANUAL  -                expires in 0h 15m 0s    no usage detected    # Expiring soon
From this, you can see:
- GPU 0 is the best choice (LRU)
- GPU 1 was used recently
- GPU 3 will be available in about 15 minutes
Resource Conflicts
In a busy system, the status table might break down like this:
- Available: 2 of 8 GPUs
- In use with reservations: 4 GPUs
- Unreserved usage: 2 GPUs
This is a clear indication that unreserved usage is reducing available capacity.
Advanced Monitoring
Integration with Monitoring Systems
Prometheus/Grafana Integration
#!/bin/bash
# gpu_metrics.sh - Export metrics for monitoring
STATUS=$(canhazgpu status)
# Count GPU states
AVAILABLE=$(echo "$STATUS" | grep -c "AVAILABLE")
IN_USE=$(echo "$STATUS" | grep -c "IN_USE")
UNRESERVED=$(echo "$STATUS" | grep -c "UNRESERVED")
# Export metrics
echo "gpu_available $AVAILABLE"
echo "gpu_in_use $IN_USE"
echo "gpu_unreserved $UNRESERVED"
echo "gpu_total $((AVAILABLE + IN_USE + UNRESERVED))"
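If you would rather compute these counts from structured data than grep the table, a minimal Python sketch using the --json flag (metric names mirror the script above):
#!/usr/bin/env python3
# gpu_metrics.py - export the same metrics from the JSON output
import json
import subprocess

result = subprocess.run(["canhazgpu", "status", "--json"],
                        capture_output=True, text=True, check=True)
gpus = json.loads(result.stdout)

# Count GPUs by the status field
counts = {"AVAILABLE": 0, "IN_USE": 0, "UNRESERVED": 0}
for gpu in gpus:
    counts[gpu["status"]] = counts.get(gpu["status"], 0) + 1

print(f"gpu_available {counts['AVAILABLE']}")
print(f"gpu_in_use {counts['IN_USE']}")
print(f"gpu_unreserved {counts['UNRESERVED']}")
print(f"gpu_total {len(gpus)}")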
Log Analysis
# Capture status snapshots with timestamps
while true; do
    { date; canhazgpu status; } >> gpu_monitoring.log
    sleep 300  # Every 5 minutes
done
# Analyze usage patterns
grep -c "AVAILABLE" gpu_monitoring.log
grep "UNRESERVED" gpu_monitoring.log | sort | uniq -c
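For longer-term analysis it is often easier to log structured snapshots instead of text; a sketch that appends one JSON object per poll to a gpu_status.jsonl file (the file name and interval are arbitrary choices):
#!/usr/bin/env python3
# log_status.py - append timestamped JSON status snapshots for later analysis
import json
import subprocess
import time
from datetime import datetime, timezone

while True:
    result = subprocess.run(["canhazgpu", "status", "--json"],
                            capture_output=True, text=True, check=True)
    snapshot = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "gpus": json.loads(result.stdout),
    }
    with open("gpu_status.jsonl", "a") as f:
        f.write(json.dumps(snapshot) + "\n")
    time.sleep(300)  # every 5 minutes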
Automated Alerts
Unreserved Usage Detection
#!/bin/bash
# unreserved_alert.sh
STATUS=$(canhazgpu status)
UNRESERVED=$(echo "$STATUS" | grep "UNRESERVED")
if [ -n "$UNRESERVED" ]; then
    echo "ALERT: Unreserved GPU usage detected!"
    echo "$UNRESERVED"
    # Send notification (customize as needed)
    echo "$UNRESERVED" | mail -s "GPU Policy Violation" admin@company.com
fi
Capacity Monitoring
#!/bin/bash
# capacity_alert.sh
STATUS=$(canhazgpu status)
AVAILABLE=$(echo "$STATUS" | grep -c "AVAILABLE")
# Count only the GPU rows (they start with a GPU id), not the table header
TOTAL=$(echo "$STATUS" | grep -cE "^[0-9]+")
UTILIZATION=$((100 * (TOTAL - AVAILABLE) / TOTAL))
if [ "$UTILIZATION" -gt 90 ]; then
    echo "WARNING: GPU utilization at ${UTILIZATION}%"
    canhazgpu status
fi
Status Command Integration
Shell Scripts
#!/bin/bash
# wait_for_gpu.sh - Wait until GPUs become available
echo "Waiting for GPUs to become available..."
while true; do
    AVAILABLE=$(canhazgpu status | grep -c "AVAILABLE")
    if [ "$AVAILABLE" -ge 2 ]; then
        echo "GPUs available! Starting job..."
        canhazgpu run --gpus 2 -- python train.py
        break
    fi
    echo "Only $AVAILABLE GPUs available, waiting..."
    sleep 60
done
Python Integration
Using JSON Output (Recommended)
import subprocess
import json
import time
from datetime import datetime, timezone
def get_gpu_status():
    """Get GPU status as structured data using JSON output"""
    result = subprocess.run(['canhazgpu', 'status', '--json'], 
                          capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"Status check failed: {result.stderr}")
    return json.loads(result.stdout)
def get_available_gpus():
    """Get list of available GPU IDs"""
    status = get_gpu_status()
    return [gpu['gpu_id'] for gpu in status if gpu['status'] == 'AVAILABLE']
def get_gpu_by_user(username):
    """Get GPUs reserved by a specific user"""
    status = get_gpu_status()
    return [gpu for gpu in status if gpu.get('user') == username]
def check_unreserved_usage():
    """Check for unreserved GPU usage"""
    status = get_gpu_status()
    unreserved = [gpu for gpu in status if gpu['status'] == 'UNRESERVED']
    if unreserved:
        print("WARNING: Unreserved GPU usage detected!")
        for gpu in unreserved:
            users = gpu.get('unreserved_users', [])
            process_info = gpu.get('process_info', 'Unknown processes')
            print(f"  GPU {gpu['gpu_id']}: Users {users} - {process_info}")
    return unreserved
def get_gpu_utilization():
    """Calculate GPU utilization statistics"""
    status = get_gpu_status()
    total = len(status)
    available = len([gpu for gpu in status if gpu['status'] == 'AVAILABLE'])
    in_use = len([gpu for gpu in status if gpu['status'] == 'IN_USE'])
    unreserved = len([gpu for gpu in status if gpu['status'] == 'UNRESERVED'])
    return {
        'total': total,
        'available': available,
        'in_use': in_use,
        'unreserved': unreserved,
        'utilization_percent': ((in_use + unreserved) / total) * 100
    }
# Usage examples
print("Available GPUs:", get_available_gpus())
print("Utilization:", get_gpu_utilization())
check_unreserved_usage()
Legacy Text Parsing
import subprocess
import re
import time
def get_gpu_status_legacy():
    """Parse the canhazgpu status table output (legacy method; prefer --json)"""
    result = subprocess.run(['canhazgpu', 'status'], 
                          capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"Status check failed: {result.stderr}")
    status = {}
    for line in result.stdout.strip().split('\n'):
        fields = line.split()
        # Data rows start with the numeric GPU id; skip header and separator lines
        if not fields or not fields[0].isdigit():
            continue
        gpu_id = int(fields[0])
        if fields[1] == 'AVAILABLE':
            status[gpu_id] = 'available'
        elif fields[1] == 'UNRESERVED':
            status[gpu_id] = 'unreserved'
        elif fields[1] == 'IN_USE':
            status[gpu_id] = 'reserved'
    return status
def wait_for_gpus(count=1, timeout=3600):
    """Wait for specified number of GPUs to become available"""
    start_time = time.time()
    while time.time() - start_time < timeout:
        status = get_gpu_status_legacy()
        available = sum(1 for state in status.values() if state == 'available')
        if available >= count:
            return True
        print(f"Waiting for {count} GPUs... ({available} currently available)")
        time.sleep(30)
    return False
# Usage
if wait_for_gpus(2):
    print("GPUs available! Starting training...")
    subprocess.run(['canhazgpu', 'run', '--gpus', '2', '--', 'python', 'train.py'])
else:
    print("Timeout waiting for GPUs")
The status command is your primary tool for understanding GPU resource utilization, identifying conflicts, and coordinating with your team. Regular monitoring helps maintain efficient resource usage and prevents conflicts.