Unreserved Usage Detection
One of canhazgpu's key features is detecting and handling GPUs that are being used without proper reservations. This prevents resource conflicts and enforces fair resource sharing policies.
What is Unreserved Usage?
Unreserved usage occurs when:
- A GPU has active processes consuming >1GB of memory
- No proper reservation exists for that GPU in the system
- The GPU usage was not coordinated through canhazgpu
Common Scenarios
# These create unreserved usage:
CUDA_VISIBLE_DEVICES=0,1 python train.py           # Direct GPU access
jupyter notebook                                    # Jupyter using default GPUs
docker run --gpus all pytorch/pytorch python       # Container with GPU access
python -c "import torch; torch.cuda.set_device(2)" # Explicit GPU selection
# These are proper usage:
canhazgpu run --gpus 2 -- python train.py          # Proper reservation
canhazgpu reserve --gpus 1 && jupyter notebook     # Manual reservation first
Detection Methods
Real-Time Scanning
canhazgpu detects unreserved usage through:
- nvidia-smi Integration: Queries actual GPU processes and memory usage (sketched after this list)
- Process Ownership Detection: Identifies which users are running processes
- Memory Threshold: Considers GPUs with >1GB usage as "in use"
- Cross-Reference: Compares actual usage against Redis reservation database
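The core of the check can be approximated with nvidia-smi alone. This is a minimal sketch of the idea, not canhazgpu's actual implementation: it sums per-process memory by GPU UUID and applies the 1GB threshold, but does not perform the Redis cross-reference.
# Sum per-process GPU memory by GPU UUID, then flag GPUs over the 1GB threshold
nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_memory \
           --format=csv,noheader,nounits |
awk -F', ' '{ mem[$1] += $4 }
            END { for (g in mem) if (mem[g] > 1024) print g ": " mem[g] " MiB in use" }'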
Detection Timing
Unreserved usage detection runs:
- Before every allocation (run and reserve commands)
- During status checks (status command)
- Atomically during allocation (within Redis transactions)
Status Display
Single Unreserved User
GPU STATUS    USER DURATION    TYPE    MODEL                               DETAILS              VALIDATION
--- --------- ---- ----------- ------- ----------------------------------- -------------------- -------------------------------------------------------
2   in use    bob                      mistralai/Mistral-7B-Instruct-v0.1  WITHOUT RESERVATION  1024MB used by PID 12345 (python3), PID 67890 (jupyter)
Information shown:
- USER: The user running unreserved processes (bob)
- DETAILS: Shows "WITHOUT RESERVATION" status
- VALIDATION: Total memory consumption and process details
- PID 12345 (python3): Process ID and name
- PID 67890 (jupyter): Additional processes (if any)
Multiple Unreserved Users
GPU STATUS    USER              DURATION    TYPE    MODEL                                DETAILS              VALIDATION
--- --------- ----------------- ----------- ------- ------------------------------------ -------------------- -------------------------------------------------------------------
3   in use    alice,bob,charlie                     meta-llama/Meta-Llama-3-8B-Instruct  WITHOUT RESERVATION  2048MB used by PID 12345 (python3), PID 23456 (pytorch) and 2 more
Information shown:
- USER: All users with processes on this GPU (alice,bob,charlie)
- DETAILS: Shows "WITHOUT RESERVATION" status
- VALIDATION: Total memory consumption and process details
- and 2 more: Indicates additional processes (display truncated for readability)
Process Details
The system attempts to show:
- Process ID (PID): Unique identifier for the process
- Process name: Executable name (python3, jupyter, etc.)
- User ownership: Which user launched the process (see the lookup sketch below)
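The same details can be reproduced by hand with standard tools. This is a generic sketch, not a canhazgpu command; PID 12345 is the example PID from the status output above.
# Show the PID, owning user, and executable name for a flagged process
ps -o pid=,user=,comm= -p 12345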
Impact on Allocation
Automatic Exclusion
When unreserved usage is detected:
- Pre-allocation scan identifies GPUs with unreserved usage
- Exclusion from available pool removes those GPUs from allocation candidates
- LRU calculation operates only on truly available GPUs
- Error reporting provides detailed feedback about unavailable resources
Error Messages
❯ canhazgpu run --gpus 3 -- python train.py
Error: Not enough GPUs available. Requested: 3, Available: 1 (2 GPUs in use without reservation - run 'canhazgpu status' for details)
Message breakdown:
- Requested: 3: Number of GPUs you asked for
- Available: 1: Number of GPUs actually available for allocation
- 2 GPUs in use without reservation: Number of GPUs excluded due to unreserved usage
- Suggests running canhazgpu status for detailed information (a snippet for sizing requests to what is actually free follows below)
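To avoid this error up front, size the request to what is actually free. A small sketch, assuming each free GPU prints one line containing the word "available" in the status output (as in the examples in this page):
AVAILABLE=$(canhazgpu status | grep -c "available")
echo "Free GPUs: $AVAILABLE"
canhazgpu run --gpus "$AVAILABLE" -- python train.py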
Handling Unreserved Usage
Investigation Steps
1. Check detailed status:
   canhazgpu status
2. Identify unreserved users and processes from the WITHOUT RESERVATION entries:
   canhazgpu status | grep "WITHOUT RESERVATION"
3. Contact the user to coordinate proper usage.
4. Verify process details if needed (see the sketch below).
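For step 4, nvidia-smi can confirm which processes are actually on the GPU. This is standard nvidia-smi, not a canhazgpu subcommand:
# List every GPU compute process with its PID, name, and memory use
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
Cross-checking these PIDs against the status output confirms the report before you contact anyone.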
Resolution Options
Option 1: User Stops Unreserved Usage
# User bob stops their processes
kill 12345 67890
# Or gracefully shuts down
# Ctrl+C in their terminal, close Jupyter, etc.
Option 2: User Creates Proper Reservation
# User bob creates a proper reservation
canhazgpu reserve --gpus 1 --duration 4h
# Then continues their work
# GPU will now show as properly reserved
Option 3: Wait for Completion
If the unreserved processes are short-lived, simply wait: once they exit and the GPU's memory use drops back below the 1GB threshold, the GPU returns to the available pool on the next status check or allocation.
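To poll for that without re-running the command by hand, the standard watch utility works (the 60-second interval is arbitrary):
watch -n 60 canhazgpu status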
User Education and Policies
Training Users
Help users understand proper GPU usage:
# Wrong way
CUDA_VISIBLE_DEVICES=0 python train.py
# Right way  
canhazgpu run --gpus 1 -- python train.py
Policy Enforcement Examples
Gentle Reminder
#!/bin/bash
# unreserved_reminder.sh
STATUS=$(canhazgpu status)
UNAUTHORIZED=$(echo "$STATUS" | grep "WITHOUT RESERVATION")
if [ -n "$UNAUTHORIZED" ]; then
    echo "Reminder: Please use canhazgpu for GPU reservations"
    echo "$UNAUTHORIZED"
fi
Automated Notification
#!/bin/bash
# unreserved_notify.sh
STATUS=$(canhazgpu status)
UNAUTHORIZED=$(echo "$STATUS" | grep "WITHOUT RESERVATION")
if [ -n "$UNAUTHORIZED" ]; then
    # Extract usernames (4th column of "in use" rows) and send notifications
    echo "$UNAUTHORIZED" | while read -r line; do
        USERS=$(echo "$line" | awk '{print $4}' | tr ',' ' ')
        for USER in $USERS; do
            # write delivers to the user's terminal if they are logged in
            echo "Please use canhazgpu for GPU reservations" | write "$USER"
        done
    done
fi
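Either script can be scheduled with cron. A hypothetical crontab entry running the reminder hourly (adjust the path to wherever you install the script):
0 * * * * /usr/local/bin/unreserved_reminder.sh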
Advanced Detection Scenarios
Multi-GPU Unreserved Usage
GPU STATUS    USER   DURATION    TYPE    MODEL                              DETAILS              VALIDATION
--- --------- ------ ----------- ------- ---------------------------------- -------------------- ----------------------------------
1   in use    alice                      NousResearch/Nous-Hermes-2-Yi-34B  WITHOUT RESERVATION  2048MB used by PID 11111 (python3)
2   in use    alice                      NousResearch/Nous-Hermes-2-Yi-34B  WITHOUT RESERVATION  2048MB used by PID 11111 (python3)
5   in use    alice                      NousResearch/Nous-Hermes-2-Yi-34B  WITHOUT RESERVATION  2048MB used by PID 11111 (python3)
Same process using multiple GPUs - user should reserve all needed GPUs properly.
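The fix is a single multi-GPU reservation instead of three ad-hoc claims, using the documented run syntax:
# Reserve all three GPUs through canhazgpu instead of claiming them directly
canhazgpu run --gpus 3 -- python train.py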
Mixed Usage Patterns
GPU STATUS    USER     DURATION    TYPE    MODEL                              DETAILS               VALIDATION
--- --------- -------- ----------- ------- ---------------------------------- --------------------- ----------------------------------
0   in use    bob      1h 30m 0s   run     microsoft/DialoGPT-large           heartbeat 5s ago      8452MB, 1 processes
1   in use    alice                        teknium/OpenHermes-2.5-Mistral-7B  WITHOUT RESERVATION   1024MB used by PID 22222 (jupyter)
2   available                                                                 free for 2h           45MB used
3   in use    charlie  45m 0s      manual                                     expires in 7h 15m 0s  no usage detected
Mix of proper usage, unreserved usage, and available GPUs.
Container-Based Unreserved Usage
GPU STATUS    USER       DURATION    TYPE    MODEL                               DETAILS              VALIDATION
--- --------- ---------- ----------- ------- ----------------------------------- -------------------- ------------------------------------------------------------------
1   in use    root,alice                     codellama/CodeLlama-7b-Instruct-hf  WITHOUT RESERVATION  3072MB used by PID 33333 (dockerd), PID 44444 (python3) and 1 more
Docker containers running with GPU access - users should coordinate container GPU usage through canhazgpu (one possible pattern is sketched below).
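One way to route a container through a reservation, assuming canhazgpu run exports the allocated GPU IDs in CUDA_VISIBLE_DEVICES (the inverse of the direct-access examples above) and that your Docker supports the --gpus device=... syntax:
canhazgpu run --gpus 1 -- bash -c \
    'docker run --gpus "device=$CUDA_VISIBLE_DEVICES" pytorch/pytorch python train.py'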
System Integration
Monitoring Integration
import logging
import re
import subprocess

logger = logging.getLogger(__name__)

def check_unreserved_usage():
    """Check for unreserved GPU usage and return details."""
    result = subprocess.run(['canhazgpu', 'status'],
                            capture_output=True, text=True)
    unreserved = []
    for line in result.stdout.split('\n'):
        if 'WITHOUT RESERVATION' in line:
            # USER is the 4th whitespace-separated field ("3 in use alice,bob ...")
            fields = line.split()
            users = fields[3] if len(fields) > 3 else ''
            # Total memory appears in the VALIDATION column as "<N>MB used"
            match = re.search(r'(\d+)MB used', line)
            memory = int(match.group(1)) if match else 0
            unreserved.append({
                'users': users,
                'memory_mb': memory,
                'raw_line': line,
            })
    return unreserved

# Usage in monitoring
violations = check_unreserved_usage()
if violations:
    for v in violations:
        logger.warning(f"Unreserved GPU usage: {v['users']} using {v['memory_mb']}MB")
Automated Response
#!/bin/bash
# unreserved_response.sh
# Check for unreserved usage
UNAUTHORIZED=$(canhazgpu status | grep "WITHOUT RESERVATION")
if [ -n "$UNAUTHORIZED" ]; then
    # Log the violation
    echo "$(date): $UNAUTHORIZED" >> /var/log/gpu_violations.log
    # Extract users (4th column of "in use" rows) and notify them
    echo "$UNAUTHORIZED" | while read -r line; do
        USERS=$(echo "$line" | awk '{print $4}' | tr ',' ' ')
        for USER in $USERS; do
            # Send notification to user
            echo "GPU Policy Violation: Please use 'canhazgpu reserve' for GPU access" | \
                mail -s "GPU Usage Policy" "$USER@company.com"
        done
    done
    # Alert administrators
    echo "$UNAUTHORIZED" | mail -s "GPU Policy Violations Detected" admin@company.com
fi
Best Practices
For Users
- Always use canhazgpu for GPU access
- Check status first with canhazgpu status before starting work
- Reserve appropriately - don't over-reserve, but don't skip reservations
- Clean up - release manual reservations when done (a minimal workflow follows)
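A minimal day-to-day workflow following these practices (canhazgpu release is assumed here as the counterpart to reserve for manual reservations):
canhazgpu status                             # check what's free first
canhazgpu run --gpus 1 -- python train.py    # reservation and release handled for you
canhazgpu reserve --gpus 1 --duration 2h     # or hold GPUs manually...
canhazgpu release                            # ...and release them when done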
For Administrators
- Monitor regularly - set up automated checks for unreserved usage
- Educate users - provide training on proper GPU reservation practices
- Set clear policies - document expected GPU usage procedures
- Respond quickly - address unreserved usage promptly to prevent conflicts
For System Integration
- Integrate with job schedulers - ensure SLURM/PBS jobs use canhazgpu
- Container orchestration - configure Kubernetes/Docker to respect reservations
- Development tools - configure IDEs and notebooks to check for reservations
Unreserved usage detection ensures fair resource sharing and prevents the frustrating "GPU out of memory" errors that occur when multiple users unknowingly compete for the same resources.