Unreserved Usage Detection

One of canhazgpu's key features is detecting and handling GPUs that are being used without proper reservations. This prevents resource conflicts and enforces fair resource sharing policies.

What is Unreserved Usage?

Unreserved usage occurs when:

- A GPU has active processes consuming >1GB of memory
- No proper reservation exists for that GPU in the system
- The GPU usage was not coordinated through canhazgpu

Common Scenarios

# These create unreserved usage:
CUDA_VISIBLE_DEVICES=0,1 python train.py           # Direct GPU access
jupyter notebook                                    # Jupyter using default GPUs
docker run --gpus all pytorch/pytorch python       # Container with GPU access
python -c "import torch; torch.cuda.set_device(2)" # Explicit GPU selection

# These are proper usage:
canhazgpu run --gpus 2 -- python train.py          # Proper reservation
canhazgpu reserve --gpus 1 && jupyter notebook     # Manual reservation first

Detection Methods

Real-Time Scanning

canhazgpu detects unreserved usage through:

nvidia-smi Integration: Queries actual GPU processes and memory usage
Process Ownership Detection: Identifies which users are running processes
Memory Threshold: Considers GPUs with >1GB usage as "in use"
Cross-Reference: Compares actual usage against Redis reservation database
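The cross-reference step can be pictured with a minimal sketch. This is an illustration under stated assumptions, not canhazgpu's actual implementation: the function name, the `(gpu_index, pid, used_mb)` input shape (the kind of per-process data that `nvidia-smi --query-compute-apps=... --format=csv,noheader,nounits` can supply once mapped to GPU indices), and the plain set standing in for the Redis reservation lookup are all hypothetical.

```python
from collections import defaultdict

MEMORY_THRESHOLD_MB = 1024  # the ">1GB" threshold described above


def find_unreserved(processes, reserved_gpus, threshold_mb=MEMORY_THRESHOLD_MB):
    """Return {gpu_index: total_mb} for GPUs in use without a reservation.

    processes     -- iterable of (gpu_index, pid, used_mb) tuples
                     (hypothetical shape, e.g. parsed from nvidia-smi output)
    reserved_gpus -- set of GPU indices that hold a reservation; stands in
                     for the Redis lookup (assumption)
    """
    # Sum per-process memory into a per-GPU total.
    usage = defaultdict(int)
    for gpu_index, _pid, used_mb in processes:
        usage[gpu_index] += used_mb

    # Keep GPUs that are over the threshold and have no reservation.
    return {
        gpu: total
        for gpu, total in usage.items()
        if total > threshold_mb and gpu not in reserved_gpus
    }


processes = [(0, 111, 8452), (2, 12345, 900), (2, 67890, 400), (3, 222, 45)]
print(find_unreserved(processes, reserved_gpus={0}))  # {2: 1300}
```

GPU 0 is skipped because it is reserved, GPU 3 because its usage is below the threshold, and GPU 2 is reported because its two processes together exceed it.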

Detection Timing

Unreserved usage detection runs:

- Before every allocation (run and reserve commands)
- During status checks (status command)
- Atomically during allocation (within Redis transactions)

Status Display

Single Unreserved User

GPU STATUS    USER     DURATION    TYPE    MODEL            DETAILS                    VALIDATION
--- --------- -------- ----------- ------- ---------------- -------------------------- ---------------------
2   in use    bob                          mistralai/Mistral-7B-Instruct-v0.1           WITHOUT RESERVATION        1024MB used by PID 12345 (python3), PID 67890 (jupyter)

Information shown:

- USER: The user running unreserved processes (bob)
- DETAILS: Shows "WITHOUT RESERVATION" status
- VALIDATION: Total memory consumption and process details
  - PID 12345 (python3): Process ID and name
  - PID 67890 (jupyter): Additional processes (if any)

Multiple Unreserved Users

GPU STATUS    USER            DURATION    TYPE    MODEL            DETAILS                    VALIDATION
--- --------- --------------- ----------- ------- ---------------- -------------------------- ---------------------
3   in use    alice,bob,charlie                    meta-llama/Meta-Llama-3-8B-Instruct                    WITHOUT RESERVATION        2048MB used by PID 12345 (python3), PID 23456 (pytorch) and 2 more

Information shown:

- USER: All users with processes on this GPU (alice,bob,charlie)
- DETAILS: Shows "WITHOUT RESERVATION" status
- VALIDATION: Total memory consumption and process details
  - and 2 more: Indicates additional processes (display truncated for readability)
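The truncation in the VALIDATION column can be pictured with a small formatter. This is only a sketch: the function name is hypothetical, and the limit of two listed processes is an assumption inferred from the sample output above, not a documented setting.

```python
def format_validation(total_mb, processes, max_shown=2):
    """Build a VALIDATION-style summary string.

    processes -- list of (pid, name) tuples
    max_shown -- how many processes to list before truncating
                 (assumed value, chosen to match the sample output)
    """
    shown = ", ".join(f"PID {pid} ({name})" for pid, name in processes[:max_shown])
    hidden = len(processes) - max_shown
    suffix = f" and {hidden} more" if hidden > 0 else ""
    return f"{total_mb}MB used by {shown}{suffix}"


procs = [(12345, "python3"), (23456, "pytorch"), (34567, "python3"), (45678, "jupyter")]
print(format_validation(2048, procs))
# 2048MB used by PID 12345 (python3), PID 23456 (pytorch) and 2 more
```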

Process Details

The system attempts to show:

- Process ID (PID): Unique identifier for the process
- Process name: Executable name (python3, jupyter, etc.)
- User ownership: Which user launched the process
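One way such an ownership lookup could work on Linux is to read the owner of the process's `/proc` entry. The sketch below is an assumption about the general technique, not a description of canhazgpu's internals, and `process_owner` is a hypothetical helper. It also shows why the lookup is best-effort: the process may have exited, or the UID may have no passwd entry.

```python
import os
import pwd


def process_owner(pid):
    """Return the username owning `pid`, or None if it cannot be determined."""
    try:
        # On Linux, /proc/<pid> is normally owned by the process's user.
        uid = os.stat(f"/proc/{pid}").st_uid
    except OSError:
        return None  # process exited, or /proc is unavailable
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)  # no passwd entry (common inside containers)


print(process_owner(os.getpid()))
```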

Impact on Allocation

Automatic Exclusion

When unreserved usage is detected:

Pre-allocation scan identifies GPUs with unreserved usage
Exclusion from available pool removes those GPUs from allocation candidates
LRU calculation operates only on truly available GPUs
Error reporting provides detailed feedback about unavailable resources
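The steps above can be sketched as a single selection function. Everything here is illustrative: `pick_gpus`, its parameters, and the use of plain dicts and sets in place of Redis state are assumptions. The error text mirrors the message format shown in the next section.

```python
def pick_gpus(requested, last_released, reserved, unreserved):
    """Choose `requested` GPU indices, least recently used first.

    last_released -- {gpu_index: timestamp of last release} for every GPU
    reserved      -- set of GPU indices with an active reservation
    unreserved    -- set of GPU indices in use without a reservation
    """
    # Exclude reserved GPUs and GPUs with unreserved usage from the pool.
    available = [g for g in last_released if g not in reserved and g not in unreserved]
    if len(available) < requested:
        msg = f"Not enough GPUs available. Requested: {requested}, Available: {len(available)}"
        if unreserved:
            msg += (f" ({len(unreserved)} GPUs in use without reservation"
                    " - run 'canhazgpu status' for details)")
        raise RuntimeError(msg)
    # Oldest release time first, so the least recently used GPUs are handed out.
    available.sort(key=lambda g: last_released[g])
    return available[:requested]


last_released = {0: 500.0, 1: 100.0, 2: 300.0, 3: 200.0}
print(pick_gpus(2, last_released, reserved={0}, unreserved={1}))  # [3, 2]
```

GPU 1 has the oldest release time but is never considered, because the LRU ordering is applied only after the exclusion step.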

Error Messages

❯ canhazgpu run --gpus 3 -- python train.py
Error: Not enough GPUs available. Requested: 3, Available: 1 (2 GPUs in use without reservation - run 'canhazgpu status' for details)

Message breakdown:

- Requested: 3: Number of GPUs you asked for
- Available: 1: Number of GPUs actually available for allocation
- 2 GPUs in use without reservation: Number of GPUs excluded due to unreserved usage
- Suggests running status for detailed information

Handling Unreserved Usage

Investigation Steps

Check detailed status:
```
canhazgpu status
```

Identify unreserved users and processes:

GPU STATUS    USER     DURATION    TYPE    MODEL            DETAILS                    VALIDATION
--- --------- -------- ----------- ------- ---------------- -------------------------- ---------------------
2   in use    bob                          mistralai/Mistral-7B-Instruct-v0.1           WITHOUT RESERVATION        1024MB used by PID 12345 (python3), PID 67890 (jupyter)

Contact the user to coordinate proper usage
Verify process details if needed:
```
ps -p 12345 -o pid,ppid,user,command
```

Resolution Options

Option 1: User Stops Unreserved Usage

# User bob stops their processes
kill 12345 67890

# Or gracefully shuts down
# Ctrl+C in their terminal, close Jupyter, etc.

Option 2: User Creates Proper Reservation

# User bob creates a proper reservation
canhazgpu reserve --gpus 1 --duration 4h

# Then continues their work
# GPU will now show as properly reserved

Option 3: Wait for Completion

# Wait for unreserved processes to finish naturally
# Then GPU will become available again

User Education and Policies

Training Users

Help users understand proper GPU usage:

# Wrong way
CUDA_VISIBLE_DEVICES=0 python train.py

# Right way  
canhazgpu run --gpus 1 -- python train.py

Policy Enforcement Examples

Gentle Reminder

#!/bin/bash
# unreserved_reminder.sh
STATUS=$(canhazgpu status)
UNAUTHORIZED=$(echo "$STATUS" | grep "WITHOUT RESERVATION")

if [ -n "$UNAUTHORIZED" ]; then
    echo "Reminder: Please use canhazgpu for GPU reservations"
    echo "$UNAUTHORIZED"
fi

Automated Notification

#!/bin/bash
# unreserved_notify.sh
STATUS=$(canhazgpu status)
UNAUTHORIZED=$(echo "$STATUS" | grep "WITHOUT RESERVATION")

if [ -n "$UNAUTHORIZED" ]; then
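    # Extract usernames from the USER column (4th whitespace-separated field in
    # the table layout shown above; comma-separated when several users share a GPU)
    USERS=$(echo "$UNAUTHORIZED" | awk '{print $4}' | tr ',' '\n' | sort -u)
    for U in $USERS; do
        # write(1) only reaches users who are logged in and accept messages
        echo "Please use canhazgpu for GPU reservations" | write "$U" || true
    done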
fi

Advanced Detection Scenarios

Multi-GPU Unreserved Usage

GPU STATUS    USER     DURATION    TYPE    MODEL            DETAILS                    VALIDATION
--- --------- -------- ----------- ------- ---------------- -------------------------- ---------------------
1   in use    alice                        NousResearch/Nous-Hermes-2-Yi-34B                         WITHOUT RESERVATION        2048MB used by PID 11111 (python3)
2   in use    alice                        NousResearch/Nous-Hermes-2-Yi-34B                         WITHOUT RESERVATION        2048MB used by PID 11111 (python3)
5   in use    alice                        NousResearch/Nous-Hermes-2-Yi-34B                         WITHOUT RESERVATION        2048MB used by PID 11111 (python3)

Here the same process (PID 11111) is using GPUs 1, 2, and 5, so the user should reserve all three through canhazgpu.
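Rows like these can be grouped by PID to spot a single job that spans several GPUs. The sketch below assumes the table layout shown above (GPU index in the first column, `PID <n>` entries in the VALIDATION column); `gpus_by_pid` is a hypothetical helper, and PIDs hidden behind "and N more" will not be seen by it.

```python
import re
from collections import defaultdict


def gpus_by_pid(status_text):
    """Map each PID on a WITHOUT RESERVATION row to the GPU indices it appears on."""
    mapping = defaultdict(list)
    for line in status_text.splitlines():
        if "WITHOUT RESERVATION" not in line:
            continue  # skips the header, separator, and reserved/available rows
        gpu_index = int(line.split()[0])
        for pid in re.findall(r"PID (\d+)", line):
            mapping[int(pid)].append(gpu_index)
    return dict(mapping)


sample = """\
1   in use    alice    NousResearch/Nous-Hermes-2-Yi-34B    WITHOUT RESERVATION    2048MB used by PID 11111 (python3)
2   in use    alice    NousResearch/Nous-Hermes-2-Yi-34B    WITHOUT RESERVATION    2048MB used by PID 11111 (python3)
5   in use    alice    NousResearch/Nous-Hermes-2-Yi-34B    WITHOUT RESERVATION    2048MB used by PID 11111 (python3)
"""

multi_gpu = {pid: gpus for pid, gpus in gpus_by_pid(sample).items() if len(gpus) > 1}
print(multi_gpu)  # {11111: [1, 2, 5]}
```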

Mixed Usage Patterns

GPU STATUS    USER     DURATION    TYPE    MODEL            DETAILS                    VALIDATION
--- --------- -------- ----------- ------- ---------------- -------------------------- ---------------------
0   in use    bob      1h 30m 0s   run     microsoft/DialoGPT-large heartbeat 5s ago          8452MB, 1 processes
1   in use    alice                        teknium/OpenHermes-2.5-Mistral-7B                         WITHOUT RESERVATION        1024MB used by PID 22222 (jupyter)
2   available          free for 2h                                                    45MB used
3   in use    charlie  45m 0s      manual                   expires in 7h 15m 0s      no usage detected

This output mixes properly reserved GPUs (0 and 3), unreserved usage (GPU 1), and an available GPU (GPU 2).

Container-Based Unreserved Usage

GPU STATUS    USER        DURATION    TYPE    MODEL            DETAILS                    VALIDATION
--- --------- ----------- ----------- ------- ---------------- -------------------------- ---------------------
1   in use    root,alice                      codellama/CodeLlama-7b-Instruct-hf                      WITHOUT RESERVATION        3072MB used by PID 33333 (dockerd), PID 44444 (python3) and 1 more

Docker containers are running with GPU access here; users should coordinate container GPU usage through canhazgpu.

System Integration

Monitoring Integration

import logging
import re
import subprocess

logger = logging.getLogger(__name__)


def check_unreserved_usage():
    """Check for unreserved GPU usage and return details"""
    result = subprocess.run(['canhazgpu', 'status'],
                            capture_output=True, text=True)

    unreserved = []
    for line in result.stdout.split('\n'):
        if 'WITHOUT RESERVATION' in line:
            # Rows follow the table layout shown earlier:
            # "<gpu> in use <users> <model> WITHOUT RESERVATION <N>MB used by ..."
            fields = line.split()
            match = re.search(r'(\d+)MB used', line)
            if match and len(fields) > 3:
                unreserved.append({
                    'users': fields[3],  # USER column, comma-separated
                    'memory_mb': int(match.group(1)),
                    'raw_line': line
                })

    return unreserved

# Usage in monitoring
violations = check_unreserved_usage()
if violations:
    for v in violations:
        logger.warning(f"Unreserved GPU usage: {v['users']} using {v['memory_mb']}MB")

Automated Response

#!/bin/bash
# unreserved_response.sh

# Check for unreserved usage
UNAUTHORIZED=$(canhazgpu status | grep "WITHOUT RESERVATION")

if [ -n "$UNAUTHORIZED" ]; then
    # Log the violation
    echo "$(date): $UNAUTHORIZED" >> /var/log/gpu_violations.log

    # Extract and notify users (USER is the 4th field in the table layout shown above)
    echo "$UNAUTHORIZED" | awk '{print $4}' | tr ',' '\n' | sort -u | while read -r U; do
        # Send notification to user
        echo "GPU Policy Violation: Please use 'canhazgpu reserve' for GPU access" | \
            mail -s "GPU Usage Policy" "$U@company.com"
    done

    # Alert administrators
    echo "$UNAUTHORIZED" | mail -s "GPU Policy Violations Detected" admin@company.com
fi

Best Practices

For Users

Always use canhazgpu for GPU access
Check status first with canhazgpu status before starting work
Reserve appropriately - don't over-reserve, but don't skip reservations
Clean up - release manual reservations when done

For Administrators

Monitor regularly - set up automated checks for unreserved usage
Educate users - provide training on proper GPU reservation practices
Set clear policies - document expected GPU usage procedures
Respond quickly - address unreserved usage promptly to prevent conflicts

For System Integration

Integrate with job schedulers - ensure SLURM/PBS jobs use canhazgpu
Container orchestration - configure Kubernetes/Docker to respect reservations
Development tools - configure IDEs and notebooks to check for reservations

Unreserved usage detection ensures fair resource sharing and prevents the frustrating "GPU out of memory" errors that occur when multiple users unknowingly compete for the same resources.