# GPU Validation
canhazgpu integrates with nvidia-smi to provide real-time validation of GPU usage, ensuring that reservations match actual resource utilization and detecting unreserved usage.
## How Validation Works

### nvidia-smi Integration

The system uses nvidia-smi to query actual GPU processes and memory usage:

```bash
# canhazgpu internally runs commands like:
nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
nvidia-smi --query-compute-apps=pid,process_name,gpu_uuid,used_memory --format=csv,noheader
```
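The second query emits one CSV row per compute process. As an illustration of how such rows can be consumed (this is a hypothetical Python sketch, not canhazgpu's actual implementation; the `GpuProcess` type and parser are invented here):

```python
from dataclasses import dataclass

@dataclass
class GpuProcess:
    pid: int
    name: str
    gpu_uuid: str
    used_memory_mb: int  # nvidia-smi reports memory in MiB

def parse_compute_apps(output: str) -> list[GpuProcess]:
    """Parse rows from --query-compute-apps=pid,process_name,gpu_uuid,used_memory."""
    processes = []
    for line in output.strip().splitlines():
        pid, name, uuid, mem = [field.strip() for field in line.split(",")]
        # with --format=csv,noheader the memory field looks like "8452 MiB"
        processes.append(GpuProcess(int(pid), name, uuid, int(mem.split()[0])))
    return processes

sample = "12345, python3, GPU-aaaa-bbbb, 8452 MiB\n23456, pytorch, GPU-cccc-dddd, 1024 MiB"
for p in parse_compute_apps(sample):
    print(p.pid, p.name, p.used_memory_mb)
```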
### Process Owner Detection

For each GPU process, canhazgpu identifies the owner:

- **Primary method**: Read `/proc/{pid}/status` to get the UID
- **Fallback method**: Use `ps -o uid= -p {pid}` if `/proc` is unavailable
- **Username resolution**: Convert the UID to a username using the system user database
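The fallback chain above can be sketched in Python as follows (canhazgpu's real implementation is not this code; the function names here are illustrative):

```python
import os
import pwd
import subprocess

def read_uid_from_proc(pid: int) -> int:
    """Primary method: parse the Uid: line from /proc/{pid}/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("Uid:"):
                return int(line.split()[1])  # first field is the real UID
    raise OSError("no Uid: line found")

def process_owner(pid: int) -> str:
    try:
        uid = read_uid_from_proc(pid)
    except OSError:
        # Fallback method: ask ps for the UID when /proc is unavailable
        out = subprocess.check_output(["ps", "-o", "uid=", "-p", str(pid)], text=True)
        uid = int(out.strip())
    try:
        # Username resolution via the system user database
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return "unknown"

print(process_owner(os.getpid()))  # owner of this Python process
```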
### Memory Threshold

GPUs with more than the configured memory threshold in use are considered "in use" (default: 1GB = 1024 MB):

- **Below threshold**: Baseline GPU driver usage; the GPU is considered available
- **Above threshold**: Active workload detected; the GPU is marked as in use

The threshold can be customized using the `--memory-threshold` flag:

```bash
# Use a lower threshold (512 MB) to detect lighter GPU usage
canhazgpu status --memory-threshold 512

# Use a higher threshold (2 GB) to ignore smaller allocations
canhazgpu status --memory-threshold 2048

# Apply to allocation commands as well
canhazgpu run --memory-threshold 512 --gpus 1 -- python train.py
```
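The in-use decision itself is a simple per-GPU comparison of used memory against the threshold. A minimal sketch (the function and constant names are illustrative assumptions, not canhazgpu's code):

```python
MEMORY_THRESHOLD_MB = 1024  # mirrors the 1GB default behind --memory-threshold

def gpu_state(used_mb: int, threshold_mb: int = MEMORY_THRESHOLD_MB) -> str:
    # at or below the threshold we assume only baseline driver usage
    return "in use" if used_mb > threshold_mb else "available"

print(gpu_state(45))        # baseline driver usage
print(gpu_state(8452))      # active workload
print(gpu_state(700, 512))  # a lower threshold catches lighter usage
```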
## Validation Output

### Status Display

The validation information appears in the VALIDATION column of the status output:

```
❯ canhazgpu status
GPU  STATUS     USER      DURATION      TYPE     MODEL             DETAILS                     VALIDATION
---  ---------  --------  -----------   -------  ----------------  --------------------------  ---------------------
0    available            free for 30m                                                         45MB used
1    in use     alice     15m 30s       run      llama-2-7b-chat   heartbeat 5s ago            8452MB, 1 processes
2    in use     bob                                                WITHOUT RESERVATION         1024MB used by PID 12345 (python3)
3    in use     charlie   1h 2m 15s     manual                     expires in 3h 15m 45s       no usage detected
```
### Validation States

#### Normal Available GPU

- Low memory usage indicates the GPU is available
- Only driver baseline memory consumption
- Safe to allocate

#### Confirmed Reservation Usage

- High memory usage confirms the GPU is actively used
- Process count shows the number of applications using the GPU
- Validates that the reservation matches actual usage

#### No Usage Detected

- GPU is reserved but not actually being used
- Could indicate:
    - Preparation phase (normal)
    - Finished work but the reservation was not released
    - Stale reservation (should be cleaned up)

#### Unauthorized Usage Detail

- Shows the specific user running unreserved processes
- Lists PIDs and process names
- Memory usage quantifies the unreserved resource consumption

## Validation Benefits
### Prevents Double Allocation

Without validation, you might have scenarios like:

- GPU 1 is "AVAILABLE" according to Redis
- But someone is actually using GPU 1 without a reservation
- A new reservation could conflict with the existing usage

With validation:

- GPU 1 shows "IN USE WITHOUT RESERVATION"
- GPU 1 is automatically excluded from allocation
- Conflicts and out-of-memory errors are prevented
### User Accountability

```
GPU  STATUS     USER      DURATION     TYPE     MODEL             DETAILS                     VALIDATION
---  ---------  --------  -----------  -------  ----------------  --------------------------  ---------------------
2    in use     bob                                               WITHOUT RESERVATION         1024MB used by PID 12345 (python3)
5    in use     charlie                                           WITHOUT RESERVATION         8GB used by PID 23456 (jupyter)
```

Clear identification of which users need to be contacted about policy compliance.
### Resource Optimization

```
GPU  STATUS     USER      DURATION     TYPE     MODEL             DETAILS                     VALIDATION
---  ---------  --------  -----------  -------  ----------------  --------------------------  ---------------------
3    in use     alice     8h 0m 0s     manual                     expires in 30m 0s           no usage detected
```

Identifies stale reservations that could be released early to improve resource availability.
## Validation in Allocation

### Pre-Allocation Scanning

Before any GPU allocation, canhazgpu:
- Scans all GPUs using nvidia-smi
- Identifies unreserved usage and excludes those GPUs
- Updates available GPU pool with only truly available GPUs
- Proceeds with allocation using the validated pool
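The scanning steps above amount to set subtraction over the GPU pool. A hypothetical sketch (the data shapes and function name are assumptions for illustration, not canhazgpu's internals):

```python
def validated_pool(all_gpus: list[int],
                   reserved: set[int],
                   unreserved_usage: set[int]) -> list[int]:
    """Keep only GPUs that are neither reserved nor flagged by validation."""
    return [g for g in all_gpus if g not in reserved and g not in unreserved_usage]

# GPUs 1 and 3 are reserved; GPU 2 shows unreserved usage -> only GPU 0 remains
print(validated_pool([0, 1, 2, 3], reserved={1, 3}, unreserved_usage={2}))  # [0]
```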
### Error Messages with Context

```
❯ canhazgpu run --gpus 3 -- python train.py
Error: Not enough GPUs available. Requested: 3, Available: 1 (2 GPUs in use without reservation - run 'canhazgpu status' for details)
```

The error message indicates:

- You requested 3 GPUs
- Only 1 is actually available
- 2 GPUs are being used without proper reservations
- It suggests checking the status output for more details
## Validation Edge Cases

### Multiple Users Per GPU

```
GPU  STATUS     USER               DURATION     TYPE     MODEL             DETAILS                     VALIDATION
---  ---------  -----------------  -----------  -------  ----------------  --------------------------  ---------------------
4    in use     alice,bob,charlie                                          WITHOUT RESERVATION         2048MB used by PID 12345 (python3), PID 23456 (pytorch) and 2 more
```

When multiple users have processes on the same GPU:

- All usernames are listed
- Process details may be truncated for readability
- Total memory usage is shown
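The truncated detail string in the sample row could be produced by formatting logic along these lines (a purely illustrative sketch; the function and its two-entry limit are assumptions, not canhazgpu's code):

```python
def format_processes(total_mb: int, procs: list[tuple[int, str]], limit: int = 2) -> str:
    """Render 'NNNNMB used by PID x (name), ... and K more' style details."""
    shown = ", ".join(f"PID {pid} ({name})" for pid, name in procs[:limit])
    extra = len(procs) - limit
    suffix = f" and {extra} more" if extra > 0 else ""
    return f"{total_mb}MB used by {shown}{suffix}"

print(format_processes(2048, [(12345, "python3"), (23456, "pytorch"),
                              (34567, "python3"), (45678, "jupyter")]))
# 2048MB used by PID 12345 (python3), PID 23456 (pytorch) and 2 more
```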
### Process Information Limitations

Sometimes process details may be limited:

```
GPU  STATUS     USER      DURATION     TYPE     MODEL             DETAILS                     VALIDATION
---  ---------  --------  -----------  -------  ----------------  --------------------------  ---------------------
4    in use     unknown                                           WITHOUT RESERVATION         1024MB used by PID 12345 (unknown process)
```

This can happen when:

- The process owner cannot be determined (permission issues)
- The process name is not available
- The process terminates between detection and information gathering
### High Memory Baseline

Some systems may have higher baseline GPU memory usage due to:

- Multiple GPU monitoring tools
- Persistent CUDA contexts
- Background ML services

If baseline usage approaches the default 1GB threshold, raise it with the `--memory-threshold` flag described above.
## Validation Performance

### Caching and Efficiency
- Validation runs only during allocation and status commands
- nvidia-smi queries are batched for efficiency
- Process information is gathered in parallel where possible
### Impact on System Performance
- Minimal overhead: validation typically completes in under a second
- No continuous monitoring or background processes
- GPU validation doesn't interfere with actual GPU workloads
## Integration with Other Features

### LRU Allocation

Validation integrates with LRU (Least Recently Used) allocation:

- Only validated available GPUs are considered for LRU ranking
- GPUs with unreserved usage are excluded from the LRU pool
- Last-release timestamps are preserved across validation
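LRU selection over the validated pool can be sketched as a sort by last-release timestamp (an illustrative data model; canhazgpu's actual structures differ):

```python
def lru_pick(available: list[int],
             last_released: dict[int, float],
             count: int) -> list[int]:
    """Pick the least recently used GPUs from the validated-available pool.
    GPUs with no recorded release sort first (timestamp 0.0)."""
    ranked = sorted(available, key=lambda g: last_released.get(g, 0.0))
    return ranked[:count]

available = [0, 2, 5]                                # GPUs that passed validation
last_released = {0: 1700000300.0, 2: 1700000100.0}   # GPU 5 was never released
print(lru_pick(available, last_released, 2))         # least recently used first
```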
### Race Condition Protection

Validation is integrated into the atomic allocation process:

- Redis Lua scripts receive the list of GPUs with unreserved usage
- Allocation fails if requested GPUs become unreserved during allocation
- This ensures a consistent state between validation and allocation
### Heartbeat System

Validation complements the heartbeat system:

- Heartbeats track reservation liveness
- Validation tracks actual GPU usage
- Together they provide complete resource lifecycle management
## Troubleshooting Validation

### nvidia-smi Not Available

Ensure NVIDIA drivers are properly installed:

```bash
# Test nvidia-smi availability
nvidia-smi

# If not available, install NVIDIA drivers
sudo apt install nvidia-driver-*   # Ubuntu
```
### Permission Issues

This may occur when:

- The `/proc` filesystem has restricted access
- The user lacks permissions to query process information
- The process terminates between detection and query
The system will still function but with less detailed process information.
### Memory Reporting Discrepancies

Different tools may report slightly different GPU memory usage:

- nvidia-smi vs. CUDA runtime memory reports
- Shared memory vs. process-specific memory
- Memory allocated vs. memory actually used

canhazgpu uses nvidia-smi's reporting for consistency across all processes.
GPU validation ensures that canhazgpu maintains accurate, real-time awareness of GPU resource utilization, preventing conflicts and enabling efficient resource sharing.