Troubleshooting

This guide covers common issues and their solutions when using canhazgpu in production environments.

Redis Connection Issues

Redis Server Not Running

Symptoms:

 canhazgpu status
redis.exceptions.ConnectionError: Error 111 connecting to 127.0.0.1:6379. Connection refused.

Solutions:

# Check Redis status
sudo systemctl status redis-server

# Start Redis if not running
sudo systemctl start redis-server
sudo systemctl enable redis-server

# Verify Redis is accessible
redis-cli ping
# Should return: PONG

Alternative Redis installations:

# If using a different Redis installation
ps aux | grep redis
netstat -tlnp | grep 6379

# Check Redis configuration
sudo grep -E "^(bind|port)" /etc/redis/redis.conf

Redis Permission Issues

Symptoms:

 canhazgpu status
redis.exceptions.ResponseError: NOAUTH Authentication required.

Solutions:

# Check if Redis requires authentication
redis-cli
127.0.0.1:6379> ping
(error) NOAUTH Authentication required.

# Option 1: Disable AUTH for localhost (recommended for canhazgpu)
sudo vim /etc/redis/redis.conf
# Comment out the line: requirepass your_password_here
sudo systemctl restart redis-server

# Option 2: Configure canhazgpu with AUTH (requires code modification)
# Currently not supported - disable AUTH instead

Redis Memory Issues

Symptoms:

 canhazgpu reserve
redis.exceptions.ResponseError: OOM command not allowed when used memory > 'maxmemory'.

Solutions:

# Check Redis memory usage
redis-cli info memory

# Increase maxmemory in /etc/redis/redis.conf as needed:
maxmemory 512mb

# Or disable the limit entirely:
maxmemory 0

sudo systemctl restart redis-server

GPU Provider Issues

Wrong Provider Cached

Issue: The system uses the wrong GPU provider after switching hardware

Solution:

# Check current cached provider
redis-cli get "canhazgpu:provider"

# Re-initialize with correct provider
canhazgpu admin --gpus 8 --provider nvidia --force
# OR
canhazgpu admin --gpus 8 --provider amd --force

# Let system auto-detect
canhazgpu admin --gpus 8 --force
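
If the wrong provider persists after re-initializing, the cached key can be cleared directly. A sketch, assuming canhazgpu re-detects the provider when the key is absent (verify against your version before relying on this):

# Clear the cached provider, then re-initialize with auto-detection
redis-cli del "canhazgpu:provider"
canhazgpu admin --gpus 8 --force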

Multiple GPU Vendors

Issue: System has both NVIDIA and AMD GPUs

Current Limitation: canhazgpu currently supports a single provider per system

Workaround: Use the provider for the GPUs you want to manage:

# Use NVIDIA provider for NVIDIA GPUs
canhazgpu admin --gpus 4 --provider nvidia

# Use AMD provider for AMD GPUs  
canhazgpu admin --gpus 2 --provider amd

NVIDIA GPU Issues

nvidia-smi Not Available

Symptoms:

 canhazgpu status
nvidia-smi: command not found

Solutions:

# Check if NVIDIA drivers are installed
lspci | grep -i nvidia

# Install NVIDIA drivers (Ubuntu/Debian)
sudo apt update
sudo apt install nvidia-driver-470  # or latest version

# Install NVIDIA drivers (CentOS/RHEL/Fedora)
sudo dnf install nvidia-driver

# Verify installation
nvidia-smi

NVIDIA Driver Version Issues

Symptoms:

 nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.

Solutions:

# Check driver status
sudo dmesg | grep nvidia
lsmod | grep nvidia

# Restart NVIDIA services
sudo systemctl restart nvidia-persistenced
sudo modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe nvidia

# If still failing, reinstall drivers
sudo apt purge nvidia-*
sudo apt autoremove
sudo apt install nvidia-driver-470
sudo reboot

NVIDIA GPU Detection Issues

Symptoms:

 nvidia-smi
No devices were found

Solutions:

# Check hardware detection
lspci | grep -i nvidia

# Check if GPUs are disabled in BIOS
# Reboot and check BIOS settings

# Check if GPUs are in compute mode
nvidia-smi -q -d COMPUTE

# Reset GPU state if needed (a specific GPU index is required)
sudo nvidia-smi --gpu-reset -i 0

AMD GPU Issues

amd-smi Not Available

Error:

amd-smi: command not found

Solution:

# Check if AMD GPUs are present
lspci | grep -i amd

# Test amd-smi availability
amd-smi list

# Install ROCm and amd-smi (Ubuntu/Debian)
sudo apt update
sudo apt install rocm-dev amd-smi-lib

# Install ROCm (CentOS/RHEL/Fedora)
sudo dnf install rocm-dev amd-smi-lib

Verify installation:

amd-smi list

AMD Driver Communication Error

Error:

 amd-smi list
Failed to initialize ROCm

Solution: Check ROCm installation and permissions:

# Check ROCm installation
ls /opt/rocm/

# Check user permissions
groups $USER
# Should include 'render' and 'video' groups

# Add user to groups if needed
sudo usermod -a -G render,video $USER
# Log out and log back in

# Reload the amdgpu kernel module if needed
# (may fail while a display or compute job is using the GPU)
sudo modprobe -r amdgpu && sudo modprobe amdgpu
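
If group membership and drivers look correct but initialization still fails, it is worth checking the ROCm device nodes themselves. A quick sanity check (/dev/kfd is the ROCm compute interface; the renderD* nodes belong to the GPUs):

ls -la /dev/kfd /dev/dri/renderD*
# Both should be readable and writable by the 'render' (or 'video') group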

Allocation Problems

Not Enough GPUs Available

Symptoms:

 canhazgpu run --gpus 2 -- python train.py
Error: Not enough GPUs available. Requested: 2, Available: 1 (1 GPUs in use without reservation - run 'canhazgpu status' for details)

Diagnosis:

# Check detailed status
canhazgpu status

# Look for unreserved usage
canhazgpu status | grep "WITHOUT RESERVATION"

# Check actual GPU processes
nvidia-smi
amd-smi list

Solutions:

  1. Contact unreserved users:
     • Identify users from the status output
     • Ask them to use proper reservations

  2. Wait for reservations to expire:
     • Check manual reservation expiry times
     • Wait for run-type reservations to complete

  3. Reduce the GPU request:

     canhazgpu run --gpus 1 -- python train.py

Allocation Lock Timeouts

Symptoms:

 canhazgpu reserve --gpus 2
Error: Failed to acquire allocation lock after 5 attempts

Causes:

  • High contention (multiple users allocating simultaneously)
  • Stale locks from crashed processes
  • Redis performance issues

Solutions:

# Wait and retry
sleep 10
canhazgpu reserve --gpus 2

# Check for stale locks in Redis
redis-cli
127.0.0.1:6379> GET canhazgpu:allocation_lock
127.0.0.1:6379> DEL canhazgpu:allocation_lock  # If stale

# Check Redis performance
redis-cli --latency -i 1
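
Before deleting a lock, it helps to confirm it is genuinely stale rather than held by an in-flight allocation. A sketch using the key's TTL (assumption: a healthy lock carries an expiry):

redis-cli TTL canhazgpu:allocation_lock
# -2: key does not exist (no lock held)
# -1: key exists with no expiry set — likely stale
#  N: lock expires in N seconds — wait rather than delete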

GPU State Corruption

Symptoms:

 canhazgpu status
Error: GPU state corrupted for GPU 2

Solutions:

# Check Redis data
redis-cli
127.0.0.1:6379> GET canhazgpu:gpu:2
127.0.0.1:6379> DEL canhazgpu:gpu:2  # Clear corrupted state

# Reinitialize GPU pool if needed
canhazgpu admin --gpus 8 --force

# This will clear all reservations - warn users first

Process and Heartbeat Issues

Stale Heartbeats

Symptoms:

 canhazgpu status
GPU STATUS    USER     DURATION    TYPE    MODEL            DETAILS                    VALIDATION
--- --------- -------- ----------- ------- ---------------- -------------------------- ---------------------
1   in use    alice    3h 0m 0s    run                      heartbeat 15m 30s ago     

Analysis:

  • The heartbeat should update every ~60 seconds
  • Heartbeats more than 5 minutes old indicate problems
  • The GPU will auto-release after 15 minutes without a heartbeat

Solutions:

# Check if process is still running
ps aux | grep alice | grep python

# If process died, wait for auto-cleanup (15 min timeout)
# If process is stuck, user should kill it

# Manual cleanup (admin only, if urgent)
redis-cli
127.0.0.1:6379> DEL canhazgpu:gpu:1
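
To track a suspect reservation without re-running commands by hand, a simple watch loop over the status output works (a sketch that greps for the heartbeat field shown above):

watch -n 60 'canhazgpu status | grep "heartbeat"'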

Orphaned Processes

Symptoms:

 canhazgpu status
GPU STATUS    USER     DURATION    TYPE    MODEL            DETAILS                    VALIDATION
--- --------- -------- ----------- ------- ---------------- -------------------------- ---------------------
0   available          free for 5m                                                    2048MB, 1 processes

  • GPU shows as available but has active processes
  • Usually indicates process started outside canhazgpu

Solutions:

# Identify the process
nvidia-smi
amd-smi list

# Contact process owner
ps -o user,pid,command -p <PID>

# Ask them to:
# 1. Stop the process, or
# 2. Create proper reservation
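
If nvidia-smi or amd-smi does not show the owning PID clearly, listing the processes that hold the GPU device nodes open is another route. A sketch using fuser:

# NVIDIA: processes with the device nodes open
sudo fuser -v /dev/nvidia*

# AMD/ROCm equivalent
sudo fuser -v /dev/kfd /dev/dri/renderD*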

Permission and Access Issues

User Cannot Run canhazgpu

Symptoms:

 canhazgpu status
bash: canhazgpu: command not found

Solutions:

# Check if installed
which canhazgpu
ls -la /usr/local/bin/canhazgpu

# Check PATH
echo $PATH

# Install if missing
sudo cp canhazgpu /usr/local/bin/
sudo chmod +x /usr/local/bin/canhazgpu

Process Owner Detection Fails

Symptoms:

 canhazgpu status
GPU STATUS    USER     DURATION    TYPE    MODEL            DETAILS                    VALIDATION
--- --------- -------- ----------- ------- ---------------- -------------------------- ---------------------
2   in use    unknown                                       WITHOUT RESERVATION        1024MB used by PID 12345 (unknown process)

Solutions:

# Check /proc filesystem access
ls -la /proc/12345/

# Check ps command availability
ps -o user,command -p 12345

# If still failing, process may have terminated
# Wait for next status check

Redis Access Denied

Symptoms:

 canhazgpu status
redis.exceptions.ResponseError: DENIED Redis is running in protected mode

Solutions:

# Check Redis configuration
sudo grep protected-mode /etc/redis/redis.conf

# Option 1: Disable protected mode (if Redis is localhost-only)
sudo vim /etc/redis/redis.conf
# Set: protected-mode no

# Option 2: Set bind address (recommended)
# Set: bind 127.0.0.1

sudo systemctl restart redis-server

Performance Issues

Slow Status Commands

Symptoms:

  • canhazgpu status takes >5 seconds
  • High latency in GPU allocation

Diagnosis:

# Time the status command
time canhazgpu status

# Check nvidia-smi or amd-smi performance
time nvidia-smi
time amd-smi list

# Check Redis performance
redis-cli --latency -i 1

Solutions:

# Optimize Redis
sudo vim /etc/redis/redis.conf
# Add: tcp-keepalive 60
# Add: timeout 0

# Check system resources
htop
iostat -x 1

# Check for disk I/O issues
sudo iotop
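
Redis's slow log can also reveal whether individual commands are the bottleneck (entries above the configured threshold are recorded):

# Show the ten slowest recent commands and the current threshold
redis-cli slowlog get 10
redis-cli config get slowlog-log-slower-than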

High Memory Usage

Symptoms:

  • System running out of memory
  • Redis using excessive memory

Solutions:

# Check Redis memory usage
redis-cli info memory

# Set memory limit in /etc/redis/redis.conf
maxmemory 256mb
maxmemory-policy allkeys-lru

# Monitor system memory
free -h
ps aux --sort=-%mem | head -20

Data Recovery

Lost GPU Reservations

Symptoms:

  • All GPUs show as available after a system restart
  • Users lose their reservations

Recovery:

# Check if Redis data persisted
redis-cli
127.0.0.1:6379> KEYS canhazgpu:*

# If no data, check for Redis backup
ls -la /var/lib/redis/

# If a backup exists, restore it. Redis loads dump.rdb from its data
# directory at startup; an RDB file cannot be replayed with --pipe.
sudo systemctl stop redis-server
sudo cp /backup/canhazgpu-<date>.rdb /var/lib/redis/dump.rdb
sudo chown redis:redis /var/lib/redis/dump.rdb
sudo systemctl start redis-server

# If no backup, reinitialize
canhazgpu admin --gpus 8 --force
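
To keep reservations across restarts in the first place, make sure Redis persistence is enabled. A minimal sketch for /etc/redis/redis.conf (standard Redis directives; tune the intervals to your environment):

# Snapshot to dump.rdb if at least one key changed in the last 60 seconds
save 60 1

# Append-only log for finer-grained durability
appendonly yes
appendfsync everysec

Restart Redis after editing: sudo systemctl restart redis-server.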

Corrupted Redis Database

Symptoms:

 canhazgpu status
redis.exceptions.ResponseError: WRONGTYPE Operation against a key holding the wrong kind of value

Recovery:

# Backup current state (if possible)
redis-cli --rdb /tmp/corrupted-backup.rdb

# Clear corrupted data
redis-cli FLUSHALL

# Reinitialize canhazgpu
canhazgpu admin --gpus 8

# Notify users about data loss

Preventive Measures

Regular Health Checks

#!/bin/bash
# /usr/local/bin/canhazgpu-healthcheck.sh

# Check all components
redis-cli ping > /dev/null || echo "ERROR: Redis down"
nvidia-smi > /dev/null || echo "ERROR: NVIDIA drivers down"
canhazgpu status > /dev/null || echo "ERROR: canhazgpu failing"

# Check for common issues (matches the "heartbeat Xm Ys ago" status format)
STALE_HEARTBEATS=$(canhazgpu status | grep "heartbeat" | grep -cE "[5-9]m|[1-9][0-9]m")
if [ "$STALE_HEARTBEATS" -gt 0 ]; then
    echo "WARNING: $STALE_HEARTBEATS stale heartbeats detected"
fi

UNAUTHORIZED=$(canhazgpu status | grep -c "WITHOUT RESERVATION")
if [ "$UNAUTHORIZED" -gt 0 ]; then
    echo "WARNING: $UNAUTHORIZED GPUs in use without reservation"
fi
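
To run the health check periodically, a cron entry like the following can be used (assumes the script path above; output goes to syslog via logger):

*/5 * * * * /usr/local/bin/canhazgpu-healthcheck.sh 2>&1 | logger -t canhazgpu-health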

Monitoring Scripts

#!/bin/bash
# /usr/local/bin/canhazgpu-monitor.sh

while true; do
    TOTAL=$(canhazgpu status | grep -c "^[0-9]")
    FREE=$(canhazgpu status | grep -c "available")
    echo "$(date): $TOTAL GPUs, $FREE available"
    sleep 300  # Every 5 minutes
done >> /var/log/canhazgpu-monitor.log

Backup Procedures

#!/bin/bash
# Daily Redis backup
redis-cli --rdb /backup/canhazgpu-$(date +%Y%m%d).rdb

# Keep last 7 days
find /backup/ -name "canhazgpu-*.rdb" -mtime +7 -delete
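
To schedule this daily, a crontab entry such as the following works (assuming the script is saved as /usr/local/bin/canhazgpu-backup.sh):

0 2 * * * /usr/local/bin/canhazgpu-backup.sh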

Getting Help

Debug Information Collection

When reporting issues, collect:

# System information
uname -a
lsb_release -a

# NVIDIA GPU information
nvidia-smi
lspci | grep -i nvidia

# AMD GPU information
amd-smi list
lspci | grep -i amd

# Redis information
redis-cli info
redis-cli config get '*'

# canhazgpu state
canhazgpu status
redis-cli keys 'canhazgpu:*'

# Process information
ps aux | grep -E "(redis|nvidia|python)"
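
A minimal sketch that captures all of the above into a single report file (the output path is illustrative):

#!/bin/bash
# Collect canhazgpu debug information into one file for a bug report
OUT="/tmp/canhazgpu-debug-$(date +%Y%m%d-%H%M%S).txt"
{
    echo "=== System ==="
    uname -a
    lsb_release -a 2>/dev/null

    echo "=== GPUs ==="
    lspci | grep -iE "nvidia|amd"
    nvidia-smi 2>&1
    amd-smi list 2>&1

    echo "=== Redis ==="
    redis-cli info
    redis-cli keys 'canhazgpu:*'

    echo "=== canhazgpu ==="
    canhazgpu status 2>&1
} > "$OUT"
echo "Debug report written to $OUT"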

Log Files to Check

  • /var/log/redis/redis-server.log
  • /var/log/syslog (or /var/log/messages)
  • dmesg output for hardware issues
  • Any custom monitoring logs

This troubleshooting guide covers the most common issues encountered in production deployments of canhazgpu. Most problems can be resolved by following these systematic approaches.