Operations

Monitoring

Container health checks, log aggregation, and system monitoring.

Uptime Kuma (Service Monitoring)

Uptime Kuma runs at http://{IP}:3001. It monitors all 15 containers via HTTP, TCP, and Docker health checks.

First-time setup:

  1. Open http://{IP}:3001 — create an admin account
  2. Add a new monitor for each service:
    • Type: HTTP for API services, Docker for containers
    • Interval: 60 seconds (recommended)
    • Retention: Keep defaults

Recommended monitors:

| Monitor | Type | Target |
| --- | --- | --- |
| Laravel API | HTTP | http://ubios_api:8000/api/v1/health |
| Agno | HTTP | http://ubios_agno:8001/health |
| PostgreSQL | Docker | ubios_postgres |
| Redis | TCP | ubios_redis:6379 |
| MinIO | HTTP | http://ubios-minio:9000/minio/health/live |
| Metabase | HTTP | http://ubios_metabase:3000/api/health |
| Dify | HTTP | http://ubios_dify_api:5001/health |
| LiteLLM Proxy | HTTP | http://ubios_litellm:4000/health |

Dozzle (Log Viewer)

Dozzle runs at http://{IP}:8080. It shows real-time logs for all containers with zero configuration.

Features:

  • View logs for any container by clicking its name
  • Search logs with regex or SQL
  • Split-screen to watch multiple containers
  • Filter by severity level

Container Health Checks

Check all containers are running:

docker ps --format "table {{.Names}}\t{{.Status}}"

Expected: all 15 containers show Up. Services that define health checks (postgres, minio) also show (healthy).
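With 15 containers it is easy to miss one bad row in the docker ps table. The sketch below filters that output down to containers that are not up or that report (unhealthy). The function name problem_containers is hypothetical, and the filter assumes Docker's usual status strings ("Up ...", "Exited ...", "Restarting ...").

```shell
# Print containers that are not running, or are running but failing their health check.
# Reads "name<TAB>status" lines on stdin, as produced by:
#   docker ps -a --format '{{.Names}}\t{{.Status}}'
problem_containers() {
  awk -F'\t' '$2 !~ /^Up/ || $2 ~ /\(unhealthy\)/ { print $1 ": " $2 }'
}

# Usage:
#   docker ps -a --format '{{.Names}}\t{{.Status}}' | problem_containers
```

Empty output means every container is up and none is reporting unhealthy.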

Individual Health Endpoints

| Service | Command | Expected |
| --- | --- | --- |
| Agno | curl http://$HOST_IP:8001/health | {"status": "ok"} |
| LiteLLM Proxy | curl http://$HOST_IP:4000/health | {"status": "ok"} |
| PostgreSQL | docker inspect ubios_postgres --format='{{.State.Health.Status}}' | healthy |
| MinIO | curl -sf http://$HOST_IP:9000/minio/health/live | HTTP 200 (no output, exit code 0) |
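The HTTP checks above can be wrapped in a small helper that reports pass or fail per service. This is a minimal sketch: check_endpoint is a hypothetical name, the 5-second timeout is an assumption, and it treats only HTTP 200 as healthy.

```shell
# check_endpoint NAME URL
# Prints "OK NAME" when the URL answers with HTTP 200,
# otherwise prints "FAIL NAME (HTTP code)" and returns 1.
check_endpoint() {
  # -w '%{http_code}' prints only the status code; curl reports 000 when it cannot connect
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$2")
  if [ "$code" = "200" ]; then
    echo "OK $1"
  else
    echo "FAIL $1 (HTTP ${code:-none})"
    return 1
  fi
}

# Usage, with HOST_IP set as in the commands above:
#   check_endpoint Agno  "http://$HOST_IP:8001/health"
#   check_endpoint MinIO "http://$HOST_IP:9000/minio/health/live"
```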

Redis Monitoring

# Check Redis is responding
docker exec -it ubios_redis redis-cli ping
# Expected: PONG

# Memory usage
docker exec -it ubios_redis redis-cli info memory | grep used_memory_human

# Connected clients
docker exec -it ubios_redis redis-cli info clients | grep connected_clients

# Cache hit rate inputs (rate = hits / (hits + misses))
docker exec -it ubios_redis redis-cli info stats | grep -E 'keyspace_(hits|misses)'
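
The hit rate is keyspace_hits / (keyspace_hits + keyspace_misses), both of which appear in the info stats output. A minimal sketch that computes the percentage from that output follows; redis_hit_rate is a hypothetical helper, not part of the stack.

```shell
# Read `redis-cli info stats` output on stdin and print the cache hit rate
# as a percentage, or "n/a" if no lookups have been recorded yet.
redis_hit_rate() {
  awk -F: '
    { sub(/\r$/, "", $2) }                  # redis-cli ends lines with CRLF
    $1 == "keyspace_hits"   { hits = $2 }
    $1 == "keyspace_misses" { misses = $2 }
    END {
      total = hits + misses
      if (total == 0) { print "n/a"; exit }
      printf "%.1f\n", 100 * hits / total
    }'
}

# Usage (no -t flag, since the output is piped):
#   docker exec ubios_redis redis-cli info stats | redis_hit_rate
```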

Queue Monitoring

# Check queue worker is running
docker compose -f docker-compose.ip-test.yml logs --tail=20 queue

# Failed jobs
docker exec -it ubios_api php artisan queue:failed

# Retry a failed job
docker exec -it ubios_api php artisan queue:retry {job_id}

Agent Activity

# Recent agent outputs
docker exec -it ubios_api bash -c '
  PGPASSWORD=331331331 psql -U postgres -h ubios_postgres -d ubios -c \
    "SELECT agent_name, output_type, severity, is_read, created_at
     FROM ubios_agent_state.agent_outputs
     ORDER BY created_at DESC LIMIT 10;"
'

# Scheduled job status
docker exec -it ubios_api bash -c '
  PGPASSWORD=331331331 psql -U postgres -h ubios_postgres -d ubios -c \
    "SELECT * FROM ubios_agent_state.scheduled_jobs WHERE is_active = true;"
'

Log Aggregation

All services log to stdout/stderr, which the Docker logging driver captures. View the logs via Dozzle (http://{IP}:8080) or the CLI:

# Follow all logs with timestamps
docker compose -f docker-compose.ip-test.yml logs -f -t

# Filter by severity (Agno)
docker compose -f docker-compose.ip-test.yml logs -f agno 2>&1 | grep -i error

# Laravel error log
docker exec -it ubios_api tail -f storage/logs/laravel.log
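
For a quick summary rather than a live tail, the same log stream can be counted by severity. The sketch below is a crude keyword match, assuming log lines contain words such as ERROR, CRITICAL, or WARNING; the actual log format of each service may differ, and severity_counts is a hypothetical name.

```shell
# Count log lines on stdin that mention an error or a warning (case-insensitive).
severity_counts() {
  awk '
    { line = toupper($0) }
    line ~ /ERROR|CRITICAL/ { err++; next }   # count each line once, errors first
    line ~ /WARN/           { warn++ }
    END { printf "errors=%d warnings=%d\n", err, warn }'
}

# Usage: summarize the last hour of Agno logs
#   docker compose -f docker-compose.ip-test.yml logs --since 1h agno 2>&1 | severity_counts
```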

Key Metrics to Watch

| Metric | Warning | Critical | Where to check |
| --- | --- | --- | --- |
| PostgreSQL connections | > 80% pool | > 95% | pg_stat_activity |
| Redis memory | > 80% maxmemory | > 95% | redis-cli info memory |
| Queue depth | > 100 jobs | > 500 jobs | queue:failed + Redis list length |
| Agno response time | > 5s P95 | > 10s P95 | GET /health + application logs |
| Disk usage | > 80% | > 90% | df -h on host |
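
Each row above is the same comparison: a measured value against a warning and a critical threshold. A minimal sketch of that comparison follows, with two hypothetical helpers (pct and level) that work on whole numbers. How each value is collected is left to the commands in the last column.

```shell
# pct USED TOTAL -> integer percentage (rounded down)
pct() {
  echo $(( $1 * 100 / $2 ))
}

# level VALUE WARN CRIT -> prints ok, warning, or critical
# Thresholds are exclusive, matching the "> N" notation in the table.
level() {
  if [ "$1" -gt "$3" ]; then
    echo critical
  elif [ "$1" -gt "$2" ]; then
    echo warning
  else
    echo ok
  fi
}

# Usage: disk usage of / on the host against the 80% / 90% thresholds
#   used=$(df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
#   level "$used" 80 90
```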

Production Monitoring (Target)

Phase 3 adds centralized monitoring via Prometheus + Grafana on a control plane server, tracking:

  • Container health status (all 12+ containers)
  • PostgreSQL connection pool saturation and query latency
  • Redis memory usage and eviction rate
  • Agent session count, LLM call count, average duration
  • Query cache hit rate
  • Scheduled job success/failure rate
  • Document extraction queue depth
  • Disk usage (Postgres data, MinIO objects)