Monitoring

Uptime Kuma (Service Monitoring)

Uptime Kuma runs at http://{IP}:3001. Use it to monitor all 15 containers with HTTP, TCP, and Docker container checks.

First-time setup:

Open http://{IP}:3001 and create an admin account.
Add a new monitor for each service:
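- Type: HTTP(s) for API services, TCP Port for Redis, Docker Container for services checked by container state (this type needs a Docker Host configured in Uptime Kuma's settings)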
- Interval: 60 seconds (recommended)
- Retention: Keep defaults

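Recommended monitors (the targets are Docker container hostnames, so they resolve only if Uptime Kuma is attached to the same Docker network as the services):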

Monitor	Type	Target
Laravel API	HTTP	`http://ubios_api:8000/api/v1/health`
Agno	HTTP	`http://ubios_agno:8001/health`
PostgreSQL	Docker	`ubios_postgres`
Redis	TCP	`ubios_redis:6379`
MinIO	HTTP	`http://ubios-minio:9000/minio/health/live`
Metabase	HTTP	`http://ubios_metabase:3000/api/health`
Dify	HTTP	`http://ubios_dify_api:5001/health`
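LiteLLM Proxy	HTTP	`http://ubios_litellm:4000/health`

Note: on a stock LiteLLM proxy, `/health` sends a test call to every model in the config and requires the master key. If this monitor fails with 401 or is slow, `/health/liveliness` is the lightweight, unauthenticated probe.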

Dozzle (Log Viewer)

Dozzle runs at http://{IP}:8080. It shows real-time logs for all containers with zero configuration.

Features:

- View logs for any container by clicking its name
- Search logs with regex or SQL
- Split-screen to watch multiple containers side by side
- Filter by severity level

Container Health Checks

Check that all containers are running:

docker ps --format "table {{.Names}}\t{{.Status}}"

Expected: all 15 containers show `Up`; the services that define a health check (postgres, minio) also show `(healthy)`.
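To spot problems without reading every row, the listing can be filtered mechanically. The sketch below assumes the Status strings Docker currently prints (`Up ... (healthy)`, `(unhealthy)`, `(health: starting)`, `Exited ...`, `Restarting ...`) and is meant to be fed `docker ps -a`, so that a stopped container is reported instead of silently dropping out of the list. The function name `report_unhealthy` is hypothetical.

```bash
#!/usr/bin/env bash
# Report containers that are not running or whose health check is not passing.
# Reads `docker ps -a --format '{{.Names}}\t{{.Status}}'` output on stdin.
# Exits 0 if every container is fine, 1 if anything was reported.
report_unhealthy() {
  awk -F'\t' '
    NF < 2 { next }                      # skip blank or malformed lines
    {
      name = $1; status = $2
      # Not running at all: Exited, Created, Restarting, ...
      if (status !~ /^Up/) { print name ": " status; bad = 1; next }
      # Running, but a defined health check is failing or still starting
      if (status ~ /\((unhealthy|health: starting)\)/) { print name ": " status; bad = 1 }
    }
    END { exit bad }                     # unset bad evaluates to 0
  '
}

# Usage:
#   docker ps -a --format '{{.Names}}\t{{.Status}}' | report_unhealthy && echo "all containers OK"
```

With `-a`, unrelated stopped containers on the same host are listed too; adding `--filter name=ubios` narrows the check, assuming your containers share that name prefix.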

Individual Health Endpoints

Service	Command	Expected
Agno	`curl http://$HOST_IP:8001/health`	`{"status": "ok"}`
LiteLLM Proxy	`curl http://$HOST_IP:4000/health`	`{"status": "ok"}`
PostgreSQL	`docker inspect ubios_postgres --format='{{.State.Health.Status}}'`	`healthy`
MinIO	`curl -sf http://$HOST_IP:9000/minio/health/live`	HTTP 200 with an empty body (curl prints nothing and exits 0)
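The JSON endpoints can be checked in one pass with a small script. This sketch treats any connection error, timeout, or HTTP status of 400 or above as a failure (`curl -f`) and then looks for `"status": "ok"` in the body. The helper names `health_ok` and `check_endpoint` and the 5-second timeout are illustrative choices, not part of the stack.

```bash
#!/usr/bin/env bash
# Succeeds if the JSON health body on stdin contains "status": "ok".
# A plain grep keeps this dependency-free; jq would be stricter if installed.
health_ok() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Probe one endpoint. Arguments: a label and a URL.
# curl -f turns HTTP >= 400 into a failure; an empty body then fails health_ok.
check_endpoint() {
  local label=$1 url=$2
  if curl -sf --max-time 5 "$url" | health_ok; then
    echo "OK    $label"
  else
    echo "FAIL  $label ($url)"
    return 1
  fi
}
```

For example, `check_endpoint Agno "http://$HOST_IP:8001/health"`. MinIO's liveness probe returns an empty body, so check it with `curl -sf` alone rather than `check_endpoint`.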

Redis Monitoring

# Check Redis is responding
docker exec -it ubios_redis redis-cli ping
# Expected: PONG

# Memory usage
docker exec -it ubios_redis redis-cli info memory | grep used_memory_human

# Connected clients
docker exec -it ubios_redis redis-cli info clients | grep connected_clients

# Cache hit rate: keyspace_hits / (keyspace_hits + keyspace_misses)
docker exec -it ubios_redis redis-cli info stats | grep -E 'keyspace_(hits|misses)'
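`keyspace_hits` and `keyspace_misses` are cumulative counters since the Redis server started (or since `CONFIG RESETSTAT`), so the rate is a lifetime average rather than a live reading. A minimal sketch that turns the `info stats` output into a percentage; `redis_hit_rate` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Print the keyspace hit rate (percent, 2 decimals) from `redis-cli info stats` on stdin.
# hit rate = keyspace_hits * 100 / (keyspace_hits + keyspace_misses)
redis_hit_rate() {
  tr -d '\r' | awk -F: '
    $1 == "keyspace_hits"   { hits = $2 }
    $1 == "keyspace_misses" { misses = $2 }
    END {
      total = hits + misses
      if (total == 0) { print "n/a"; exit }   # no key lookups recorded yet
      printf "%.2f\n", hits * 100 / total
    }'
}

# Usage:
#   docker exec ubios_redis redis-cli info stats | redis_hit_rate
```

No `-it` is needed when piping; INFO lines end in `\r\n`, which the helper strips before parsing.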

Queue Monitoring

# Check that the queue worker is running (last 20 log lines)
docker compose -f docker-compose.ip-test.yml logs --tail=20 queue

# Failed jobs
docker exec -it ubios_api php artisan queue:failed

# Retry a failed job (or pass all instead of {job_id} to retry every failed job)
docker exec -it ubios_api php artisan queue:retry {job_id}

Agent Activity

# Recent agent outputs
docker exec -it ubios_api bash -c '
  PGPASSWORD=331331331 psql -U postgres -h ubios_postgres -d ubios -c \
    "SELECT agent_name, output_type, severity, is_read, created_at
     FROM ubios_agent_state.agent_outputs
     ORDER BY created_at DESC LIMIT 10;"
'

# Scheduled job status
docker exec -it ubios_api bash -c '
  PGPASSWORD=331331331 psql -U postgres -h ubios_postgres -d ubios -c \
    "SELECT * FROM ubios_agent_state.scheduled_jobs WHERE is_active = true;"
'

Log Aggregation

All services log to stdout/stderr, which Docker captures through its logging driver. View the logs in Dozzle (http://{IP}:8080) or from the CLI:

# Follow all logs with timestamps
docker compose -f docker-compose.ip-test.yml logs -f -t

# Show only lines that mention errors (Agno)
docker compose -f docker-compose.ip-test.yml logs -f agno 2>&1 | grep -i error

# Laravel error log
docker exec -it ubios_api tail -f storage/logs/laravel.log
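To see which service is producing errors, the combined log stream can be tallied per service. This sketch assumes Compose's default `name | message` line prefix (hence `--no-color`) and counts any line containing "error" in any case, which also matches benign text such as `error_count=0`. `count_errors_by_service` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Count lines containing "error" (case-insensitive) per service prefix in
# `docker compose logs --no-color` output on stdin, most errors first.
count_errors_by_service() {
  awk -F'|' '
    tolower($0) ~ /error/ {
      name = $1
      gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim the padding around the name
      count[name]++
    }
    END { for (n in count) print count[n], n }
  ' | sort -rn
}

# Usage:
#   docker compose -f docker-compose.ip-test.yml logs --no-color --since 1h | count_errors_by_service
```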

Key Metrics to Watch

Metric	Warning	Critical	Where to check
PostgreSQL connections	> 80% pool	> 95%	`pg_stat_activity`
Redis memory	> 80% maxmemory	> 95%	`redis-cli info memory`
Queue depth	> 100 jobs	> 500 jobs	`queue:failed` + Redis list length
Agno response time	> 5s P95	> 10s P95	`GET /health` + application logs
Disk usage	> 80%	> 90%	`df -h` on host
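The thresholds in the table can be applied mechanically from a cron job or shell check. This is a minimal sketch: `classify` and `disk_used_pct` are hypothetical helpers, values are whole-number percentages or job counts, and "above" is read strictly, matching the `>` in the table.

```bash
#!/usr/bin/env bash
# Classify an integer metric against warning/critical thresholds.
# Arguments: value warn crit. Prints OK, WARNING, or CRITICAL.
classify() {
  local value=$1 warn=$2 crit=$3
  if [ "$value" -gt "$crit" ]; then
    echo CRITICAL
  elif [ "$value" -gt "$warn" ]; then
    echo WARNING
  else
    echo OK
  fi
}

# Used space of a mount point (default /) as an integer percent.
# df -P guarantees one line per filesystem; column 5 is "Capacity" (e.g. 42%).
disk_used_pct() {
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

For example, `classify "$(disk_used_pct /)" 80 90` applies the disk row. For the PostgreSQL row, the percentage can come from `SELECT round(100.0 * count(*) / current_setting('max_connections')::int) FROM pg_stat_activity;` and be passed as `classify "$pct" 80 95`, assuming the pool limit is PostgreSQL's `max_connections`.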

Production Monitoring (Target)

Phase 3 includes centralized monitoring via Prometheus + Grafana on a control plane server:

- Container health status (all 12+ containers)
- PostgreSQL connection pool saturation and query latency
- Redis memory usage and eviction rate
- Agent session count, LLM call count, average duration
- Query cache hit rate
- Scheduled job success/failure rate
- Document extraction queue depth
- Disk usage (Postgres data, MinIO objects)