Operations
Monitoring
Container health checks, log aggregation, and system monitoring.
Uptime Kuma (Service Monitoring)
Uptime Kuma runs at http://{IP}:3001. It monitors all 15 containers via HTTP, TCP, and Docker health checks.
First-time setup:
- Open http://{IP}:3001 and create an admin account
- Add a new monitor for each service:
  - Type: HTTP for API services, Docker for containers
  - Interval: 60 seconds (recommended)
  - Retention: Keep defaults
Recommended monitors:
| Monitor | Type | Target |
|---|---|---|
| Laravel API | HTTP | http://ubios_api:8000/api/v1/health |
| Agno | HTTP | http://ubios_agno:8001/health |
| PostgreSQL | Docker | ubios_postgres |
| Redis | TCP | ubios_redis:6379 |
| MinIO | HTTP | http://ubios-minio:9000/minio/health/live |
| Metabase | HTTP | http://ubios_metabase:3000/api/health |
| Dify | HTTP | http://ubios_dify_api:5001/health |
| LiteLLM Proxy | HTTP | http://ubios_litellm:4000/health |
Dozzle (Log Viewer)
Dozzle runs at http://{IP}:8080. It shows real-time logs for all containers with zero configuration.
Features:
- View logs for any container by clicking its name
- Search logs with regex or SQL
- Split-screen to watch multiple containers
- Filter by severity level
Container Health Checks
Check all containers are running:
docker ps --format "table {{.Names}}\t{{.Status}}"
Expected: all 15 containers show Up, with (healthy) for services that define health checks (postgres, minio).
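With 15 containers, checking the status column by eye gets tedious. The sketch below filters the same `docker ps` output down to the containers that need attention. The function name `not_ready_containers` is hypothetical, and it assumes Docker's usual status strings (`Up ...`, `Exited ...`, `(unhealthy)`, `(health: starting)`).

```shell
# Reads "NAME<TAB>STATUS" lines on stdin and prints the name of every
# container that is not "Up", or that is Up but reports "(unhealthy)"
# or "(health: starting)". Containers without a health check count as
# ready as soon as their status starts with "Up".
not_ready_containers() {
  awk -F '\t' '
    $2 !~ /^Up / || $2 ~ /\(unhealthy\)/ || $2 ~ /\(health: starting\)/ {
      print $1
    }
  '
}

# Usage (printf turns \t into a literal tab for the format string):
#   docker ps -a --format "$(printf '{{.Names}}\t{{.Status}}')" | not_ready_containers
```

No output means every container is up and none is failing its health check. The `-a` flag is included so that stopped containers appear in the list.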
Individual Health Endpoints
| Service | Command | Expected |
|---|---|---|
| Agno | curl http://$HOST_IP:8001/health | {"status": "ok"} |
| LiteLLM Proxy | curl http://$HOST_IP:4000/health | {"status": "ok"} |
| PostgreSQL | docker inspect ubios_postgres --format='{{.State.Health.Status}}' | healthy |
| MinIO | curl -sf http://$HOST_IP:9000/minio/health/live | HTTP 200 (no output; exit code 0) |
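The JSON endpoints in the table above can be checked in a loop rather than one at a time. The sketch below is one way to do that. `status_is_ok` and `check_health` are hypothetical helper names, and the match is a plain `grep` for a `"status": "ok"` field, not a full JSON parse, so it assumes the response shape shown in the Expected column.

```shell
# Succeeds (exit 0) when stdin contains a "status" field equal to "ok".
# Whitespace around the colon is tolerated.
status_is_ok() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# check_health LABEL URL
# Fetches URL (5 second timeout) and prints OK or FAIL for the label.
check_health() {
  if curl -sf --max-time 5 "$2" | status_is_ok; then
    echo "OK   $1"
  else
    echo "FAIL $1"
    return 1
  fi
}

# Usage:
#   check_health agno    "http://$HOST_IP:8001/health"
#   check_health litellm "http://$HOST_IP:4000/health"
```

A `FAIL` line covers both an unreachable service and a response whose status is not `ok`. Run the `curl` command by hand to tell the two apart.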
Redis Monitoring
# Check Redis is responding
docker exec -it ubios_redis redis-cli ping
# Expected: PONG
# Memory usage
docker exec -it ubios_redis redis-cli info memory | grep used_memory_human
# Connected clients
docker exec -it ubios_redis redis-cli info clients | grep connected_clients
# Cache hit rate = keyspace_hits / (keyspace_hits + keyspace_misses)
docker exec -it ubios_redis redis-cli info stats | grep -E 'keyspace_(hits|misses)'
Queue Monitoring
# Check queue worker is running
docker compose -f docker-compose.ip-test.yml logs --tail=20 queue
# Failed jobs
docker exec -it ubios_api php artisan queue:failed
# Retry a failed job
docker exec -it ubios_api php artisan queue:retry {job_id}
Agent Activity
# Recent agent outputs
docker exec -it ubios_api bash -c '
PGPASSWORD=331331331 psql -U postgres -h ubios_postgres -d ubios -c \
"SELECT agent_name, output_type, severity, is_read, created_at
FROM ubios_agent_state.agent_outputs
ORDER BY created_at DESC LIMIT 10;"
'
# Scheduled job status
docker exec -it ubios_api bash -c '
PGPASSWORD=331331331 psql -U postgres -h ubios_postgres -d ubios -c \
"SELECT * FROM ubios_agent_state.scheduled_jobs WHERE is_active = true;"
'
Log Aggregation
All services log to stdout/stderr (Docker logging driver). View them via Dozzle (http://{IP}:8080) or the CLI:
# Follow all logs with timestamps
docker compose -f docker-compose.ip-test.yml logs -f -t
# Filter by severity (Agno)
docker compose -f docker-compose.ip-test.yml logs -f agno 2>&1 | grep -i error
# Laravel error log
docker exec -it ubios_api tail -f storage/logs/laravel.log
Key Metrics to Watch
| Metric | Warning | Critical | Where to check |
|---|---|---|---|
| PostgreSQL connections | > 80% pool | > 95% | pg_stat_activity |
| Redis memory | > 80% maxmemory | > 95% | redis-cli info memory |
| Queue depth | > 100 jobs | > 500 jobs | queue:failed + Redis list length |
| Agno response time | > 5s P95 | > 10s P95 | GET /health + application logs |
| Disk usage | > 80% | > 90% | df -h on host |
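The thresholds above can be applied mechanically once a metric has been reduced to a number. The following is a minimal sketch, not part of the stack itself: `classify` and `disk_used_pct` are hypothetical helpers, and the comparison is strictly greater-than, matching the `>` in the table.

```shell
# classify VALUE WARN CRIT
# Prints "critical" if VALUE > CRIT, "warning" if VALUE > WARN, else "ok".
classify() {
  awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
    if (v + 0 > c + 0)      print "critical"
    else if (v + 0 > w + 0) print "warning"
    else                    print "ok"
  }'
}

# Percentage of the root filesystem in use, as a bare number (e.g. 42).
# Uses POSIX df output, where column 5 is the capacity with a % suffix.
disk_used_pct() {
  df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage, with the disk thresholds from the table (80% / 90%):
#   classify "$(disk_used_pct)" 80 90
```

The same `classify` call works for queue depth (`classify "$depth" 100 500`), provided the depth has already been read from Redis.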
Production Monitoring (Target)
Phase 3 includes centralized monitoring via Prometheus + Grafana on a control plane server:
- Container health status (all 12+ containers)
- PostgreSQL connection pool saturation and query latency
- Redis memory usage and eviction rate
- Agent session count, LLM call count, average duration
- Query cache hit rate
- Scheduled job success/failure rate
- Document extraction queue depth
- Disk usage (Postgres data, MinIO objects)