z4j vs. Grafana + Prometheus: Task-Level Control vs. Aggregate Metrics
The DIY metrics approach. Great for SRE, wrong tool for task ops.
- Scope: Any engine, via custom exporters
- Positioning: Metrics and alerting, not task-level control
- License: AGPL-3 (Grafana) + Apache-2 (Prometheus)
Credit where it's due
Reasons to choose z4j over Grafana + Prometheus
Different layers, both useful
Grafana is the right tool for aggregate metrics, SLO dashboards, and on-call alerting. z4j is the right tool for the operator who has to retry the stuck job, edit the schedule, or audit who triggered what. They are not substitutes; they sit on different layers of the stack.
Per-task drill-down that metrics cannot give
Prometheus stores numbers. The args, kwargs, exception traceback, and full task lineage of a specific failed run are not in the metric. z4j keeps every event with structured fields and per-task redaction, so you can investigate the one task that broke without grepping logs.
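To make the distinction concrete, here is a minimal Python sketch. It is not z4j's data model: the field names (task_id, args, kwargs, traceback), the SENSITIVE_KEYS set, and the redact helper are all hypothetical, chosen only for illustration. It contrasts a labeled counter, which is all a metric retains, with a structured per-attempt event that keeps the redacted call details.

```python
from collections import Counter

# Hypothetical list of keys to mask; a real system would make this configurable per task.
SENSITIVE_KEYS = {"password", "token", "api_key"}


def redact(kwargs):
    """Return a copy of kwargs with sensitive values masked."""
    return {k: ("***" if k in SENSITIVE_KEYS else v) for k, v in kwargs.items()}


# What a metric keeps: one number per label set.
failures = Counter()

# What a task-level event store keeps: one structured record per attempt.
events = []


def record_failure(task_id, name, args, kwargs, traceback_text):
    """Update both views of the same failed run."""
    failures[(name, "failure")] += 1
    events.append({
        "task_id": task_id,
        "name": name,
        "state": "failure",
        "args": list(args),
        "kwargs": redact(kwargs),
        "traceback": traceback_text,
    })


record_failure("a1", "send_invoice", (42,), {"token": "s3cret", "currency": "EUR"},
               "ValueError: unknown customer 42")
record_failure("b2", "send_invoice", (43,), {"token": "s3cret", "currency": "USD"},
               "TimeoutError: SMTP timed out")

# The counter can say that two runs failed, but not which ones or why.
print(failures[("send_invoice", "failure")])

# The event list can answer "what happened to task a1?".
print(next(e for e in events if e["task_id"] == "a1"))
```

The counter is the right shape for alerting on a failure rate; the event record is the right shape for debugging one run.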
Real actions on tasks, not just dashboards
Grafana cannot retry a task or edit a Celery Beat schedule. z4j has retry, cancel, bulk actions, and schedule CRUD as first-class buttons in the UI, with every action recorded in the audit log.
Setup cost, measured honestly
Building a Grafana dashboard for Celery from scratch (exporter, scrape config, dashboards, alert rules) takes hours to days of platform work. z4j starts with a single docker compose up. We recommend running both: z4j for task ops, Grafana for SLOs.
z4j vs. Grafana + Prometheus, feature by feature
| Capability | Grafana + Prometheus | z4j |
|---|---|---|
| Aggregate throughput / latency charts | Excellent | Built-in, no exporter needed |
| Per-task drill-down (args, kwargs, stack) | Not possible - metrics strip that | Every task, every attempt, with redaction |
| Retry / cancel / bulk actions | None - read-only dashboards | Universal across every engine |
| Schedule CRUD | None | Create / edit / delete / trigger-now |
| Audit log | None | HMAC-chained, tamper-evident |
| Setup cost | Hours to days - exporters, dashboards, alert rules | One docker compose up |
| Engine-specific signals | Whatever you export yourself | Built-in adapters for 6 engines |
| Multi-tenant project scoping | Grafana Orgs (heavy) | Per-project RBAC out of the box |
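For readers unfamiliar with the "HMAC-chained, tamper-evident" entry in the table, the general technique can be sketched in a few lines of Python. This is a generic illustration and not z4j's implementation: the record layout, the SECRET key, and the GENESIS value are assumptions. Each record's MAC covers the previous record's MAC, so editing any earlier entry invalidates verification from that point on.

```python
import hashlib
import hmac
import json

SECRET = b"demo-key"  # assumption: in practice a securely stored secret
GENESIS = "0" * 64    # assumption: fixed starting value for the chain


def _mac(prev_mac, entry):
    """MAC over the previous MAC plus the canonical JSON of this entry."""
    payload = prev_mac.encode() + json.dumps(entry, sort_keys=True).encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def append(log, entry):
    """Append an entry, chaining it to the last record's MAC."""
    prev = log[-1]["mac"] if log else GENESIS
    log.append({"entry": entry, "mac": _mac(prev, entry)})


def verify(log):
    """Recompute the chain and report whether every MAC still matches."""
    prev = GENESIS
    for record in log:
        expected = _mac(prev, record["entry"])
        if not hmac.compare_digest(expected, record["mac"]):
            return False
        prev = record["mac"]
    return True


log = []
append(log, {"actor": "alice", "action": "retry", "task_id": "a1"})
append(log, {"actor": "bob", "action": "cancel", "task_id": "b2"})
print(verify(log))  # True

# Rewriting history without the key breaks the chain.
log[0]["entry"]["actor"] = "mallory"
print(verify(log))  # False
```

Because the MAC is keyed, someone who can edit the log but does not hold the key cannot recompute a valid chain after tampering.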
z4j vs. Grafana + Prometheus: FAQ
Should I replace Grafana with z4j?
No. Grafana and z4j solve different problems. Grafana is the right place for aggregate latency, throughput, error-rate alerting, and SLO dashboards. z4j owns the task-level layer: per-job history, retry/cancel actions, schedule CRUD, and audit. Most teams run both.
Does z4j export Prometheus metrics?
Yes. z4j serves them at /metrics, so your existing Grafana stack can scrape z4j the same way it scrapes any other service. The metric set covers task throughput, queue depth, agent connection state, and event-pipeline lag.
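As a rough illustration of what a scraper sees, the Python sketch below parses a sample in the Prometheus text exposition format. The metric names (example_tasks_total, example_queue_depth) and labels are placeholders, since this page does not list z4j's actual metric names, and the parser is deliberately simplified: it ignores escaped characters in label values and any trailing timestamps.

```python
import re

# Placeholder payload in Prometheus text format; names and labels are hypothetical.
SAMPLE = """\
# HELP example_tasks_total Tasks seen, by state.
# TYPE example_tasks_total counter
example_tasks_total{engine="celery",state="success"} 1280
example_tasks_total{engine="celery",state="failure"} 17
# TYPE example_queue_depth gauge
example_queue_depth{queue="default"} 42
"""

# metric_name, optional {labels}, then the sample value.
LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match:
            labels = dict(LABEL.findall(match.group("labels") or ""))
            samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


samples = parse_metrics(SAMPLE)
failed = sum(v for name, labels, v in samples
             if name == "example_tasks_total" and labels.get("state") == "failure")
print(failed)  # 17.0
```

In a real deployment Prometheus does this parsing itself; the sketch only shows the shape of the data a scrape returns.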
Can z4j replace celery-prometheus-exporter?
For most teams, yes. z4j's /metrics endpoint covers what celery-prometheus-exporter exposes plus engine-agnostic metrics across RQ, Dramatiq, Huey, arq, and taskiq. If you have custom exporter rules you depend on, run them alongside z4j; the two do not conflict.
Where should alerts live, Grafana or z4j?
Both. Use Grafana for site-wide SLO and infrastructure alerting via Prometheus rules and Alertmanager. Use z4j's notifications for task-specific events (failure spikes, queue backlog, scheduler drift) routed to email, Slack, Telegram, or webhook with cooldowns.
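The cooldown idea can be sketched generically in Python. This is an assumption about how such a feature typically behaves, not a description of z4j's notifier: the CooldownNotifier class, its parameters, and the alert keys are hypothetical. Repeat alerts with the same key are suppressed until the cooldown window has elapsed.

```python
class CooldownNotifier:
    """Suppress repeat notifications for the same key within a cooldown window."""

    def __init__(self, cooldown_seconds, send):
        self.cooldown_seconds = cooldown_seconds
        self._send = send        # callable(key, message), e.g. a Slack or email sender
        self._last_sent = {}     # key -> timestamp of the last delivered notification

    def notify(self, key, message, now):
        """Deliver the message unless the key is still cooling down.

        Returns True if the message was sent, False if it was suppressed.
        """
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False
        self._send(key, message)
        self._last_sent[key] = now
        return True


delivered = []
notifier = CooldownNotifier(60, lambda key, msg: delivered.append((key, msg)))

# Timestamps are passed in explicitly so the example is deterministic.
print(notifier.notify("queue-backlog", "default queue above 1000", now=0))    # True
print(notifier.notify("queue-backlog", "default queue above 1200", now=30))   # False
print(notifier.notify("failure-spike", "send_invoice failing", now=30))       # True
print(notifier.notify("queue-backlog", "default queue above 1500", now=60))   # True
```

Keying the cooldown per event type means a noisy backlog alert does not mask an unrelated failure spike.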
Where we don't win
If your platform team already runs Grafana, you already have aggregate metrics. z4j is a complement, not a replacement: it owns the task-level layer.
For site-wide SLO dashboards and alerting, Grafana is the right tool. z4j focuses on task operators: the person who has to figure out why a specific job is stuck.
We recommend running both. z4j exposes Prometheus metrics at /metrics so your Grafana dashboards can pull from it.
Try z4j alongside Grafana + Prometheus, no migration required
Run both for a week. Compare. Decide.