Observability¶
Two modules provide observability: failover-scanner discovers @Failover methods at startup; failover-observable-micrometer adds Micrometer counters and a health indicator.
failover-scanner¶
Walks the Spring ApplicationContext at startup, finds all @Failover-annotated methods, and registers them with the ObservablePublisher.
<dependency>
    <groupId>com.societegenerale.failover</groupId>
    <artifactId>failover-scanner</artifactId>
    <version>3.0.0</version>
</dependency>
At startup, the scanner logs a summary:
INFO FailoverScanner: Discovered 5 @Failover methods:
- country-by-code (domain=country, expiry=24h)
- all-countries (domain=country, expiry=24h)
- product-by-id (expiry=6h)
- exchange-rates (expiry=1h)
- client-profile (expiry=12h)
The scanner also warns when two @Failover annotations share a domain but have mismatched expiry configurations.
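The mismatch warning can be pictured as a grouping check. The sketch below is illustrative only: `FailoverPoint` and `ExpiryMismatchCheck` are hypothetical names, not types from failover-scanner, and the fallback from an unset domain to the name is an assumption borrowed from the `domain` tag description.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; these types are not part of failover-scanner.
final class ExpiryMismatchCheck {

    /** One discovered @Failover method. */
    record FailoverPoint(String name, String domain, Duration expiry) {
        // Assumption: an unset domain falls back to the name.
        String effectiveDomain() {
            return (domain == null || domain.isBlank()) ? name : domain;
        }
    }

    /** Returns every domain that carries more than one distinct expiry. */
    static Map<String, Set<Duration>> mismatchedDomains(List<FailoverPoint> points) {
        Map<String, Set<Duration>> byDomain = new LinkedHashMap<>();
        for (FailoverPoint p : points) {
            byDomain.computeIfAbsent(p.effectiveDomain(), d -> new LinkedHashSet<>())
                    .add(p.expiry());
        }
        // Keep only domains whose methods disagree on expiry.
        byDomain.values().removeIf(expiries -> expiries.size() < 2);
        return byDomain;
    }

    public static void main(String[] args) {
        List<FailoverPoint> points = List.of(
                new FailoverPoint("country-by-code", "country", Duration.ofHours(24)),
                new FailoverPoint("all-countries", "country", Duration.ofHours(12)),
                new FailoverPoint("product-by-id", null, Duration.ofHours(6)));
        mismatchedDomains(points).forEach((domain, expiries) ->
                System.out.println("WARN domain '" + domain + "' has mismatched expiries: " + expiries));
    }
}
```

In this sketch only the `country` domain is reported, because its two methods declare different expiries.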
failover-observable-micrometer¶
Extends the scanner with Micrometer counters and a Spring Boot Actuator health indicator.
<dependency>
    <groupId>com.societegenerale.failover</groupId>
    <artifactId>failover-observable-micrometer</artifactId>
    <version>3.0.0</version>
</dependency>
Includes failover-scanner transitively.
Micrometer Counter¶
Counters: failover.store.total{name, stored} (one per store) and failover.recover.total{name, recovered, recovery_failed} (one per recover attempt). A Timer failover.operation.duration{name, action} records wall time.
Counter name: failover.recovery.outcome.total — one event per intercepted method call; the source for the failover / recovery / non-recovery rates. See Observability how-to.
| Tag | Values |
|---|---|
| name | The @Failover(name=...) value |
| domain | The @Failover(domain=...), falling back to name |
| method | The intercepted method as SimpleClass#method |
| outcome | recovered, not_recovered, error |
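As a rough illustration of how rates could be derived from the `outcome` tag, the sketch below divides each outcome count by the total number of recorded events. The class `RecoveryRates` is hypothetical, and the rate definitions are an assumption; the Observability how-to remains the reference for the actual queries.

```java
// Hypothetical sketch; the rate definitions are assumptions for illustration.
final class RecoveryRates {

    /** Share of each outcome among all recorded events (each value in 0.0..1.0). */
    record Rates(double recovered, double notRecovered, double error) {}

    static Rates of(long recovered, long notRecovered, long error) {
        long total = recovered + notRecovered + error;
        if (total == 0) {
            // No events recorded yet: report zero rates instead of dividing by zero.
            return new Rates(0.0, 0.0, 0.0);
        }
        return new Rates(
                (double) recovered / total,
                (double) notRecovered / total,
                (double) error / total);
    }

    public static void main(String[] args) {
        // Example counts per outcome tag value, e.g. read from a metrics backend.
        System.out.println(of(8, 1, 1));
    }
}
```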
Counter name: failover.store.async.failed — incremented when an async write fails inside the executor (the async store layer is otherwise visible only in logs).
| Tag | Values |
|---|---|
| name | The @Failover(name=...) value |
| operation | store, delete, cleanByExpiry |
| exception_type | The failure's class name |
Full meter catalog¶
All failover.* meters are listed below. In Prometheus, counters keep the _total suffix, timers export _sum/_count/_max in seconds, and gauges export the bare name. An instance tag for cluster attribution is controlled by failover.observable.instance.mode. The default, auto, tags push registries such as OTLP or Elastic and skips a Prometheus registry, because the scrape adds instance itself; always and never override that behavior. Configure it on the @Failover service; on k8s/Docker, set failover.observable.instance.id=${HOSTNAME}.
| Meter | Type | Key tags | Meaning |
|---|---|---|---|
| failover.call.total | counter | name, domain, result (success\|failover) | Per-call volume: clean upstream success vs failover triggered. |
| failover.user.impact.total | counter | name, domain, impact (unblocked\|blocked) | Business signal: caller got a value (fresh or recovered) vs got nothing. |
| failover.recovery.outcome.total | counter | name, domain, method, outcome | Recovery breakdown (recovered/not_recovered/error); source of the rates. |
| failover.recovery.partial.total | counter | name, method | Scatter/gather recoveries where some slices were missing. |
| failover.exception.total | counter | name, exception_type, cause_type, final_cause_type | Which exception (and root cause) triggered failover. |
| failover.store.total | counter | name, stored | Store attempts. |
| failover.store.async.failed | counter | name, operation, exception_type | Async store-layer failures. |
| failover.operation.duration | timer (+percentile histogram) | name, action (store\|recover) | Store/recover path latency → p50/p95/p99. |
| failover.upstream.duration | timer (+percentile histogram) | name, result (success\|failure) | Latency of the protected upstream call itself. |
| failover.api.health | gauge | name, domain | Recent fraction of calls where the caller got a value (1.0 healthy; lower = users blocked). |
| failover.stale.served.ratio | gauge | name, domain | Recent fraction of calls served from stored (stale) data. |
| failover.live.entries | gauge | name, domain | Current stored entry count (cache footprint). In-memory/Caffeine stores only; absent for JDBC/multi-tenant. |
| failover.metrics.dropped.total | counter | (none) | Metrics dropped because the non-blocking publish queue was full (see non-blocking). Active only when async publishing is on. |
| failover.registered.total | gauge | (none) | Number of discovered @Failover methods. |
| failover.config.expiry.seconds | gauge | name, domain, unit | Configured expiry per failover point. |
Cardinality: name/domain/action/result/impact/outcome are low-cardinality enums; exception tags use class names. Never tag with the raw store key or exception messages. A guard (failover.observable.cardinality) caps distinct name values.
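To make the idea of a cardinality cap concrete, here is a minimal sketch of a guard that admits a bounded number of distinct name values. `NameCardinalityGuard` is a hypothetical class, and collapsing overflow values into an `other` placeholder is an assumption; the source does not state what the real guard does once the cap is reached.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; the library's actual guard may behave differently.
final class NameCardinalityGuard {

    // Assumed placeholder for names arriving after the cap is reached.
    static final String OVERFLOW = "other";

    private final int maxDistinct;
    private final Set<String> admitted = new HashSet<>();

    NameCardinalityGuard(int maxDistinct) {
        this.maxDistinct = maxDistinct;
    }

    /** Returns the tag value to use: the name itself while under the cap, else the placeholder. */
    synchronized String admit(String name) {
        if (admitted.contains(name)) {
            return name; // already tracked, no new series created
        }
        if (admitted.size() >= maxDistinct) {
            return OVERFLOW; // cap reached: do not create another series
        }
        admitted.add(name);
        return name;
    }

    public static void main(String[] args) {
        NameCardinalityGuard guard = new NameCardinalityGuard(2);
        for (String name : new String[] {"country-by-code", "product-by-id", "exchange-rates"}) {
            System.out.println(name + " -> " + guard.admit(name));
        }
    }
}
```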
Health Indicator¶
Registered at /actuator/health under the failover component:
{
  "failover": {
    "status": "UP",
    "details": {
      "enabled": "true",
      "type": "BASIC",
      "store.type": "JDBC",
      "store.jdbc.table-prefix": "MYAPP_",
      "scheduler.enabled": "true"
    }
  }
}
ObservablePublisher SPI¶
AdvancedFailoverHandler calls ObservablePublisher.publish(Metrics) after every store and recover event. Implement this interface to route metrics to any custom sink:
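import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;
// ObservablePublisher and Metrics come from the failover library; import them from its packages.

@Component
public class MyPublisher implements ObservablePublisher {

    private static final Logger log = LoggerFactory.getLogger(MyPublisher.class);

    @Override
    public void publish(Metrics metrics) {
        // Log the failover point, the action, and how long it took.
        log.info("failover event: name={} action={} duration={}ns",
                metrics.getName(),
                metrics.get("action"),
                metrics.get("duration-ns"));
    }
}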
Metrics.toMap() returns all key/value pairs collected during the operation.
Non-blocking by construction¶
Every ObservablePublisher — the built-in ones and your custom bean — runs off the caller's thread, so publishing can never block or slow the @Failover business call. You get this for free; no async code in your publisher.
How: all ObservablePublisher beans are gathered into a single CompositeObservablePublisher, which is wrapped in an AsyncObservablePublisher. The @Failover path only ever calls that wrapper — it does a bounded, non-blocking hand-off to a virtual-thread drain worker, and your publish(...) runs there. A full queue drops the metric (counted as failover.metrics.dropped.total) rather than back-pressuring the caller.
Implications for a custom publisher:
- Do not assume publish(...) runs on the request thread: no ThreadLocal/request-scoped state, no MDC unless you set it yourself.
- A slow or failing publisher cannot stall the business call; an exception is logged and the drain loop continues.
- Disable globally for deterministic tests with failover.observable.async.enabled=false (publishes synchronously). Tune the buffer with failover.observable.async.queue-capacity.
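The hand-off described above can be sketched with a bounded queue and a single drain worker. This is not the library's AsyncObservablePublisher: `BoundedHandOff` is a hypothetical class, and a daemon platform thread stands in for the virtual-thread worker so that the sketch also compiles on older JDKs.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Hypothetical sketch of a bounded, non-blocking hand-off with a drop counter.
final class BoundedHandOff<T> {

    private final BlockingQueue<T> queue;
    private final Consumer<T> delegate;
    private final AtomicLong dropped = new AtomicLong();

    BoundedHandOff(int capacity, Consumer<T> delegate) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.delegate = delegate;
    }

    /** Called on the business thread: offer() never blocks, and a full queue drops the event. */
    boolean publish(T event) {
        boolean accepted = queue.offer(event);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Starts the worker that runs the delegate off the caller's thread. */
    Thread startDrainWorker() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    T event = queue.take();
                    try {
                        delegate.accept(event);
                    } catch (RuntimeException e) {
                        // A failing publisher is reported and the loop continues.
                        System.err.println("publisher failed: " + e);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop draining on interrupt
            }
        }, "metrics-drain");
        worker.setDaemon(true);
        worker.start();
        return worker;
    }

    long droppedCount() {
        return dropped.get();
    }

    public static void main(String[] args) throws InterruptedException {
        BoundedHandOff<String> handOff = new BoundedHandOff<>(2, e -> System.out.println("published " + e));
        handOff.publish("store");
        handOff.publish("recover");
        handOff.publish("store"); // queue is full and no worker is draining yet, so this one is dropped
        handOff.startDrainWorker();
        Thread.sleep(200);
        System.out.println("dropped=" + handOff.droppedCount());
    }
}
```

The key design choice is `offer()` rather than `put()`: the caller pays at most one failed enqueue, and the loss becomes visible through the drop counter instead of added latency.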
The Write Axis (failover service) — how meters leave the app¶
The write axis = the failover service (the app with @Failover). It emits failover.* meters; it never reads them back. (The read axis = the dashboard service, covered in Dashboard.) Two stages: (1) the framework records the meter in the local Micrometer registry off the caller thread; (2) a Micrometer exporter ships it to a backend (or Prometheus scrapes it).
Single instance¶
flowchart LR
A["@Failover method"] -->|store / recover event| P["AsyncObservablePublisher<br/>(off caller thread)"]
P --> M["Micrometer registry<br/>failover.* meters"]
M -->|"scrape /actuator/prometheus<br/>or push (OTLP/Elastic)"| B[("backend")]
One JVM, one registry. The numbers are complete for that JVM. Nothing to attribute: instance doesn't matter.
Multiple instances¶
Each instance has its own registry and emits its own failover.*. To tell them apart downstream, every series needs an instance label — supplied either by Prometheus (scrape) or by the app (push). failover.observable.instance.mode=auto does the right thing per backend (tags push, skips Prometheus).
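The per-backend decision can be written down as a small truth table. The sketch below is only a model of the described behavior: `InstanceTagPolicy`, `Mode`, and `RegistryKind` are hypothetical names, and the real framework inspects actual Micrometer registry types rather than an enum.

```java
// Hypothetical sketch of the instance.mode decision; not the library's implementation.
final class InstanceTagPolicy {

    /** Mirrors the documented values of failover.observable.instance.mode. */
    enum Mode { AUTO, ALWAYS, NEVER }

    /** Simplified view of a registry: scraped (Prometheus) or push (OTLP, Elastic, Datadog, ...). */
    enum RegistryKind { PROMETHEUS, PUSH }

    /** Returns true when the app itself should add the instance tag for this registry. */
    static boolean shouldTag(Mode mode, RegistryKind kind) {
        return switch (mode) {
            case ALWAYS -> true;
            case NEVER -> false;
            // auto: the Prometheus scrape already adds instance, push backends get no such label.
            case AUTO -> kind == RegistryKind.PUSH;
        };
    }

    public static void main(String[] args) {
        for (RegistryKind kind : RegistryKind.values()) {
            System.out.println("auto + " + kind + " -> tag=" + shouldTag(Mode.AUTO, kind));
        }
    }
}
```

With a composite registry, the same decision would be applied per delegate, which matches the "tag the push delegate, skip the Prometheus delegate" behavior described in this section.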
flowchart LR
subgraph C["Write axis — failover service · N instances"]
I1["instance-1<br/>registry"]
I2["instance-2<br/>registry"]
I3["instance-3<br/>registry"]
end
I1 & I2 & I3 -->|"each tagged instance=<id><br/>(scrape adds it, or mode=auto on push)"| B[("backend<br/>Prometheus / OTLP collector")]
mode=auto per backend:
flowchart TB
R["failover.* meter id"] --> Q{registry type?}
Q -->|Prometheus| K["leave untagged<br/>(scrape adds instance)"]
Q -->|"push: OTLP / Elastic / Datadog"| T["add instance=<id><br/>(no scrape-time label otherwise)"]
Q -->|"composite (both)"| Z["tag the push delegate,<br/>skip the Prometheus delegate"]
Choosing a Micrometer registry¶
The framework adds no exporter — you choose the backend by putting a Micrometer registry on the classpath; Spring Boot auto-configures it and the failover.* meters flow through automatically (failover.observable.instance.mode=auto adds the per-instance label on push registries, skips it on Prometheus — §The Write Axis). Pick by how the backend ingests metrics:
| Use case | Registry dependency | Model | Notes |
|---|---|---|---|
| Local Prometheus / Grafana | io.micrometer:micrometer-registry-prometheus | scrape (/actuator/prometheus) | Most common; Prometheus adds instance at scrape. The dashboard cluster.mode=prometheus reads it back via PromQL. |
| Vendor-neutral, one exporter for many backends | io.micrometer:micrometer-registry-otlp | push (OTLP) | Recommended for multi-backend: app → OpenTelemetry Collector → fan-out to Prometheus / Elastic / Datadog / Grafana Cloud. mode=auto tags instance. |
| Elastic / ELK | io.micrometer:micrometer-registry-elastic | push | Metrics into Elasticsearch; pairs with Kibana. |
| Datadog | io.micrometer:micrometer-registry-datadog | push | Direct to the Datadog API (or via OTLP). |
| New Relic | io.micrometer:micrometer-registry-new-relic | push | Direct to New Relic. |
| AWS CloudWatch | io.micrometer:micrometer-registry-cloudwatch2 | push | For AWS-native dashboards/alarms. |
| InfluxDB | io.micrometer:micrometer-registry-influx | push | Time-series DB; Grafana on top. |
| Graphite | io.micrometer:micrometer-registry-graphite | push (hierarchical) | Legacy/StatsD-style stacks. |
| Dev / tests / embedded dashboard local | SimpleMeterRegistry (in micrometer-core, always present) | in-memory | No export; the dashboard's local mode reads this registry directly. |
Notes:
- No failover module is needed for any of these; they're plain Micrometer registries the consumer adds. The dashboard's read side (cluster.mode) is separate from the write side (which registry you export with), and they don't pair 1:1: only Prometheus is read back by the embedded dashboard; for other push backends use the vendor's UI or shared-store. See Pairing the read axis with your backend.
- Multiple at once: add several registries and Spring Boot creates a CompositeMeterRegistry; failover.* is published to all, and mode=auto tags only the push delegates.
- Versions are managed by spring-boot-dependencies: declare the micrometer-registry-* artifact without a version.
- Spring Boot exposes registry settings under management.<registry>.metrics.export.* (e.g. management.otlp.metrics.export.url, management.prometheus.metrics.export.enabled).
See the Micrometer docs for the full list, configuration, and capabilities of each registry: https://docs.micrometer.io/micrometer/reference/implementations.html (and Spring Boot's metrics export reference).
Next Steps¶
- Observability How-to — Prometheus/Grafana setup
- Dashboard — the read axis (single vs cluster) over these meters
- Scheduler — daily report publisher