Observability¶

Failover emits structured logs and Micrometer counters on every operation. Add failover-observable-micrometer to enable metrics and the Actuator health indicator.

Enable Micrometer Metrics¶

pom.xml

<dependency>
    <groupId>com.societegenerale.failover</groupId>
    <artifactId>failover-observable-micrometer</artifactId>
    <version>3.0.0</version>
</dependency>

Requires micrometer-core and spring-boot-starter-actuator on the classpath (typically already present).

Micrometer Counters¶

MicrometerObservablePublisher emits a dedicated meter per operation:

Meter	Type	Tags
`failover.store.total`	Counter	`name`, `stored`
`failover.recover.total`	Counter	`name`, `recovered`, `recovery_failed`
`failover.recovery.outcome.total`	Counter	`name`, `domain`, `method`, `outcome` (see below)
`failover.recovery.partial.total`	Counter	`name`, `method` (scatter/gather: some-but-not-all slices recovered)
`failover.exception.total`	Counter	`name`, `exception_type`, `cause_type`
`failover.operation.duration`	Timer	`name`, `action`
`failover.store.async.failed`	Counter	`name`, `operation`, `exception_type`

failover.store.total — one increment per store; stored=true|false.
failover.recover.total — one increment per recover attempt (one per intercepted method call).
failover.recovery.outcome.total — the per-method failover / recovery / non-recovery rates (below).

Gauges (configuration & capacity)¶

FailoverMeterBinder registers these gauges:

Gauge	Tags	Meaning
`failover.registered.total`	—	Number of `@Failover`-annotated methods discovered.
`failover.config.expiry.seconds`	`name`, `domain`, `unit`	Configured expiry per failover.
`failover.live.entries`	`name`, `domain`	Current stored entry count per failover (cache footprint / capacity).

failover.live.entries is registered only when the store can report its size: always for in-memory / Caffeine, and for JDBC only when opted in with failover.store.jdbc.live-entries-gauge-enabled=true (it issues a SELECT COUNT(*) per scrape; audit A7). Use it to monitor table growth and alert on capacity. Not available in multi-tenant mode.

Failover / Recovery / Non-recovery Rate (per method)¶

failover.recovery.outcome.total is a single counter, recorded once per intercepted method call (the composite — a findAll() is one event, not one per entity), from which the three operational rates are derived. It is tagged by the actual method, so two methods sharing a referential name /domain (e.g. getById and findAll) are distinguishable.

Counter	Tag	Values
`failover.recovery.outcome.total`	`name`	the `@Failover(name=...)` value
	`domain`	the `@Failover(domain=...)`, falling back to `name`
	`method`	the intercepted method as `SimpleClass#method` (e.g. `CountryService#findAll`)
	`outcome`	`recovered`, `not_recovered`, `error`

recovered — a stored value was returned within its expiry (user unblocked).
not_recovered — no stored value (not found or expired) — actual user impact.
error — the recover path itself threw (store/serialization fault); kept distinct so a fault is never miscounted as a clean miss.

# Failover rate — upstream failures intercepted, per method
sum(rate(failover_recovery_outcome_total[5m])) by (name, method)

# Recovery rate — failures resolved from the store
rate(failover_recovery_outcome_total{outcome="recovered"}[5m])

# Non-recovery rate — failures with no stored result (alert on this)
rate(failover_recovery_outcome_total{outcome="not_recovered"}[5m])

# Alert: non-recovery share of intercepted failures climbs for a method
sum by (name, method) (rate(failover_recovery_outcome_total{outcome="not_recovered"}[5m]))
/
sum by (name, method) (rate(failover_recovery_outcome_total[5m]))
  > 0.2

Async Store Failure Counter¶

When failover.store.async=true (the default), store writes run on a background executor and any failure there is swallowed so it never breaks the business call. To keep a silently-degraded store layer visible, FailoverStoreAsync emits a dedicated counter on every dropped write:

Counter	Tag	Values
`failover.store.async.failed`	`name`	The `@Failover(name=...)` value
	`operation`	`store`, `delete`, `cleanByExpiry`
	`exception_type`	The failure's class name

This counter covers both ways an async write can be lost:

In-flight failure — the task ran but the delegate store threw (e.g. DB down, connection pool exhausted). exception_type is the store's exception.
Submit-time rejection — the task was never accepted because the executor was saturated or shutting down. With a bounded executor (failover.store.async-executor.concurrency-limit > 0, rejection policy ABORT) exception_type is java.util.concurrent.RejectedExecutionException. This closes the gap where a saturated executor would otherwise drop writes with no metric.

Alert on any increase — it means failover data is not being persisted:

increase(failover_store_async_failed_total[5m]) > 0

DISCARD rejection policy

The DISCARD rejection policy intentionally drops the task and logs a WARN without incrementing this counter — discarding is the configured behaviour, not a failure. If you need every dropped write counted as an alertable signal, use the ABORT policy (the rejection is then metered as above) rather than DISCARD.

Prometheus Scrape Example¶

prometheus.yml

scrape_configs:
  - job_name: myapp
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: /actuator/prometheus

Query to track the recovery success ratio (recovered ÷ all intercepted failures), per method:

sum by (name, method) (rate(failover_recovery_outcome_total{outcome="recovered"}[5m]))
/
sum by (name, method) (rate(failover_recovery_outcome_total[5m]))

Kibana Dashboard¶

micrometer-registry-elastic pushes failover.* counters and timers to Elasticsearch on a configurable interval. Build native Kibana visualizations directly from that data — no additional instrumentation needed.

1. Add the Elastic Registry¶

pom.xml

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-elastic</artifactId>
</dependency>

2. Configure the Export¶

application.yml

management:
  elastic:
    metrics:
      export:
        enabled: true
        host: https://your-elasticsearch:9200
        index: failover-metrics
        step: 60s
        user-name: elastic
        password: ${ES_PASSWORD}
        # api-key-credentials: ${ES_API_KEY}  # alternative to user/password

Metrics land in index failover-metrics-YYYY-MM-DD by default. Each document contains the metric name, its type-specific value fields (count, sum, mean, max), the push @timestamp, and all Micrometer tags as top-level fields.

3. Create a Kibana Data View¶

In Kibana → Stack Management → Data Views, create:

Field	Value
Name	`Failover Metrics`
Index pattern	`failover-metrics-*`
Time field	`@timestamp`

4. Key Visualizations (Lens)¶

Open Kibana → Dashboards → Create and add panels using the Lens editor.

Recovery Rate — line chart over time¶

Setting	Value
X-axis	`@timestamp` (date histogram, auto / 1 min)
Y-axis	`Sum(count)` where `name: failover.recovery.outcome.total AND outcome: recovered` ÷ `Sum(count)` where `name: failover.recovery.outcome.total`
Break by	`name.keyword`

Ratio formula in Lens Formula layer: count(kql='name: "failover.recovery.outcome.total" AND outcome: "recovered"') / count(kql='name: "failover.recovery.outcome.total"').

Non-Recovery Events — bar chart (alert target)¶

Setting	Value
X-axis	`@timestamp` (date histogram, 5 min)
Y-axis	`Sum(count)` where `name: failover.recovery.outcome.total AND outcome: not_recovered`
Break by	`name.keyword`

Every non-zero bar represents a caller that received no value (user blocked).

Exception Breakdown per Endpoint — grouped bar¶

Setting	Value
X-axis	`name.keyword` (top 10)
Y-axis	`Sum(count)` where `name: failover.exception.total`
Break by	`exception_type.keyword`

Shows which endpoint throws which exception type most — aids root-cause triage without navigating logs.

Async Store Failures — metric (single value)¶

Setting	Value
Value	`Sum(count)` where `name: failover.store.async.failed`
Time range	Last 1 h

Any non-zero value means failover data is silently not being persisted.

Recovery Latency — line chart¶

Micrometer timers emit mean and max per push interval.

Setting	Value
X-axis	`@timestamp`
Y-axis	`Avg(mean)` where `name: failover.operation.duration AND action: recover`

True percentiles

Enable histogram publishing to get p95/p99 bucket fields:

management.elastic.metrics.export.histogramPublish: true

Then use the histogram bucket fields in Kibana TSVB percentile aggregation.

5. Recommended Dashboard Layout¶

┌────────────────────────────────────────────────────────────────────┐
│  Recovery Rate (line, per endpoint)  │  Non-Recovery (bar, 5 min) │
├──────────────────────────────────────┴────────────────────────────┤
│  Exception Breakdown per Endpoint (grouped bar — full width)       │
├──────────────────────┬──────────────────────┬─────────────────────┤
│  Recovery Latency    │  Async Store Failures │  Registered Endpoints │
└──────────────────────┴──────────────────────┴─────────────────────┘

6. Alerting Rules¶

In Kibana → Stack Management → Rules → Create rule → Elasticsearch query:

Non-recovery spike (user-impacting — page):

KQL:   name: "failover.recovery.outcome.total" AND outcome: "not_recovered"
Agg:   Sum(count) over last 5 min
When:  > 0

Async store failure (data-loss risk — page):

KQL:   name: "failover.store.async.failed"
Agg:   Sum(count) over last 5 min
When:  > 0

Recovery rate below threshold (warn):

KQL:   name: "failover.recovery.outcome.total"
Agg:   Sum(count, filter: outcome:"recovered") / Sum(count) < 0.7   (10-min window)
When:  formula result < 0.7

Actuator Health Indicator¶

The FailoverHealthIndicator is registered at /actuator/health:

{
  "status": "UP",
  "components": {
    "failover": {
      "status": "UP",
      "details": {
        "enabled": "true",
        "type": "BASIC",
        "store.type": "JDBC"
      }
    }
  }
}

Structured Logs¶

Level	Event	Logger
`INFO`	Entry stored successfully (referential name only)	`DefaultFailoverHandler`
`INFO`	Entry recovered successfully (referential name only)	`DefaultFailoverHandler`
`DEBUG`	Full `ReferentialPayload` body on store / recover / expired-delete	`DefaultFailoverHandler`
`WARN`	No stored entry found	`DefaultFailoverHandler`
`INFO`	Expiry cleanup executed	`DefaultFailoverHandler`
`ERROR`	Recovery threw unexpected exception	`AdvancedFailoverHandler`

The lifecycle events stay at INFO carrying only the referential name; the full payload body is logged at DEBUG so high-throughput services pay no full-payload serialisation on the INFO path (ADR 48).

Adjust Log Level¶

application.yml

logging:
  level:
    com.societegenerale.failover: INFO    # DEBUG for full payload detail

Custom ObservablePublisher¶

AdvancedFailoverHandler calls ObservablePublisher.publish(Metrics) on every operation. Implement ObservablePublisher to send metrics to a custom sink:

KafkaObservablePublisher.java

@Component
public class KafkaObservablePublisher implements ObservablePublisher {

    @Override
    public void publish(Metrics metrics) {
        kafkaTemplate.send("failover-events", metrics.getName(), metrics.toMap());
    }
}

Next Steps¶

Observability Module — module details and scanner
Scheduler Module — observable report scheduler