Stop cascading outages.
One annotation.
Failover stores every successful response from your referential services and replays the last known-good result when upstream calls fail — transparently, with zero boilerplate.
Replace fragile try/catch with one annotation
Every team reinvents the same resilience wheel. Failover removes it entirely.
public Country findByCode(String code) {
    try {
        Country c = upstream.findByCode(code);
        localRepo.save(c, computeExpiry());
        return c;
    } catch (Exception e) {
        log.warn("upstream failed, trying local cache");
        Country cached = localRepo.findByCode(code);
        if (cached == null || isExpired(cached)) {
            throw e;
        }
        cached.setUpToDate(false);
        return cached;
    }
}

@Failover(
    name = "country-by-code",
    expiryDuration = 24,
    expiryUnit = ChronoUnit.HOURS
)
Country findByCode(String code);

Why referential services need special care
In microservice platforms your application calls services it doesn't own. When those fail, the cascade reaches your users — and there's nothing you can do to fix the upstream.
Three layers of dependency
Most platforms share the same pattern: internal services you control, transversal services owned by other teams, and external services owned by third parties.
- Internal services — full ownership, fast resolution
- Transversal services (R) — managed by other teams, slow escalation path
- External services (E) — third-party SLA, no direct control
- Failures on referential systems cascade through every dependent service
Service dependency model
Cascade failure in practice
One outage cascades to every user
When a transversal or external service fails, the error propagates through every dependent service — returning 500s to users who have no visibility into why.
- Application team has no control over the upstream failure
- Escalation and resolution take hours or days
- Every team reinvents the same fragile try/catch workaround
- End users are fully blocked until the referential system recovers
Failover intercepts — transparently
Failover sits between your service and the referential system. On success it stores the result with a configured TTL. On failure it serves the last known-good value — no 500, no user impact.
Failover in the platform · store flow
Store · intercept · replay · recover flow
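To make the store-and-replay behaviour concrete, here is a minimal plain-Java sketch. It is not Failover's implementation: `ReplaySketch`, its `call` method, and the in-memory map are hypothetical names chosen for illustration, and the clock is passed in explicitly so the expiry logic is easy to follow. The real library does this through the `@Failover` annotation and a pluggable store.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Illustrative sketch only: hypothetical names, not the Failover API.
final class ReplaySketch<V> {

    // A stored value plus the instant from which it must no longer be served.
    private record Entry<T>(T value, Instant expireOn) {}

    private final Map<String, Entry<V>> store = new ConcurrentHashMap<>();
    private final Duration ttl;

    ReplaySketch(Duration ttl) {
        this.ttl = ttl;
    }

    // Calls upstream; stores on success, replays the last unexpired value on failure.
    V call(String key, Supplier<V> upstream, Instant now) {
        try {
            V fresh = upstream.get();
            store.put(key, new Entry<>(fresh, now.plus(ttl)));
            return fresh;
        } catch (RuntimeException e) {
            Entry<V> last = store.get(key);
            if (last == null || !now.isBefore(last.expireOn())) {
                throw e; // nothing usable stored: the failure reaches the caller
            }
            return last.value();
        }
    }
}
```

With a 24-hour TTL, a failing call one hour after a successful one returns the stored value, while a failing call 25 hours later re-throws the upstream exception.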
Users stay unblocked — even during outages
Without Failover a referential failure returns a 500 and blocks the user completely. With Failover the last stored result is served — marked with its cached timestamp, but fully functional.
- Upstream healthy: upToDate=true.
- Upstream down, without Failover: 500 Internal Server Error. Completely blocked.
- Upstream down, with Failover: upToDate=false, asOf set. User continues unblocked.

Built-in metrics: zero extra instrumentation
Every store and recover event emits Micrometer counters automatically. Connect to Elastic, Grafana, or any metrics backend. Three dedicated panels give complete visibility into failure behaviour.
Failover configuration dashboard
| Failover Name | Expiry Duration | Expiry Unit | Failover Type | Store Type |
|---|---|---|---|---|
| country-by-code | 24 | HOURS | basic | jdbc |
| currency-list | 6 | HOURS | basic | caffeine |
| client-profile | 12 | HOURS | resilience | jdbc |
| market-calendar | 7 | DAYS | basic | caffeine |
Total upstream failures intercepted per referential over time.
Failures resolved with a stored result — users unblocked.
Failures with no stored result — actual user impact needing attention.
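The relationship between the three panels can be sketched in plain Java. This is an illustration under assumptions, not Failover's instrumentation: `PanelCounters` and its method names are hypothetical, and `AtomicLong` stands in for the Micrometer counters the library emits.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative stand-in for the three dashboard panels; not the Failover metrics API.
final class PanelCounters {
    private final AtomicLong recovered = new AtomicLong();    // failure served from the store
    private final AtomicLong notRecovered = new AtomicLong(); // failure with no usable entry

    // Records one upstream failure and whether a stored result covered it.
    void onUpstreamFailure(boolean servedFromStore) {
        (servedFromStore ? recovered : notRecovered).incrementAndGet();
    }

    long recovered() {
        return recovered.get();
    }

    long notRecovered() {
        return notRecovered.get();
    }

    // Every intercepted failure is either recovered or not recovered.
    long totalFailures() {
        return recovered.get() + notRecovered.get();
    }

    // Share of failures that never reached the user; 1.0 when there were no failures.
    double recoveryRatio() {
        long total = totalFailures();
        return total == 0 ? 1.0 : (double) recovered.get() / total;
    }
}
```

The third panel is the one to alert on: a non-zero not-recovered count means a user actually saw the failure.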
Everything you need, nothing you don't
Every extension point is a pluggable SPI — swap, extend, or replace any behaviour.
Automatic store on success
Every successful response is persisted under a derived key. No explicit save calls. No repository wiring.
Transparent recovery on failure
When upstream throws, the last stored result for that key is returned. Callers see the exception only when no unexpired entry exists.
Business-configured TTL
Fixed duration, SpEL expressions, or a custom ExpiryPolicy. Expired entries are never served.
Pluggable backing stores
InMemory · Caffeine · JDBC (H2, PostgreSQL, MySQL, Oracle…) · or any custom FailoverStore bean.
Scatter / Gather
Collection-returning methods split into per-entity store entries. Partial recovery handled gracefully.
Multi-tenant isolation
TABLE_PREFIX or SCHEMA strategy routes each request to the correct tenant store automatically.
Async non-blocking writes
Store operations are offloaded to a virtual-thread executor, so writes add no latency to the caller. The read path stays synchronous.
Observable out of the box
Every store/recover event emits structured SLF4J logs and Micrometer counters. No extra instrumentation.
Resilience4j integration
Circuit-breaker wraps upstream calls when type: resilience. Trips fast on repeated failures.
Works with your existing stack
No new runtime dependencies forced on you — every integration is opt-in via the corresponding module.
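One way to picture the Scatter / Gather card above is the sketch below. It shows one plausible reading of "per-entity store entries" and "partial recovery", not the library's internals: `ScatterGatherSketch`, `scatter`, and `gather` are hypothetical names, and expiry is left out to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative sketch of scatter/gather; hypothetical names, not the Failover API.
final class ScatterGatherSketch<K, V> {
    private final Map<K, V> perEntity = new ConcurrentHashMap<>();

    // Scatter: store each element of a successful collection result under its own key.
    void scatter(List<V> result, Function<V, K> idOf) {
        for (V item : result) {
            perEntity.put(idOf.apply(item), item);
        }
    }

    // Gather: rebuild a result for the requested keys from whatever is stored.
    // Keys with no stored entry are skipped, so recovery may be partial.
    List<V> gather(List<K> requested) {
        List<V> recovered = new ArrayList<>();
        for (K key : requested) {
            V item = perEntity.get(key);
            if (item != null) {
                recovered.add(item);
            }
        }
        return recovered;
    }
}
```

Storing per entity means a later request for a different subset of keys can still be served from entries written by earlier calls.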
Dozens of services at Société Générale depend on the same small set of referential systems — currency tables, country lists, client profiles — that change slowly but are queried constantly. A single referential outage cascades into a full-platform incident. Failover was built to break that coupling once, reusably, across every service.
Origins of Failover · See ADR 1 for the founding decision

Module overview
One starter pulls in everything. Pick individual modules when you need fine-grained control.
├── failover-domain                     @Failover annotation · Referential · ReferentialAware · Metadata
├── failover-core                       FailoverHandler · KeyGenerator · ExpiryPolicy · PayloadEnricher · ContextPropagator
├── failover-aspect                     Spring AOP @Around interceptor
├── failover-store-inmemory             ConcurrentHashMap store (dev / test only, not persistent)
├── failover-store-caffeine             Caffeine-backed in-process store
├── failover-store-jdbc                 JDBC store: H2 · PostgreSQL · MySQL · MariaDB · Oracle · SQL Server
├── failover-store-async                non-blocking write decorator (virtual-thread executor)
├── failover-store-multitenant          TABLE_PREFIX / SCHEMA per-tenant routing
├── failover-execution-resilience       Resilience4j circuit-breaker integration
├── failover-scheduler                  expiry-cleanup scheduler · report-publisher scheduler
└── failover-spring-boot-autoconfigure  zero-config Spring Boot auto-configuration assembler
How it works
Spring AOP intercepts every annotated method. The rest is automatic.
Call flow · Entry lifecycle · Sequence
flowchart LR
C([Your Code]) --> A{FailoverAspect}
A -->|invoke| U([Upstream API])
U -- "✅ success" --> ST[Store payload + TTL]
ST --> R1([return upToDate=true])
U -- "❌ failure" --> Q[Query store]
Q -- "fresh entry" --> R2([return upToDate=false])
Q -- "expired / missing" --> R3([re-throw or null])

stateDiagram-v2
direction LR
[*] --> Live : first successful call
Live --> Live : success — TTL refreshed
Live --> Stale : upstream fails, entry still fresh
Stale --> Live : upstream recovers
Stale --> Expired : TTL exceeded
Expired --> [*] : re-throw / null per exception-policy

sequenceDiagram
participant C as Caller
participant A as FailoverAspect
participant H as FailoverHandler
participant K as KeyGenerator
participant E as ExpiryPolicy
participant S as FailoverStore
participant U as Upstream
C->>A: invoke @Failover method(args)
A->>U: call upstream
alt Upstream succeeds
U-->>A: result
A->>K: key(failover, args)
K-->>A: storeKey
A->>E: computeExpiry(failover)
E-->>A: expireOn
A->>S: store(name, key, result, expireOn)
A-->>C: result (upToDate=true)
else Upstream throws
U-->>A: exception
A->>K: key(failover, args)
K-->>A: lookupKey
A->>S: find(name, lookupKey)
alt Found and not expired
S-->>A: ReferentialPayload
A-->>C: payload (upToDate=false, asOf=storedTime)
else Not found or expired
S-->>A: empty / expired
A-->>C: null or rethrow (per ExceptionPolicy)
end
end

On success, the result is persisted under the derived key with the configured TTL, and upToDate=true is set on the returned object.
On failure, the last stored result is returned. If there is none, or it has expired, the exception is re-thrown (default) or null is returned via exception-policy: never_throw.
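The entry lifecycle from the state diagram can be sketched as a pair of pure functions. This is a hedged illustration, not the library's code: `LifecycleSketch`, `State`, `expireOn`, and `classify` are hypothetical names, and the only elements taken from the page are the `expiryDuration` / `expiryUnit` pair and the Live, Stale, and Expired states. The case where nothing was ever stored is not modelled here.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Illustrative sketch of the entry lifecycle; hypothetical names, not the Failover API.
final class LifecycleSketch {
    enum State { LIVE, STALE, EXPIRED }

    // Mirrors expiryDuration / expiryUnit on the annotation: expiry = store time + TTL.
    static Instant expireOn(Instant storedAt, long expiryDuration, ChronoUnit expiryUnit) {
        return storedAt.plus(expiryDuration, expiryUnit);
    }

    // Classifies what the caller sees for one call, following the state diagram.
    static State classify(boolean upstreamSucceeded, Instant expireOn, Instant now) {
        if (upstreamSucceeded) {
            return State.LIVE; // fresh result, TTL refreshed
        }
        return now.isBefore(expireOn) ? State.STALE : State.EXPIRED;
    }

    // upToDate is true only for a live upstream result; a stale replay carries false.
    static boolean upToDate(State state) {
        return state == State.LIVE;
    }
}
```

For the country-by-code example (24 hours), an entry stored at midnight is replayed as stale for any failure before the following midnight and counts as expired from that instant on.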
Where to go next
Quickstart
Working end-to-end example in 5 minutes — one dependency, one annotation, one config block.
Get started →
Installation
Maven and Gradle coordinates for the starter and every individual module.
View dependencies →
Concepts
Store/recover lifecycle, key derivation, expiry policies, scatter/gather internals.
Learn how it works →
Configuration
Every failover.* property with types, defaults, and full examples.
ADR Index
27 architecture decisions: the why behind every design choice in the framework.
Browse decisions →