Failover

☕ Spring Boot 4 · Java 21 · Apache 2.0

Stop cascading outages.
One annotation.

Failover stores every successful response from your referential services and replays the last known-good result when upstream calls fail — transparently, with zero boilerplate.


11 Modules · 27 ADRs · 1 Annotation · 4 Store types · 0 Boilerplate

⚡ The Problem → The Solution

Replace fragile try/catch with one annotation

Every team reinvents the same resilience wheel. Failover removes it entirely.

❌ Without Failover — bespoke, brittle, repeated everywhere

public Country findByCode(String code) {
    try {
        Country c = upstream.findByCode(code);
        localRepo.save(c, computeExpiry());
        return c;
    } catch (Exception e) {
        log.warn("upstream failed, trying local cache");
        Country cached = localRepo.findByCode(code);
        if (cached == null || isExpired(cached)) {
            throw e;
        }
        cached.setUpToDate(false);
        return cached;
    }
}

✅ With Failover — declarative, consistent, zero boilerplate

@Failover(
    name = "country-by-code",
    expiryDuration = 24,
    expiryUnit = ChronoUnit.HOURS
)
Country findByCode(String code);

🌐 Context

Why referential services need special care

In microservice platforms your application calls services it doesn't own. When those fail, the cascade reaches your users — and there's nothing you can do to fix the upstream.

🔗 Service Model

Three layers of dependency

Most platforms share the same pattern: internal services you control, transversal services owned by other teams, and external services owned by third parties.

Internal services — full ownership, fast resolution
Transversal services (R) — managed by other teams, slow escalation path
External services (E) — third-party SLA, no direct control
Failures on referential systems cascade through every dependent service

Service dependency model

🏢 Your Application · your scope · full ownership, fast fix
🔗 Transversal Service (R) · org scope · other teams, slow escalation
🌐 External Service (E) · outside org · third-party SLA, no control

Cascade failure in practice

1. Your App → Transversal Service (org scope · escalate within org)
   🔥 Transversal FAILING → 💥 Your App 500 ERROR → 😱 End Users BLOCKED

2. Your App → External Service (outside org · SLA only · zero control)
   ⚡ External Svc DOWN → 💥 Your App 500 ERROR → 😱 End Users BLOCKED

🔥 The Challenge

One outage cascades to every user

When a transversal or external service fails, the error propagates through every dependent service — returning 500s to users who have no visibility into why.

Application team has no control over the upstream failure
Escalation and resolution take hours or days
Every team reinvents the same fragile try/catch workaround
End users are fully blocked until the referential system recovers

🛡️ The Solution

Failover intercepts — transparently

Failover sits between your service and the referential system. On success it stores the result with a configured TTL. On failure it serves the last known-good value, provided an unexpired entry exists, so the caller gets a usable response instead of a 500.

Failover in the platform · store flow

👥 Caller → request → 🏢 Your Service (@Failover) → upstream call ✅ → ☁️ Upstream UP
🔌 Failover Interceptor (FailoverAspect) → 💾 Failover Store: store(payload)
← { isUpToDate: true }

Store · intercept · replay · recover flow

👥 Caller → request → 🏢 Your Service (@Failover) → upstream call ✗ → ☁️ Upstream DOWN
🔌 Failover Interceptor (catches exception) → 💾 Failover Store: find(key) → last stored
← { isUpToDate: false, asOf: "prev timestamp" }

👥 User Impact

Users stay unblocked — even during outages

Without Failover a referential failure returns a 500 and blocks the user completely. With Failover the last stored result is served — marked with its cached timestamp, but fully functional.

✅

All services available

Upstream responds. Failover stores the result. upToDate=true.

No impact

❌

Upstream fails — no Failover

Exception propagates. User receives a 500 Internal Server Error. Completely blocked.

User blocked

🛡️

Upstream fails — with Failover

Last stored result returned. upToDate=false, asOf set. User continues unblocked.

User unblocked

📡 Observability

Built-in metrics — zero extra instrumentation

Every store and recover event emits Micrometer counters automatically. Connect to Elastic, Grafana, or any metrics backend. Three dedicated panels give complete visibility into failure behaviour.

Failover configuration dashboard

Active failover configurations (4 active)

Failover Name	Expiry Duration	Expiry Unit	Failover Type	Store Type
`country-by-code`	24	HOURS	basic	jdbc
`currency-list`	6	HOURS	basic	caffeine
`client-profile`	12	HOURS	resilience	jdbc
`market-calendar`	7	DAYS	basic	caffeine

Failover rate

Total upstream failures intercepted per referential over time.

Recovery rate

Failures resolved with a stored result — users unblocked.

Non-recovery rate

Failures with no stored result — actual user impact needing attention.
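
The three panels relate to each other in a simple way: every intercepted failure is either recovered or not. The sketch below illustrates that bookkeeping with plain `LongAdder` counters standing in for Micrometer counters. The class and method names are hypothetical and are not taken from the library.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for the metrics behind the three panels.
// Plain LongAdder counters are used here instead of Micrometer counters.
final class FailoverCountersSketch {
    private final LongAdder failures = new LongAdder();
    private final LongAdder recovered = new LongAdder();

    // Called once per intercepted upstream failure.
    void onUpstreamFailure(boolean servedFromStore) {
        failures.increment();
        if (servedFromStore) {
            recovered.increment();
        }
    }

    // Failover panel: all intercepted upstream failures.
    long failovers() { return failures.sum(); }

    // Recovery panel: failures answered from the store.
    long recoveries() { return recovered.sum(); }

    // Non-recovery panel: failures that reached the user.
    long nonRecoveries() { return failures.sum() - recovered.sum(); }
}
```

Under this reading, a rising non-recovery count is the signal that deserves an alert, because it counts the failures users actually felt.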

✨ Capabilities

Everything you need, nothing you don't

Every extension point is a pluggable SPI — swap, extend, or replace any behaviour.

💾

Automatic store on success

Every successful response is persisted under a derived key. No explicit save calls. No repository wiring.
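
One plausible way to derive such a key is to combine the failover name with the method arguments. The sketch below is an illustration only: `SimpleKeyGeneratorSketch` and its `key` method are hypothetical, and the library's own `KeyGenerator` may work differently.

```java
import java.util.Arrays;

// Hypothetical helper, not the library's KeyGenerator:
// derives a store key from the failover name and the method arguments.
final class SimpleKeyGeneratorSketch {
    static String key(String failoverName, Object... args) {
        // deepToString renders nested arrays by content,
        // so equal arguments always yield equal keys.
        return failoverName + ":" + Arrays.deepToString(args);
    }

    public static void main(String[] cliArgs) {
        System.out.println(key("country-by-code", "FR")); // prints country-by-code:[FR]
    }
}
```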

🔄

Transparent recovery on failure

When upstream throws, the last stored result for that key is returned. As long as an unexpired entry exists, callers never see the exception.

⏱️

Business-configured TTL

Fixed duration, SpEL expressions, or a custom ExpiryPolicy. Expired entries are never served.
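
A fixed-duration policy can be pictured as a function from the store time to an expiry instant. The following is a minimal sketch under that assumption. `ExpirySketch` is a hypothetical interface, and the real `ExpiryPolicy` SPI may have a different shape.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical shape; the library's ExpiryPolicy SPI may differ.
@FunctionalInterface
interface ExpirySketch {
    Instant expireOn(Instant storedAt);

    // Fixed-duration policy, mirroring expiryDuration / expiryUnit on the annotation.
    static ExpirySketch fixed(long duration, ChronoUnit unit) {
        return storedAt -> storedAt.plus(duration, unit);
    }

    // An entry is served only while "now" is strictly before its expiry instant.
    static boolean isServable(Instant expireOn, Instant now) {
        return now.isBefore(expireOn);
    }
}
```

With `fixed(24, ChronoUnit.HOURS)`, an entry stored at midnight stops being served at midnight the next day.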

🗄️

Pluggable backing stores

InMemory · Caffeine · JDBC (H2, PostgreSQL, MySQL, Oracle…) · or any custom FailoverStore bean.
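
To give a feel for what a custom store involves, here is a small map-backed sketch. The `StoreSketch` interface is hypothetical. It is loosely modelled on the `store(...)` and `find(...)` calls shown in the sequence diagram on this page, and the actual `FailoverStore` contract may differ.

```java
import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical contract, not the library's FailoverStore interface.
interface StoreSketch {
    void store(String name, String key, Object payload, Instant expireOn);
    Optional<Object> find(String name, String key, Instant now);
}

// Map-backed implementation: not persistent, suitable for illustration only.
final class MapStoreSketch implements StoreSketch {
    private record Entry(Object payload, Instant expireOn) {}

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    @Override
    public void store(String name, String key, Object payload, Instant expireOn) {
        entries.put(name + "/" + key, new Entry(payload, expireOn));
    }

    @Override
    public Optional<Object> find(String name, String key, Instant now) {
        Entry entry = entries.get(name + "/" + key);
        // Expired entries are treated exactly like missing ones.
        if (entry == null || !now.isBefore(entry.expireOn())) {
            return Optional.empty();
        }
        return Optional.ofNullable(entry.payload());
    }
}
```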

🧩

Scatter / Gather

Collection-returning methods split into per-entity store entries. Partial recovery handled gracefully.
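
The idea can be sketched as two steps: scatter a successful collection into one entry per element, then gather whatever entries exist when the upstream fails. The class below is a hypothetical illustration. How the library identifies entities and what it does with ids that cannot be recovered are not specified here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of scatter / gather, not the library's implementation.
final class ScatterGatherSketch<K, V> {
    private final Map<K, V> perEntity = new HashMap<>();
    private final Function<V, K> idOf;

    ScatterGatherSketch(Function<V, K> idOf) {
        this.idOf = idOf;
    }

    // Scatter: one store entry per element of a successful response.
    void scatter(List<V> result) {
        for (V value : result) {
            perEntity.put(idOf.apply(value), value);
        }
    }

    // Gather: rebuild as much of the collection as the store can supply.
    // Ids without a stored entry are skipped in this sketch (partial recovery).
    List<V> gather(List<K> requestedIds) {
        List<V> recovered = new ArrayList<>();
        for (K id : requestedIds) {
            V value = perEntity.get(id);
            if (value != null) {
                recovered.add(value);
            }
        }
        return recovered;
    }
}
```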

🏢

Multi-tenant isolation

TABLE_PREFIX or SCHEMA strategy routes each request to the correct tenant store automatically.
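
As a rough picture of the two strategies, the sketch below resolves a tenant-specific table name. The naming scheme (`tenant_table` versus `tenant.table`), the class name, and the base table name used in the usage note are assumptions made for illustration.

```java
// Hypothetical routing helper; the library's actual naming scheme may differ.
final class TenantRoutingSketch {
    enum Strategy { TABLE_PREFIX, SCHEMA }

    static String qualifiedTable(Strategy strategy, String tenant, String baseTable) {
        // The tenant id ends up in SQL, so only simple identifiers are accepted.
        if (!tenant.matches("[A-Za-z0-9_]+")) {
            throw new IllegalArgumentException("invalid tenant id: " + tenant);
        }
        return switch (strategy) {
            case TABLE_PREFIX -> tenant + "_" + baseTable; // one table per tenant
            case SCHEMA -> tenant + "." + baseTable;       // one schema per tenant
        };
    }
}
```

For example, `qualifiedTable(Strategy.SCHEMA, "acme", "failover_store")` yields `acme.failover_store` in this sketch.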

⚡

Async non-blocking writes

Store operations are offloaded to a virtual-thread executor, so writes add no latency to the caller's request. The read path stays synchronous.
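
The decorator idea can be sketched with the standard Java 21 virtual-thread executor. `AsyncWriteSketch` is a hypothetical class wrapping a plain map. It is not the `failover-store-async` module itself.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical decorator: writes run on virtual threads, reads stay on the caller's thread.
final class AsyncWriteSketch implements AutoCloseable {
    private final Map<String, Object> delegate = new ConcurrentHashMap<>();
    private final ExecutorService writer = Executors.newVirtualThreadPerTaskExecutor();

    // Write path: hand the put to a virtual thread and return immediately.
    void store(String key, Object payload) {
        writer.execute(() -> delegate.put(key, payload));
    }

    // Read path: synchronous lookup; may miss a write that is still in flight.
    Object find(String key) {
        return delegate.get(key);
    }

    // Stop accepting writes and wait briefly for pending ones to finish.
    @Override
    public void close() {
        writer.shutdown();
        try {
            writer.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The trade-off in a design like this is that a read issued right after a write may not see it yet, which is usually acceptable for a fallback store.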

📊

Observable out of the box

Every store/recover event emits structured SLF4J logs and Micrometer counters. No extra instrumentation.

🔌

Resilience4j integration

A circuit breaker wraps upstream calls when type: resilience is set. It trips on repeated failures so that subsequent calls fail fast.
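
For readers new to the pattern, the sketch below shows the bare mechanics of tripping after repeated failures. It deliberately does not use the Resilience4j API. `TinyBreakerSketch` is a hypothetical, single-threaded toy with no half-open state or automatic reset.

```java
import java.util.function.Supplier;

// Toy circuit breaker for illustration only; not Resilience4j and not thread-safe.
final class TinyBreakerSketch {
    private final int threshold;
    private int consecutiveFailures;

    TinyBreakerSketch(int threshold) {
        this.threshold = threshold;
    }

    boolean isOpen() {
        return consecutiveFailures >= threshold;
    }

    <T> T call(Supplier<T> upstream) {
        // Once open, fail fast without touching the upstream at all.
        if (isOpen()) {
            throw new IllegalStateException("circuit open: upstream call skipped");
        }
        try {
            T result = upstream.get();
            consecutiveFailures = 0; // any success resets the failure streak
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            throw e;
        }
    }
}
```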

🔌 Integrations

Works with your existing stack

No new runtime dependencies forced on you — every integration is opt-in via the corresponding module.

Spring Boot 4.x · Spring AOP · Spring Cloud OpenFeign · Resilience4j · Micrometer · Caffeine Cache · JDBC / H2 / PostgreSQL / MySQL / Oracle · SLF4J / Logback · Virtual Threads (Java 21)

Dozens of services at Société Générale depend on the same small set of referential systems — currency tables, country lists, client profiles — that change slowly but are queried constantly. A single referential outage cascades into a full-platform incident. Failover was built to break that coupling once, reusably, across every service.

— Origins of Failover · See ADR 1 for the founding decision

🏗️ Architecture

Module overview

One starter pulls in everything. Pick individual modules when you need fine-grained control.

failover-spring-boot-starter            ← the only dependency you need
├── failover-domain                     @Failover annotation · Referential · ReferentialAware · Metadata
├── failover-core                       FailoverHandler · KeyGenerator · ExpiryPolicy · PayloadEnricher · ContextPropagator
├── failover-aspect                     Spring AOP @Around interceptor
├── failover-store-inmemory             ConcurrentHashMap store (dev / test only, not persistent)
├── failover-store-caffeine             Caffeine-backed in-process store
├── failover-store-jdbc                 JDBC store: H2 · PostgreSQL · MySQL · MariaDB · Oracle · SQL Server
├── failover-store-async                non-blocking write decorator (virtual-thread executor)
├── failover-store-multitenant          TABLE_PREFIX / SCHEMA per-tenant routing
├── failover-execution-resilience       Resilience4j circuit-breaker integration
├── failover-scheduler                  expiry-cleanup scheduler · report-publisher scheduler
└── failover-spring-boot-autoconfigure  zero-config Spring Boot auto-configuration assembler

⚙️ Internals

How it works

Spring AOP intercepts every annotated method. The rest is automatic.

Call flow · Entry lifecycle · Sequence

flowchart LR
    C([Your Code]) --> A{FailoverAspect}
    A -->|invoke| U([Upstream API])
    U -- "✅ success" --> ST[Store payload + TTL]
    ST --> R1([return upToDate=true])
    U -- "❌ failure" --> Q[Query store]
    Q -- "fresh entry" --> R2([return upToDate=false])
    Q -- "expired / missing" --> R3([re-throw or null])

stateDiagram-v2
    direction LR
    [*]     --> Live    : first successful call
    Live    --> Live    : success — TTL refreshed
    Live    --> Stale   : upstream fails, entry still fresh
    Stale   --> Live    : upstream recovers
    Stale   --> Expired : TTL exceeded
    Expired --> [*]     : re-throw / null per exception-policy

sequenceDiagram
    participant C as Caller
    participant A as FailoverAspect
    participant H as FailoverHandler
    participant K as KeyGenerator
    participant E as ExpiryPolicy
    participant S as FailoverStore
    participant U as Upstream

    C->>A: invoke @Failover method(args)
    A->>U: call upstream
    alt Upstream succeeds
        U-->>A: result
        A->>K: key(failover, args)
        K-->>A: storeKey
        A->>E: computeExpiry(failover)
        E-->>A: expireOn
        A->>S: store(name, key, result, expireOn)
        A-->>C: result (upToDate=true)
    else Upstream throws
        U-->>A: exception
        A->>K: key(failover, args)
        K-->>A: lookupKey
        A->>S: find(name, lookupKey)
        alt Found and not expired
            S-->>A: ReferentialPayload
            A-->>C: payload (upToDate=false, asOf=storedTime)
        else Not found or expired
            S-->>A: empty / expired
            A-->>C: null or rethrow (per ExceptionPolicy)
        end
    end

On success — result persisted under the derived key with the configured TTL. upToDate=true set on the returned object.

On failure — last stored result returned. If none or expired: re-throw (default) or return null via exception-policy: never_throw.
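
The two paths above can be condensed into a small plain-Java sketch. Everything in it is an assumption made for illustration: the `FailoverFlowSketch` class, the `Outcome` wrapper (the library sets `upToDate` on the returned object instead), and the `ExceptionPolicy` constant names. The clock is passed in explicitly so the behaviour is easy to follow.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical condensation of the store / recover flow; not the library's FailoverAspect.
final class FailoverFlowSketch {
    enum ExceptionPolicy { RETHROW, NEVER_THROW }

    record Stored<T>(T value, Instant storedAt, Instant expireOn) {}
    record Outcome<T>(T value, boolean upToDate, Instant asOf) {}

    private final Map<String, Stored<?>> store = new ConcurrentHashMap<>();

    <T> Outcome<T> call(String key, Supplier<T> upstream, Duration ttl,
                        ExceptionPolicy policy, Instant now) {
        try {
            // Success path: persist the fresh value with its TTL.
            T fresh = upstream.get();
            store.put(key, new Stored<>(fresh, now, now.plus(ttl)));
            return new Outcome<>(fresh, true, now);
        } catch (RuntimeException failure) {
            // Failure path: serve the last stored value if it has not expired.
            @SuppressWarnings("unchecked")
            Stored<T> last = (Stored<T>) store.get(key);
            if (last != null && now.isBefore(last.expireOn())) {
                return new Outcome<>(last.value(), false, last.storedAt());
            }
            // Nothing usable in the store: apply the exception policy.
            if (policy == ExceptionPolicy.NEVER_THROW) {
                return null;
            }
            throw failure;
        }
    }
}
```

With a 24-hour TTL, a failure one hour after the last success returns the stored value with `upToDate=false`. A failure 25 hours later either rethrows or returns null, depending on the policy.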

📚 Docs

Where to go next

🚀

Quickstart

Working end-to-end example in 5 minutes — one dependency, one annotation, one config block.

Get started →

📦

Installation

Maven and Gradle coordinates for the starter and every individual module.

View dependencies →

🧠

Concepts

Store/recover lifecycle, key derivation, expiry policies, scatter/gather internals.

Learn how it works →

⚙️

Configuration

Every failover.* property with types, defaults, and full examples.

Browse properties →

🏗️

ADR Index

27 architecture decisions — the why behind every design choice in the framework.

Browse decisions →

🤝

Contributing

Bug reports, feature proposals, pull requests — all welcome.

How to contribute →
