In 2025, every major AI provider made at least one change that broke something in production for someone. OpenAI retired model checkpoints with compressed notice windows. Anthropic restricted OAuth for third-party tools overnight. Google cut Gemini rate limits by 97% without announcement. Mistral updated data retention policies with implications for GDPR-sensitive workloads.
The teams that navigated these events without a crisis shared one characteristic: they had built their AI integrations to assume failure. Not failure of their own code — failure of the vendor. This guide is a practical architecture reference for doing the same thing.
The Three Failure Modes to Design Against
Before you design a redundancy architecture, you need to know what you are defending against. AI vendor failures fall into three distinct categories, each requiring a different architectural response.
Operational Failures
Operational failures are temporary: outages, elevated error rates, latency spikes. They resolve themselves, usually within hours. The design pattern for operational failures is request routing with automatic failover — when primary vendor requests fail, route to a secondary vendor. This is the most commonly implemented form of AI redundancy, and the easiest to build.
Structural Changes
Structural changes are permanent: price increases, rate limit reductions, capability restrictions. They do not resolve themselves. The design pattern for structural changes is vendor abstraction — your application code should never reference a specific vendor directly. It should reference a capability (text generation, embedding, classification) that is fulfilled by a vendor behind an abstraction layer you control.
Model Deprecations
Model deprecations are structural changes with a deadline. They require a planned migration, but the deadline creates urgency that does not exist with open-ended structural changes. The design pattern for model deprecations is lifecycle tracking — knowing exactly which model versions you depend on in production and monitoring those versions for deprecation notices with enough lead time to migrate before the crisis window.
The Abstraction Layer Pattern
The most important architectural decision you can make for AI vendor resilience is to never let provider-specific API schemas touch your application logic directly. Every production AI integration should go through a capability abstraction layer: a thin internal API that your application calls, which translates to provider-specific calls behind the scenes.
This pattern has three components. First, a unified request schema that your application uses — a simple interface like generate(prompt, options) that accepts parameters relevant to your use case without referencing any provider. Second, a provider adapter for each vendor you support that translates the unified schema into provider-specific API calls and normalizes the response back to a consistent format. Third, a routing layer that selects which adapter to use based on current availability, cost, and configured preferences.
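The three components above can be sketched in a few dozen lines of Python. This is a minimal illustration, not a production implementation: the adapter here is a stub with a hypothetical name, and a real one would call the vendor's SDK and map the unified fields onto that vendor's parameter names.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class GenerateRequest:
    """Unified request schema: no provider-specific fields."""
    prompt: str
    max_tokens: int = 256


@dataclass
class GenerateResponse:
    """Normalized response shape, identical regardless of provider."""
    text: str
    provider: str


class ProviderAdapter(Protocol):
    name: str
    def generate(self, request: GenerateRequest) -> GenerateResponse: ...


class StubOpenAIAdapter:
    """Hypothetical adapter. A real adapter would invoke the vendor SDK
    and translate unified fields to provider-specific parameters."""
    name = "openai"

    def generate(self, request: GenerateRequest) -> GenerateResponse:
        # Simulated provider response; a real call happens here.
        raw = {"choices": [{"text": f"stub completion for: {request.prompt}"}]}
        return GenerateResponse(text=raw["choices"][0]["text"], provider=self.name)


class Router:
    """Selects an adapter. Selection policy lives here, not in app code."""
    def __init__(self, adapters):
        self.adapters = adapters

    def generate(self, request: GenerateRequest) -> GenerateResponse:
        # Trivial policy for the sketch: first configured adapter wins.
        return self.adapters[0].generate(request)


router = Router([StubOpenAIAdapter()])
response = router.generate(GenerateRequest(prompt="hello"))
```

Application code only ever sees Router and the two dataclasses; swapping vendors means adding an adapter and changing configuration, not touching call sites.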
Open-source tools like LiteLLM implement this pattern for the most common providers, giving you a single OpenAI-compatible interface that routes to over 100 provider models. For teams that need more control or have non-standard requirements, building the abstraction layer internally is a one-time cost that pays for itself the first time you need to switch providers quickly.
Selecting Your Redundancy Tier
Tier 1: Active-Passive Failover
Active-passive is the simplest and most common approach. One primary vendor handles all traffic. When requests to the primary fail (connection error, rate limit error, timeout), the routing layer automatically retries against a secondary vendor. The secondary is warm but idle — no traffic under normal conditions.
Active-passive handles operational failures well. It does not help with structural changes, because when your primary vendor raises prices by 40%, passive failover does not activate. For teams with limited tolerance for operational complexity, active-passive is the right starting point — implemented correctly, it eliminates the most acute failure mode.
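The failover logic itself can be very small. A minimal sketch, assuming hypothetical primary and secondary callables and a ProviderError raised by the adapter layer on connection errors, rate limits, or timeouts:

```python
class ProviderError(Exception):
    """Raised by an adapter on connection error, rate limit, or timeout."""


def generate_with_failover(prompt, primary, secondary):
    """Active-passive: all traffic goes to the primary; the secondary
    is invoked only when the primary call raises a ProviderError."""
    try:
        return primary(prompt)
    except ProviderError:
        # In production, emit a metric here so failovers are never silent.
        return secondary(prompt)


def flaky_primary(prompt):
    # Simulates a degraded primary vendor.
    raise ProviderError("rate limited")


def warm_secondary(prompt):
    return f"secondary handled: {prompt}"


result = generate_with_failover("classify this ticket", flaky_primary, warm_secondary)
```

Real implementations add retries with backoff against the primary before failing over, and distinguish retryable errors (timeouts, 429s) from non-retryable ones (authentication failures).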
Tier 2: Active-Active Load Distribution
Active-active routes live traffic to multiple vendors simultaneously, distributing load by a configured ratio. Under normal conditions, you might route 70% of traffic to your primary vendor and 30% to a secondary. When the primary degrades, the routing ratio shifts dynamically.
Active-active provides operational redundancy without the cold-start problem of passive failover. It also gives you live behavioral data on secondary vendors — you know how Anthropic performs for your specific workload because you are regularly sending it 30% of real traffic, not just synthetic test requests. This makes structural change migrations faster because your secondary vendor integration is always production-tested.
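Weighted selection and dynamic ratio shifting can both be sketched simply. The provider names and the 70/30 split below are illustrative, not prescriptive:

```python
import random


def pick_provider(weights, rng=random.random):
    """Choose a provider by configured traffic ratio.
    weights: mapping like {"primary": 0.7, "secondary": 0.3}."""
    r = rng()
    cumulative = 0.0
    for name, share in weights.items():
        cumulative += share
        if r < cumulative:
            return name
    return name  # guard against floating-point rounding at the top edge


def degrade(weights, provider, floor=0.05):
    """Shift traffic away from a degraded provider, keeping a small
    floor of traffic on it so recovery can be detected."""
    shifted = dict(weights)
    freed = shifted[provider] - floor
    shifted[provider] = floor
    others = [n for n in shifted if n != provider]
    for n in others:
        shifted[n] += freed / len(others)
    return shifted
```

Keeping a floor of traffic on the degraded provider is a deliberate choice: it lets the routing layer observe when the provider recovers without manual intervention.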
Tier 3: Capability-Based Routing
Capability-based routing matches each request type to the vendor that performs best on that task. Code generation goes to the vendor with the highest score on your coding evaluation suite. Document summarization goes to the vendor with the lowest token costs for long-context inputs. Customer support classification goes to the vendor with the lowest latency at p95.
Capability-based routing is the most complex to implement and operate, but it produces the best combination of performance, cost, and resilience. It also distributes vendor concentration risk by design — no single vendor handles your entire AI workload, which limits the blast radius of any individual vendor incident.
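At its core, capability-based routing is a lookup from task type to the vendor that measured best on that task. The table below uses hypothetical vendor names; in practice the entries are produced by your evaluation suite and cost/latency telemetry, not hand-written:

```python
# Hypothetical routing table: task type -> provider, chosen by measured
# performance (eval score, token cost, or p95 latency) for that task.
ROUTING_TABLE = {
    "code_generation": "vendor_a",         # top score on internal coding evals
    "summarization": "vendor_b",           # cheapest long-context tokens
    "support_classification": "vendor_c",  # lowest p95 latency
}


def route_by_capability(task_type, table=ROUTING_TABLE, fallback="vendor_a"):
    """Return the provider measured to perform best for this task type;
    unknown task types fall back to a configured default."""
    return table.get(task_type, fallback)
```

The operational complexity is not in this lookup but in keeping the table honest: re-running evaluations as vendors ship new model versions, and layering the failover logic from Tier 1 on top of each entry.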
The Model Version Registry
Regardless of which redundancy tier you implement, one practice is non-negotiable: maintain a model version registry. This is a document or data store that records every model version currently running in production across every environment, the date that version was deployed, the vendor's stated deprecation date if known, and the migration path to the replacement version.
The model version registry is what transforms a deprecation notice from a crisis into a planned work item. When Mardii detects that a model version you depend on has been given a retirement date, you want to be able to answer immediately: which of our environments depend on this version, what is the replacement, and how much migration work is involved? The registry gives you that answer in minutes rather than hours.
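A registry can start as something this small. The model names, dates, and 90-day lead time below are illustrative assumptions:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class RegistryEntry:
    """One production dependency on a specific model version."""
    model_version: str
    environment: str
    deployed: date
    deprecation_date: Optional[date] = None  # vendor's stated retirement, if any
    replacement: Optional[str] = None        # migration target, if known


def entries_at_risk(registry, today, lead_time_days=90):
    """Entries whose announced retirement falls inside the migration window."""
    return [
        e for e in registry
        if e.deprecation_date is not None
        and (e.deprecation_date - today).days <= lead_time_days
    ]


registry = [
    RegistryEntry("model-x-0613", "prod", date(2024, 7, 1),
                  deprecation_date=date(2025, 9, 1),
                  replacement="model-x-1106"),
    RegistryEntry("embed-v2", "staging", date(2025, 1, 15)),
]

at_risk = entries_at_risk(registry, today=date(2025, 7, 1))
```

Querying this structure answers the three crisis questions (which environments, what replacement, how much work) in one pass, which is the whole point of maintaining it.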
The Monitoring Foundation
Architecture decisions get you to the starting line. Monitoring keeps you in the race. A well-architected redundancy system with no monitoring is a system that silently degrades — failing over to secondary vendors without anyone noticing, accumulating technical debt and cost without triggering any alerts.
Effective AI vendor monitoring covers four layers: API health (are requests succeeding?), behavioral consistency (are response quality and format stable?), cost tracking (is spend trending as expected?), and policy surveillance (have terms or pricing or model lifecycle policies changed?).
You can instrument the first three layers with standard observability tools. The fourth requires monitoring that operates outside your application — watching vendor documentation, Terms of Service, pricing pages, and developer communications for changes that your application will not detect until it breaks.
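As one concrete slice of the first layer, here is a minimal sketch of a sliding-window failover alert — the kind of check that prevents the silent-degradation problem described above. The window and threshold values are placeholder assumptions to tune against your traffic:

```python
from collections import deque


class FailoverMonitor:
    """Counts failover events in a sliding time window and flags when
    the rate suggests the system is quietly leaning on a secondary vendor."""

    def __init__(self, window_seconds=3600, threshold=10):
        self.window = window_seconds
        self.threshold = threshold
        self.events = deque()  # timestamps of recent failovers

    def record(self, timestamp):
        """Record one failover event; return True if the alert
        threshold is exceeded within the current window."""
        self.events.append(timestamp)
        # Evict events that have aged out of the window.
        while self.events and self.events[0] <= timestamp - self.window:
            self.events.popleft()
        return len(self.events) > self.threshold
```

Wired into the failover path, this turns "we have been running on our secondary vendor for three days" from an accidental discovery into an alert within minutes of the pattern starting.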
Mardii handles the fourth layer across OpenAI, Anthropic, Google, Mistral, Cohere, and Perplexity — 24 hours a day, classified by severity and sent to you within minutes of detection. Start free at mardii.com.