Intelligence Blog · Technical
Feb 15, 2026 · 12 min read

AVRS Methodology: How We Calculate AI Vendor Risk Scores

A deep dive into the five factors that make up the AI Vendor Risk Score, with real data and examples from major providers.

Every AI vendor risk score you have seen before this one is either a subjective analyst rating or a single-dimension metric — uptime percentage, or pricing trend, or some editorial sentiment score from a technology publication. None of them tell you what you actually need to know: how likely is this vendor to do something that costs your team real money, and how soon?

The AI Vendor Risk Score (AVRS) is Mardii's answer to that question. It is a composite 0–100 score computed daily from five independently modeled risk dimensions, weighted by their historical predictive value for downstream engineering incidents. Here is exactly how it works.

Why a Composite Score

A vendor can have excellent uptime and terrible pricing stability. A vendor can have stable pricing and aggressive ToS drift that quietly removes capabilities your product depends on. A vendor can score well on every quantitative dimension and still deprecate your model checkpoint with 12 days' notice, because model lifecycle governance is a separate organizational function from infrastructure reliability.

Single-dimension scores do not capture this. A composite score forces every dimension into a unified signal. The weighting decisions — how much each dimension contributes — reflect the relative frequency and severity with which each dimension has historically driven engineering incidents and unplanned costs.
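As a sketch, the weighted combination described here could look like the following. The module weights come from this post; the example scores and the simple linear normalization are illustrative assumptions, not Mardii's published formula:

```python
# Weighted composite of five 0-100 module scores (weights sum to 1.0).
# Weights are the ones stated in this post; everything else is illustrative.
WEIGHTS = {
    "cost_volatility": 0.30,
    "operational_stability": 0.25,
    "policy_drift": 0.20,
    "model_stability": 0.15,
    "ecosystem_fragility": 0.10,
}

def composite_avrs(module_scores: dict) -> float:
    """Combine per-module 0-100 scores into a single 0-100 AVRS."""
    return round(sum(WEIGHTS[m] * module_scores[m] for m in WEIGHTS), 1)

# Hypothetical vendor: strong on cost and uptime, weak on policy drift.
example = {
    "cost_volatility": 81,
    "operational_stability": 90,
    "policy_drift": 55,
    "model_stability": 70,
    "ecosystem_fragility": 60,
}
```

Note how one weak dimension drags the composite down even when the others are strong, which is exactly the behavior a unified signal is meant to force.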

The Five Modules

Module 1: Cost Volatility Index (30%)

The Cost Volatility Index measures pricing stability over a trailing 12-month window. It captures three distinct phenomena: magnitude of price changes, frequency of price changes, and directional consistency. A vendor that raises prices once by 20% scores differently from a vendor that oscillates — cutting prices to gain market share and then raising them after adoption — even if the net change is identical.

Oscillation is treated as higher risk than monotonic change because it signals strategic instability, not just pricing decisions. A vendor whose pricing oscillates is a vendor whose unit economics are not settled, which correlates with a higher probability of future changes that are difficult to forecast.
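A minimal sketch of a metric that treats oscillation as riskier than monotonic change. The penalty constants here are invented for illustration; only the three inputs — magnitude, frequency, and directional consistency — come from the definition above:

```python
def cost_volatility_index(pct_changes: list[float]) -> float:
    """Score pricing stability 0-100 (higher = more stable) from a
    trailing-12-month list of percentage price changes.

    Penalizes total magnitude, change frequency, and sign flips
    (oscillation), with oscillation weighted most heavily.
    """
    if not pct_changes:
        return 100.0
    magnitude = sum(abs(c) for c in pct_changes)  # total % moved
    flips = sum(
        1 for a, b in zip(pct_changes, pct_changes[1:])
        if a * b < 0  # direction reversed between consecutive changes
    )
    penalty = magnitude + 5 * len(pct_changes) + 15 * flips
    return max(0.0, 100.0 - penalty)
```

With these toy constants, a single +20% increase scores better than a -10% cut followed by a +30% raise, even though the net change is similar — the oscillating vendor pays both the frequency and the flip penalty.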

Current CVI scores: OpenAI scores 58 — significant volatility driven by the GPT-4 to GPT-4o transition pricing and the subsequent context window repricing events. Groq scores 81 — the most stable pricing trajectory of any monitored provider.

Module 2: Operational Stability Score (25%)

Operational stability is computed from uptime, latency distribution, and error rate across a trailing 30-day window, sampled every 60 seconds. Raw uptime percentage is not the primary metric — what matters is the statistical distribution of degradation events.

A vendor with 99.9% uptime and two complete outages of 43 minutes each is materially different from a vendor with 99.9% uptime and 864 brief 6-second timeouts distributed across the month. The first vendor will surface clearly in incident reports. The second will silently accumulate retry costs, timeout-related data loss, and user-facing latency spikes that are individually invisible but collectively significant.

Mardii pings each provider's API endpoints every 60 seconds with lightweight diagnostic requests. We track p50, p95, and p99 latency, measure error rate by error code, and score degradation events by duration and recoverability. The result is an operational stability score that reflects what developers actually experience, not what appears on a status page that a human has to remember to update.
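The percentile tracking above can be sketched as follows — a toy nearest-rank implementation over probe samples, not Mardii's production pipeline:

```python
def latency_percentiles(samples_ms: list[float]) -> dict:
    """p50/p95/p99 latency from a window of probe samples (nearest-rank)."""
    s = sorted(samples_ms)

    def pct(p: float) -> float:
        # Nearest-rank: index into the sorted samples, clamped to the end.
        return s[min(len(s) - 1, int(p / 100 * len(s)))]

    return {"p50": pct(50), "p95": pct(95), "p99": pct(99)}

def error_rate(errors: int, total: int) -> float:
    """Fraction of probes that returned an error code."""
    return errors / total if total else 0.0
```

The gap between p50 and p99 is what distinguishes the two 99.9%-uptime vendors described above: the vendor with 864 brief timeouts shows a fat p99 tail while its p50 looks healthy.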

Module 3: Policy Drift Severity Index (20%)

Terms of Service, acceptable use policies, data retention agreements, and API usage policies are hashed daily using SHA-256. When the hash changes, Mardii diffs the full document and classifies each changed clause by its potential impact on developers: Breaking (directly removes or restricts a capability), Major (changes conditions of use), or Minor (clarification with no practical impact).

The Policy Drift Severity Index is a weighted count of policy changes over the trailing 12 months, with Breaking changes weighted 5×, Major 2×, and Minor 0.5×. This captures both the frequency and seriousness of policy evolution.
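The hash-then-weight mechanism fits in a few lines. The severity weights are the ones defined above; the function names and the raw-count formulation are illustrative:

```python
import hashlib

# Weights for classified policy changes, as defined for the index.
SEVERITY_WEIGHTS = {"Breaking": 5.0, "Major": 2.0, "Minor": 0.5}

def doc_fingerprint(text: str) -> str:
    """SHA-256 hash of a policy document; a changed hash triggers a diff."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def policy_drift_raw(classified_changes: list[str]) -> float:
    """Weighted count of trailing-12-month policy changes.

    Each entry is a clause-level classification: "Breaking", "Major",
    or "Minor".
    """
    return sum(SEVERITY_WEIGHTS[c] for c in classified_changes)
```

One Breaking change outweighs two Major changes or ten Minor clarifications, so a vendor's index is dominated by the changes that actually remove capabilities.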

Anthropic currently carries the lowest Policy Drift Severity Index of the monitored providers — not because it rarely changes its policies, but because when it does, the changes tend to be Major rather than Breaking. The December 2025 OAuth restriction was classified as Breaking, which moved its score significantly. OpenAI scores poorly here due to the high frequency of plugin policy and capability restriction updates that accumulated through 2025.

Module 4: Model Stability Score (15%)

Model stability tracks deprecation behavior: how many model versions has this vendor retired in the trailing 12 months, what were the effective notice periods, and how consistent has the vendor been with its stated deprecation policy?

The notice period consistency metric is particularly important. A vendor whose actual notice periods consistently match or exceed its stated policy scores higher than a vendor with a generous stated policy that it frequently violates. Engineering teams plan based on documented policies. Actual behavior is what determines whether that planning is sufficient.
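A hedged sketch of the consistency metric — here simply the fraction of retirements whose actual notice met or exceeded the stated policy; Mardii's exact formula may differ:

```python
def notice_consistency(stated_days: int, actual_notices: list[int]) -> float:
    """Fraction of model retirements whose actual notice period (in days)
    met or exceeded the vendor's stated deprecation policy.

    Returns 1.0 when there were no retirements in the window.
    """
    if not actual_notices:
        return 1.0
    met = sum(1 for days in actual_notices if days >= stated_days)
    return met / len(actual_notices)
```

Under this formulation, a vendor that promises 90 days and delivers 120, 60, and 95 on three retirements scores 2/3 — the generous stated policy does not offset the one violation.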

Mistral currently leads this dimension with the most stable model lifecycle governance — fewer retirements, longer effective notice periods, and high consistency between stated and actual behavior. OpenAI scores lowest, driven by the density of checkpoint-level deprecations across the GPT-4 and GPT-4o model families.

Module 5: Ecosystem Fragility Index (10%)

The Ecosystem Fragility Index measures market concentration risk. A vendor that controls 65% of a specific capability category — such as OpenAI's position in instruction-following general-purpose models — represents a different risk profile from a vendor operating in a competitive segment with three credible substitutes.

This dimension captures something that purely vendor-side metrics miss: your risk is not just a function of what one vendor does, but of how replaceable they are if they do something bad. High market concentration means higher switching friction, which amplifies the impact of every other risk dimension.
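One common way to quantify concentration of this kind is a Herfindahl-style index over market shares — an assumption on my part, since the post does not name the exact measure Mardii uses:

```python
def herfindahl(shares: list[float]) -> float:
    """Herfindahl-Hirschman-style concentration index.

    Takes market shares summing to 1.0 and returns a value between
    ~0 (perfectly competitive) and 1.0 (monopoly).
    """
    return sum(s * s for s in shares)

# A 65%-share leader vs. a segment with three roughly equal substitutes.
concentrated = [0.65, 0.20, 0.15]
competitive = [0.34, 0.33, 0.33]
```

The concentrated market scores markedly higher, which — inverted into a fragility penalty — is exactly the switching-friction amplifier this dimension is meant to capture.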

Interpreting the Score

AVRS is calibrated so that scores above 85 represent vendors with historically low incident rates across all five dimensions — providers where the probability of an unplanned engineering event in any 30-day window, given historical behavior, is below 5%. Scores between 65 and 85 represent vendors with moderate risk in at least one dimension. Scores below 65 indicate elevated risk requiring active monitoring and contingency planning.

Current scores as of February 2026: Groq 91 (Stable), Mistral 84 (Watch), Anthropic 72 (Elevated), Google 61 (Elevated), OpenAI 54 (Elevated). Cohere and Perplexity scores are published on Mardii's provider pages.

The scores update daily. They shift when providers make changes — a deprecation notice drops Model Stability, a ToS amendment affects Policy Drift, an outage affects Operational Stability. They also shift as time passes without incidents, allowing providers to improve their scores through sustained stable behavior.

What AVRS Is Not

AVRS is not a quality score. A vendor with an AVRS of 54 may offer the best model performance for your use case. AVRS measures governance and operational behavior, not model capability. The two dimensions are largely orthogonal — the providers with the highest capability scores tend to score lower on AVRS because they are also the providers moving fastest, making the most changes, and operating with the most complex product surfaces.

AVRS is also not a recommendation to avoid specific vendors. It is a signal that informs how much monitoring, contingency planning, and vendor redundancy is appropriate given your exposure to that vendor. An AVRS of 54 on your primary AI vendor means you should have a fallback plan and real-time monitoring. It does not mean you should stop using them.

Mardii publishes AVRS scores for all six monitored providers at mardii.com/providers. Start free to receive alerts when any score changes significantly.

Get alerts when any of this happens to your stack.

Mardii monitors OpenAI, Anthropic, Google, Mistral, Cohere, and Perplexity in real time.

Start free