Evaluation is the foundation of the RAIL Score system. Every other feature depends on evaluation scores.
API endpoint: POST /railscore/v1/eval | Python: client.eval() | JavaScript: client.eval()

The 8 RAIL dimensions

Dimension | What it measures
Fairness | Equitable treatment across demographics. No bias or stereotyping.
Safety | Absence of harmful, toxic, or dangerous content.
Reliability | Factual accuracy, internal consistency, appropriate calibration.
Transparency | Clear communication of limitations, reasoning, and uncertainty.
Privacy | Protection of personal information and data minimization.
Accountability | Traceable reasoning, stated assumptions, error acknowledgment.
Inclusivity | Inclusive language, accessibility, cultural awareness.
User Impact | Positive value delivered at the right detail level and tone.
For the full definition of each dimension, its score anchors, and worked examples, see The RAIL Framework.

Basic, deep, and auto modes

All modes score the same 8 dimensions and return the same overall RAIL score. They differ in how deep the analysis goes and in how much detail comes back.
Basic mode uses RAIL’s core scoring models. It is fast (typically under a second) and built for real-time scoring in production. Returns: overall score, per-dimension scores, and confidence values.
result = client.eval(content="Your text here", mode="basic")
# result.rail_score.score       -> 7.6
# result.dimension_scores       -> {fairness: {score: 7.7, confidence: 0.84}, ...}
Which to use: reach for basic when you score on the hot path of a production request and want a fast verdict. Reach for deep when you need to show a reviewer why something scored low, or when you are debugging and tuning a policy; it returns explanations and issue tags. Reach for auto when you want basic’s speed on most traffic and automatic deep analysis on the content that needs it, without deciding upfront.
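
The guidance above can be written down as a small decision helper. This is only a sketch: choose_mode and its two flags are hypothetical names, not part of the SDK. Only the mode strings "basic", "deep", and "auto" come from this page.

```python
def choose_mode(needs_explanations: bool, latency_sensitive: bool) -> str:
    """Pick an evaluation mode following the guidance on this page.

    Hypothetical helper for illustration; not part of the RAIL SDK.
    """
    if needs_explanations and not latency_sensitive:
        # Reviewer-facing or debugging work: explanations and issue tags matter.
        return "deep"
    if latency_sensitive and not needs_explanations:
        # Hot path of a production request: a fast verdict is enough.
        return "basic"
    # Mixed or undecided needs: let the service choose per request.
    return "auto"


mode = choose_mode(needs_explanations=False, latency_sensitive=True)
```

The returned string can then be passed as the mode argument, as in client.eval(content=..., mode=mode).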

The response

Every evaluation returns:
  • rail_score — the overall score (0–10), its confidence, and a one-line summary.
  • dimension_scores — a score and confidence for each of the 8 dimensions. In deep mode each also carries an explanation and issues (and suggestions when requested).
  • policy_outcome — how your application’s policy judged the result.
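
As an illustration of working with dimension_scores, the sketch below lists the dimensions that fall under a threshold, lowest first. It assumes dimension_scores behaves like a dict of dicts, as the comment in the basic example suggests; the real SDK may expose objects with attributes instead. The helper name, the 5.0 threshold, and all sample values except the fairness entry are made up for the example.

```python
def flag_low_dimensions(dimension_scores: dict, threshold: float = 5.0) -> list:
    """Return (name, score) pairs that score below threshold, lowest first.

    Hypothetical helper; assumes each entry is a dict with a "score" key.
    """
    low = [
        (name, entry["score"])
        for name, entry in dimension_scores.items()
        if entry["score"] < threshold
    ]
    return sorted(low, key=lambda pair: pair[1])


# Stand-in data shaped like the dimension_scores field described above.
sample_scores = {
    "fairness": {"score": 7.7, "confidence": 0.84},
    "safety": {"score": 4.2, "confidence": 0.91},   # illustrative value
    "privacy": {"score": 3.1, "confidence": 0.77},  # illustrative value
}

flagged = flag_low_dimensions(sample_scores, threshold=5.0)
```

With the stand-in data, flagged contains the privacy and safety entries in that order.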

Selective dimensions

To score only a subset of the dimensions, pass their keys in dimensions:
result = client.eval(
    content="Your text here",
    mode="basic",
    dimensions=["safety", "privacy", "reliability"],
)
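
The dimension keys in the call above are lowercase, and the weights example on this page spells User Impact as user_impact. A small sketch of turning display names into those keys before a request follows. normalize_dimensions is a hypothetical helper, and how the API itself treats unknown or duplicate names is not covered here.

```python
# Keys for the 8 RAIL dimensions, spelled as they appear in this page's examples.
RAIL_DIMENSIONS = (
    "fairness", "safety", "reliability", "transparency",
    "privacy", "accountability", "inclusivity", "user_impact",
)


def normalize_dimensions(names):
    """Convert display names to keys and reject anything that is not a RAIL dimension.

    Hypothetical helper; not part of the RAIL SDK.
    """
    keys = []
    for name in names:
        key = name.strip().lower().replace(" ", "_")
        if key not in RAIL_DIMENSIONS:
            raise ValueError(f"unknown RAIL dimension: {name!r}")
        if key not in keys:  # drop duplicates while keeping the caller's order
            keys.append(key)
    return keys


selected = normalize_dimensions(["Safety", "User Impact", "safety"])
```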

Custom weights

Weights must sum to 100:
result = client.eval(
    content="Patient should take 500mg ibuprofen every 4 hours.",
    mode="deep",
    domain="healthcare",
    weights={
        "safety": 25, "privacy": 20, "reliability": 20,
        "accountability": 15, "transparency": 10,
        "fairness": 5, "inclusivity": 3, "user_impact": 2,
    },
)
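
The sum-to-100 rule can be checked locally before a request is sent. The sketch below uses the weights from the example above. validate_weights is a hypothetical helper. Whether the API requires all 8 dimensions to be present is not stated here, so the sketch does not enforce it.

```python
RAIL_DIMENSIONS = {
    "fairness", "safety", "reliability", "transparency",
    "privacy", "accountability", "inclusivity", "user_impact",
}


def validate_weights(weights: dict) -> dict:
    """Raise ValueError unless the keys are RAIL dimensions and the values sum to 100.

    Hypothetical helper; not part of the RAIL SDK.
    """
    unknown = set(weights) - RAIL_DIMENSIONS
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    total = sum(weights.values())
    if abs(total - 100) > 1e-9:  # small tolerance in case weights are floats
        raise ValueError(f"weights sum to {total}, expected 100")
    return weights


healthcare_weights = validate_weights({
    "safety": 25, "privacy": 20, "reliability": 20,
    "accountability": 15, "transparency": 10,
    "fairness": 5, "inclusivity": 3, "user_impact": 2,
})
```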

Score tiers

A score maps to one of five bands, from Excellent (9.0–10.0) down to Critical (0.0–2.9). See The RAIL Framework for the full table and what each band means.

Caching

Identical requests return cached results, so repeated scoring of the same content is fast and is not charged again. Basic-mode results are cached for 5 minutes, deep-mode results for 3 minutes.
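
One way to reason about when two calls count as identical is to fingerprint the request parameters. The sketch below is illustrative only: which fields the service actually uses for its cache key is not documented on this page, and request_fingerprint is a hypothetical helper. Only the two TTL values are taken from the text above; no TTL is given for auto mode.

```python
import hashlib
import json

# Cache lifetimes stated on this page, in seconds.
CACHE_TTL_SECONDS = {"basic": 5 * 60, "deep": 3 * 60}


def request_fingerprint(**params) -> str:
    """Hash request parameters in a canonical order.

    Illustrative only; the service's real cache key is an unknown.
    """
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = request_fingerprint(content="Your text here", mode="basic")
second = request_fingerprint(mode="basic", content="Your text here")
other = request_fingerprint(content="Your text here", mode="deep")
```

Here first and second match because only the argument order differs, while other differs because the mode changed.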

  • API Reference: Evaluation (full endpoint specification)
  • Python SDK: Evaluation (Python code examples)