Evaluation - RAIL Score

Evaluation is the foundation of the RAIL Score system. Every other feature depends on evaluation scores.

API endpoint: POST /railscore/v1/eval | Python: client.eval() | JavaScript: client.eval()

The 8 RAIL dimensions

Dimension	What it measures
Fairness	Equitable treatment across demographics. No bias or stereotyping.
Safety	Absence of harmful, toxic, or dangerous content.
Reliability	Factual accuracy, internal consistency, appropriate calibration.
Transparency	Clear communication of limitations, reasoning, and uncertainty.
Privacy	Protection of personal information and data minimization.
Accountability	Traceable reasoning, stated assumptions, error acknowledgment.
Inclusivity	Inclusive language, accessibility, cultural awareness.
User Impact	Positive value delivered at the right detail level and tone.

For the full definition of each dimension, its score anchors, and worked examples, see The RAIL Framework.

Basic, deep, and auto modes

All modes score the same 8 dimensions and return the same overall RAIL score. They differ in depth and what detail comes back.

Basic mode

Runs RAIL’s core scoring models. Fast (typically under a second) and built for real-time scoring in production. Returns: overall score, per-dimension scores, and confidence values.

result = client.eval(content="Your text here", mode="basic")
# result.rail_score.score       -> 7.6
# result.dimension_scores       -> {fairness: {score: 7.7, confidence: 0.84}, ...}

Deep mode

A deeper, more detailed analysis of the content. Takes a few seconds and, on top of the scores, can return a per-dimension explanation, issue tags, and improvement suggestions.

result = client.eval(
    content="Your text here",
    mode="deep",
    include_explanations=True,
    include_issues=True,
    include_suggestions=True,
)
# result.dimension_scores["transparency"].explanation -> "The process is mostly clear, but..."
# result.dimension_scores["safety"].issues            -> ["Potential phishing risks"]

Auto mode

Runs basic on every request and escalates to deep only when an issue is detected: a low-scoring or low-confidence dimension, or a flagged signal. Clean content stays fast and cheap; content that needs scrutiny automatically gets the deeper analysis.

result = client.eval(content="Your text here", mode="auto")
# result.resolved_mode -> "basic"  (clean content — stayed fast)
#                       -> "deep"   (issue detected — escalated)
# result.escalated      -> False / True

The resolved_mode and escalated fields in the response tell you which tier ran. You’re billed at the tier that actually ran.

Which to use: reach for basic when you score on the hot path of a production request and want a fast verdict. Reach for deep when you need to show a reviewer why something scored low, or when you are debugging and tuning a policy, because it returns explanations and issue tags. Reach for auto when you want basic’s speed on most traffic and automatic deep analysis on the content that needs it, without deciding upfront.
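The guidance above can be condensed into a small decision helper. This is only a sketch: choose_mode and its two flags are hypothetical names invented for illustration and are not part of the RAIL SDK. Only the mode strings "basic", "deep", and "auto" come from this page.

```python
def choose_mode(always_needs_explanations: bool, can_wait_on_flagged: bool) -> str:
    """Pick an eval mode from two yes/no questions (hypothetical helper)."""
    if always_needs_explanations:
        # Reviewer-facing output or policy tuning: explanations and issue tags
        # are wanted for every item, so pay for deep every time.
        return "deep"
    if can_wait_on_flagged:
        # Most traffic stays on basic; flagged content may take a few seconds.
        return "auto"
    # Hot path with no room for escalation: fast verdict only.
    return "basic"


print(choose_mode(always_needs_explanations=False, can_wait_on_flagged=True))
```

The returned string can then be passed as the mode argument of client.eval().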

The response

Every evaluation returns:

rail_score — the overall score (0–10), its confidence, and a one-line summary.
dimension_scores — a score and confidence for each of the 8 dimensions. In deep mode each also carries an explanation and issues (and suggestions when requested).
policy_outcome — how your application’s policy judged the result.
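A common next step is to walk dimension_scores and pick out the dimensions that need attention. The sketch below assumes each entry exposes score and confidence attributes, as the examples on this page suggest. DimensionScore is a local stand-in rather than an SDK class, flag_dimensions is a hypothetical helper, and both the thresholds and the sample values (other than fairness) are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DimensionScore:
    """Stand-in for one entry of result.dimension_scores (assumed shape)."""
    score: float
    confidence: float


def flag_dimensions(dimension_scores, min_score=5.0, min_confidence=0.6):
    """Return names of low-scoring or low-confidence dimensions, lowest score first."""
    flagged = [
        name
        for name, dim in dimension_scores.items()
        if dim.score < min_score or dim.confidence < min_confidence
    ]
    return sorted(flagged, key=lambda name: dimension_scores[name].score)


# Illustrative data only; a real call would pass result.dimension_scores.
sample = {
    "fairness": DimensionScore(7.7, 0.84),
    "safety": DimensionScore(3.2, 0.91),
    "privacy": DimensionScore(8.1, 0.45),
}
print(flag_dimensions(sample))  # ['safety', 'privacy']
```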

Selective dimensions

To score only a subset of the 8 dimensions, pass their names in dimensions:

result = client.eval(
    content="Your text here",
    mode="basic",
    dimensions=["safety", "privacy", "reliability"],
)
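Because dimension names are plain strings, a typo is easy to miss. The sketch below checks a list locally before it is sent. The 8 identifiers are taken from the examples on this page; check_dimensions is a hypothetical helper, and how the API itself responds to unknown or duplicate names is not specified here.

```python
RAIL_DIMENSIONS = frozenset({
    "fairness", "safety", "reliability", "transparency",
    "privacy", "accountability", "inclusivity", "user_impact",
})


def check_dimensions(dimensions):
    """Raise ValueError on unknown or duplicate names; otherwise return a list copy."""
    unknown = sorted(set(dimensions) - RAIL_DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown RAIL dimensions: {unknown}")
    if len(set(dimensions)) != len(dimensions):
        raise ValueError("duplicate dimension names")
    return list(dimensions)


print(check_dimensions(["safety", "privacy", "reliability"]))
```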

Custom weights

Pass weights to set the relative importance of each dimension. Weights must sum to 100:

result = client.eval(
    content="Patient should take 500mg ibuprofen every 4 hours.",
    mode="deep",
    domain="healthcare",
    weights={
        "safety": 25, "privacy": 20, "reliability": 20,
        "accountability": 15, "transparency": 10,
        "fairness": 5, "inclusivity": 3, "user_impact": 2,
    },
)
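The sum-to-100 rule can be checked locally before making a request. The following is a minimal sketch: check_weights is a hypothetical helper, not an SDK function, and whether the API requires all 8 dimensions to be present or accepts fractional weights is not stated on this page, so neither is enforced here.

```python
import math


def check_weights(weights):
    """Raise ValueError unless weights are non-negative and sum to 100."""
    negative = sorted(name for name, w in weights.items() if w < 0)
    if negative:
        raise ValueError(f"negative weights: {negative}")
    total = sum(weights.values())
    # isclose tolerates float rounding in case fractional weights are used.
    if not math.isclose(total, 100):
        raise ValueError(f"weights sum to {total}, expected 100")
    return weights


healthcare_weights = {
    "safety": 25, "privacy": 20, "reliability": 20,
    "accountability": 15, "transparency": 10,
    "fairness": 5, "inclusivity": 3, "user_impact": 2,
}
check_weights(healthcare_weights)  # passes: 25+20+20+15+10+5+3+2 = 100
```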

Score tiers

A score maps to one of five bands, from Excellent (9.0–10.0) down to Critical (0.0–2.9). See The RAIL Framework for the full table and what each band means.

Caching

Identical requests return cached results, so repeated scoring of the same content is fast and not re-charged. Basic mode caches for 5 minutes, deep mode for 3 minutes.

API Reference: Evaluation

Full endpoint specification

Python SDK: Evaluation

Python code examples