
The setup

We are building a customer support chatbot for a fictional SaaS product called “CloudDash”, a cloud monitoring dashboard. The chatbot answers questions about pricing, features, and troubleshooting. Along the way, we add RAIL Score evaluation at every layer to ensure the chatbot’s responses are safe, accurate, fair, and helpful.

Install dependencies

pip install "rail-score-sdk[openai,google,langfuse]" openai google-genai

Environment variables

Create a .env file:
RAIL_API_KEY=YOUR_RAIL_API_KEY
OPENAI_API_KEY=sk-your_openai_key
GEMINI_API_KEY=your_gemini_key

# Optional: for Part 2 (Langfuse observability)
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com
Get your RAIL API key: Sign up at responsibleailabs.ai/dashboard. The free tier includes 100 credits to follow this entire tutorial.
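To make these variables available to the scripts below, load the .env file at startup. The python-dotenv package is the usual choice; as a dependency-free alternative, here is a minimal sketch (the load_env helper is our own, not part of any SDK) that handles simple KEY=VALUE files:

```python
import os


def load_env(path: str = ".env") -> None:
    """Read simple KEY=VALUE lines from a .env file into os.environ.
    Variables already set in the environment take precedence."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and lines without an '='
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())


if os.path.exists(".env"):
    load_env()
```

If you prefer python-dotenv, `from dotenv import load_dotenv; load_dotenv()` achieves the same thing.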

Build the basic chatbot

Start with a basic chatbot using OpenAI directly, with no RAIL integration yet. This is the foundation we will layer scoring onto.
chatbot.py
import openai
import os

openai_client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

SYSTEM_PROMPT = """You are CloudDash Support, a helpful assistant for
CloudDash — a cloud monitoring dashboard. Answer questions about
pricing, features, setup, and troubleshooting. Be concise and accurate.
If you don't know something, say so."""


def chat(user_message: str, history: list[dict] | None = None) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    if history:
        messages.extend(history)
    messages.append({"role": "user", "content": user_message})

    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.3,
    )
    return response.choices[0].message.content


reply = chat("What pricing plans do you offer?")
print(reply)
This works, but we have zero visibility into response quality. Is the response safe? Factually accurate? Free of bias? We have no way to know until we add RAIL Score.
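The `history` parameter lets callers carry context across turns. A small helper keeps that bookkeeping in one place (record_turn is our own convenience function, not part of the tutorial's SDK):

```python
def record_turn(history: list[dict], user_message: str, assistant_reply: str) -> list[dict]:
    """Append a completed user/assistant exchange to the running
    conversation history, in the message format chat() expects."""
    history.append({"role": "user", "content": user_message})
    history.append({"role": "assistant", "content": assistant_reply})
    return history


# Intended usage with chat() from chatbot.py:
#   history = []
#   reply = chat("What pricing plans do you offer?")
#   record_turn(history, "What pricing plans do you offer?", reply)
#   followup = chat("Which plan includes alerting?", history=history)
```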

Add RAIL Score evaluation

The simplest way to add RAIL evaluation is with RailScoreClient. One call gives us scores across all 8 RAIL dimensions.
chatbot_with_eval.py
from rail_score_sdk import RailScoreClient
import os

rail = RailScoreClient(api_key=os.getenv("RAIL_API_KEY"))

reply = chat("What pricing plans do you offer?")

result = rail.eval(content=reply, mode="basic")

print(f"Overall Score: {result.rail_score.score}")
print(f"Confidence:    {result.rail_score.confidence}")
print()
for dim_name, dim_score in result.dimension_scores.items():
    print(f"  {dim_name:15s} {dim_score.score}")
Overall Score: 8.4
Confidence:    0.91

  fairness        8.5
  safety          9.2
  reliability     7.8
  transparency    8.0
  privacy         5.0
  accountability  8.1
  inclusivity     8.7
  user_impact     9.0

Interpreting the results

Dimension       Score  What it means
Safety          9.2    No harmful content, appropriate for all users
User Impact     9.0    Directly answers the question at the right detail level
Inclusivity     8.7    Accessible language, no exclusionary terms
Fairness        8.5    Equitable treatment, no demographic bias
Accountability  8.1    Clear reasoning, traceable claims
Transparency    8.0    Honest representation of knowledge
Reliability     7.8    Mostly accurate, but pricing details are synthetic
Privacy         5.0    Not applicable: no PII involved
Privacy = 5.0 means “not applicable.” RAIL returns 5.0 (neutral) when privacy is irrelevant to the content being evaluated.
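When acting on these scores programmatically, treat the neutral 5.0 privacy score as "not applicable" rather than a failure, or a naive threshold check will flag it. A small sketch (flag_low_dimensions and the 8.0 threshold are our own choices, not SDK API):

```python
def flag_low_dimensions(scores: dict[str, float], threshold: float = 8.0) -> list[str]:
    """Return the names of dimensions scoring below the threshold,
    skipping a neutral 5.0 privacy score, which RAIL uses to mean
    'not applicable' rather than 'poor'."""
    flagged = []
    for name, score in scores.items():
        if name == "privacy" and score == 5.0:
            continue  # neutral: privacy was irrelevant to this content
        if score < threshold:
            flagged.append(name)
    return flagged
```

With the scores above and a threshold of 8.0, only `reliability` (7.8) is flagged, which matches the deep-mode findings in the next section.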

Deep evaluation

Basic mode gives you scores. Deep mode gives you the why: per-dimension explanations, detected issues, and improvement suggestions.
deep_eval.py
result = rail.eval(content=reply, mode="deep")

print(f"Overall: {result.rail_score.score}")
print()

for dim_name, detail in result.dimension_scores.items():
    print(f"--- {dim_name} (score: {detail.score}) ---")
    print(f"  Explanation: {detail.explanation}")
    if detail.issues:
        print(f"  Issues: {', '.join(detail.issues)}")
    if detail.suggestions:
        print(f"  Suggestion: {detail.suggestions[0]}")
    print()
Overall: 8.4

--- reliability (score: 7.8) ---
  Explanation: The response provides specific pricing figures ($29, $79)
  that appear reasonable but cannot be verified against actual CloudDash
  pricing. The feature breakdown is plausible but unconfirmed.
  Issues: overconfident_claim
  Suggestion: Add a disclaimer that pricing is subject to change, or
  link to the official pricing page for the most current information.

--- transparency (score: 8.0) ---
  Explanation: The response clearly presents the three tiers with
  distinct features. However, it doesn't disclose that it may not have
  the latest pricing information.
  Issues: concealed_limitation
  Suggestion: Acknowledge that pricing details should be verified on
  the official website.

--- user_impact (score: 9.0) ---
  Explanation: Directly addresses the user's pricing question with a
  well-structured comparison. The follow-up question adds value.

Basic vs Deep

                 Basic                          Deep
Cost             1 credit                       3 credits
Scores           Overall + 8 dimensions         Overall + 8 dimensions
Explanations     No                             Yes, per dimension
Issue detection  No                             Yes
Best for         High-volume, real-time checks  Debugging, auditing, post-hoc analysis
Cost-saving tip: Use basic mode for every response in production, and deep mode selectively. For example, trigger deep mode when a basic score drops below your threshold, or as a periodic audit on a sample of responses.
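That escalation pattern is easy to factor out. A minimal sketch, written against a generic `evaluate(content, mode) -> score` callable so it runs without network access (the function name and 7.0 threshold are our own choices):

```python
from typing import Callable

DEEP_THRESHOLD = 7.0  # our own cutoff; tune to your quality bar


def evaluate_with_escalation(
    evaluate: Callable[[str, str], float],
    content: str,
    threshold: float = DEEP_THRESHOLD,
) -> dict:
    """Score with cheap basic mode first; re-score in deep mode only
    when the basic score falls below the threshold."""
    basic = evaluate(content, "basic")
    if basic >= threshold:
        return {"mode": "basic", "score": basic}
    # Low score: pay the extra credits for explanations and issues
    return {"mode": "deep", "score": evaluate(content, "deep")}
```

In production, `evaluate` would wrap the client from earlier, e.g. `lambda text, mode: rail.eval(content=text, mode=mode).rail_score.score`.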

What’s next

Part 2: Production Features

Provider wrappers, policy enforcement (block/regenerate), multi-turn session tracking, and Langfuse observability.