Part 2: Production Features - Provider wrappers, policy enforcement, sessions, and observability.

Setup

We are building a customer support chatbot for a fictional SaaS product called "CloudDash", a cloud monitoring dashboard. The chatbot answers questions about pricing, features, and troubleshooting. Along the way, we add RAIL Score evaluation at each layer so that the chatbot's responses are safe, accurate, fair, and helpful.

Install dependencies

pip install "rail-score-sdk[openai,google,langfuse]" openai google-genai

Environment variables

Create a .env file:
RAIL_API_KEY=YOUR_RAIL_API_KEY
OPENAI_API_KEY=sk-your_openai_key
GEMINI_API_KEY=your_gemini_key

# Optional: for Part 2 (Langfuse observability)
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com
Get your RAIL API key: sign up at responsibleailabs.ai/dashboard. The free tier includes 100 credits, which is enough to follow this entire tutorial.
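The scripts in this tutorial read their keys with os.getenv, which only sees variables that are already in the process environment; a .env file on disk is not picked up automatically. The following is a minimal, standard-library sketch of one way to load it. The helper name load_env_file is hypothetical and not part of the RAIL or OpenAI SDKs, and the parser handles only simple KEY=VALUE lines.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Copy KEY=VALUE lines from a file into os.environ (hypothetical helper)."""
    loaded: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded  # nothing to load; rely on the existing environment

    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Variables already set in the shell take precedence over the file.
        if key and key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded
```

Calling load_env_file() once at the top of a script, before any client is created, should be enough for the examples here. The python-dotenv package offers a more complete load_dotenv() if you prefer a maintained dependency.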

Build a basic chatbot

First, build a basic chatbot directly on OpenAI, without any RAIL integration. This is the foundation on which we will layer scoring.
chatbot.py
import openai
import os

openai_client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

SYSTEM_PROMPT = """You are CloudDash Support, a helpful assistant for
CloudDash — a cloud monitoring dashboard. Answer questions about
pricing, features, setup, and troubleshooting. Be concise and accurate.
If you don't know something, say so."""


def chat(user_message: str, history: list[dict] = None) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    if history:
        messages.extend(history)
    messages.append({"role": "user", "content": user_message})

    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.3,
    )
    return response.choices[0].message.content


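# Run the sample question only when this file is executed directly,
# so other scripts can import chat() without triggering an API call.
if __name__ == "__main__":
    reply = chat("What pricing plans do you offer?")
    print(reply)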
This works, but it gives no visibility into response quality. Is the response safe? Is it factually accurate? Is there any bias? Until we add RAIL Score, there is no way to know.
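The chat function accepts an optional history list, but the example above only sends a single message. The sketch below shows one way to carry a conversation across turns. The run_turn helper and the fake_chat stand-in are hypothetical names introduced here so the snippet runs offline; in the real chatbot you would pass the chat function from chatbot.py instead.

```python
from typing import Callable

Message = dict[str, str]


def run_turn(
    chat_fn: Callable[[str, list[Message]], str],
    history: list[Message],
    user_message: str,
) -> str:
    """Send one user message and record both sides of the exchange."""
    reply = chat_fn(user_message, history)
    # Append after the call: chat() adds the current user message itself.
    history.append({"role": "user", "content": user_message})
    history.append({"role": "assistant", "content": reply})
    return reply


def fake_chat(user_message: str, history: list[Message]) -> str:
    """Offline stand-in for chat(); reports how many prior messages it saw."""
    return f"(seen {len(history)} earlier messages) You asked: {user_message}"


history: list[Message] = []
run_turn(fake_chat, history, "What pricing plans do you offer?")
second = run_turn(fake_chat, history, "Which one includes alerts?")
print(second)
```

The history list uses the same role/content dictionaries that chat() already extends its messages with, so no conversion step is needed.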

Add RAIL Score evaluation

The simplest way to add RAIL evaluation is to use RailScoreClient. A single call returns scores for all 8 RAIL dimensions.
chatbot_with_eval.py
import os

from chatbot import chat  # chat() as defined in chatbot.py above
from rail_score_sdk import RailScoreClient

rail = RailScoreClient(api_key=os.getenv("RAIL_API_KEY"))

reply = chat("What pricing plans do you offer?")

result = rail.eval(content=reply, mode="basic")

print(f"Overall Score: {result.rail_score.score}")
print(f"Confidence:    {result.rail_score.confidence}")
print()
for dim_name, dim_score in result.dimension_scores.items():
    print(f"  {dim_name:15s} {dim_score.score}")
Overall Score: 8.4
Confidence:    0.91

  fairness        8.5
  safety          9.2
  reliability     7.8
  transparency    8.0
  privacy         5.0
  accountability  8.1
  inclusivity     8.7
  user_impact     9.0

Understanding the results

| Dimension | Score | What it means |
|---|---|---|
| Safety | 9.2 | No harmful content; appropriate for all users |
| User Impact | 9.0 | Answers the question directly at the right level of detail |
| Inclusivity | 8.7 | Accessible language, no exclusionary terms |
| Fairness | 8.5 | Equal treatment, no demographic bias |
| Accountability | 8.1 | Clear reasoning, traceable claims |
| Transparency | 8.0 | Honest representation of knowledge |
| Reliability | 7.8 | Mostly accurate, but the pricing details are synthetic |
| Privacy | 5.0 | Not applicable: no PII involved |
Privacy = 5.0 means "not applicable." When privacy is not relevant to the content being evaluated, RAIL returns 5.0 (neutral).
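Because a neutral 5.0 on privacy is not a real weakness, it can be misleading to sort dimensions by score and act on the lowest one. The sketch below, using the sample scores from above as a plain dictionary, shows one way to pick the weakest dimensions while skipping that neutral value. The helper name weakest_dimensions is hypothetical, and it assumes that an exact 5.0 on privacy always means "not applicable"; a genuinely mediocre privacy score of exactly 5.0 would be skipped too.

```python
NEUTRAL_SCORE = 5.0  # value the tutorial says RAIL returns when privacy does not apply

sample_scores = {
    "fairness": 8.5,
    "safety": 9.2,
    "reliability": 7.8,
    "transparency": 8.0,
    "privacy": 5.0,
    "accountability": 8.1,
    "inclusivity": 8.7,
    "user_impact": 9.0,
}


def weakest_dimensions(scores: dict[str, float], count: int = 3) -> list[tuple[str, float]]:
    """Return the lowest-scoring dimensions, ignoring a neutral privacy score."""
    applicable = {
        name: score
        for name, score in scores.items()
        # Assumption: privacy == 5.0 is the "not applicable" marker, not a real score.
        if not (name == "privacy" and score == NEUTRAL_SCORE)
    }
    return sorted(applicable.items(), key=lambda item: item[1])[:count]


for name, score in weakest_dimensions(sample_scores):
    print(f"{name:15s} {score}")
```

With a live result you could build the same dictionary from result.dimension_scores by reading each entry's score attribute.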

Deep evaluation

Basic mode gives you scores. Deep mode tells you why: explanations for each dimension, detected issues, and improvement suggestions.
deep_eval.py
# Continues from chatbot_with_eval.py: reuses the `rail` client and `reply` text.
result = rail.eval(content=reply, mode="deep")

print(f"Overall: {result.rail_score.score}")
print()

for dim_name, detail in result.dimension_scores.items():
    print(f"--- {dim_name} (score: {detail.score}) ---")
    print(f"  Explanation: {detail.explanation}")
    if detail.issues:
        print(f"  Issues: {', '.join(detail.issues)}")
    if detail.suggestions:
        print(f"  Suggestion: {detail.suggestions[0]}")
    print()
Overall: 8.4

--- reliability (score: 7.8) ---
  Explanation: The response provides specific pricing figures ($29, $79)
  that appear reasonable but cannot be verified against actual CloudDash
  pricing. The feature breakdown is plausible but unconfirmed.
  Issues: overconfident_claim
  Suggestion: Add a disclaimer that pricing is subject to change, or
  link to the official pricing page for the most current information.

--- transparency (score: 8.0) ---
  Explanation: The response clearly presents the three tiers with
  distinct features. However, it doesn't disclose that it may not have
  the latest pricing information.
  Issues: concealed_limitation
  Suggestion: Acknowledge that pricing details should be verified on
  the official website.

--- user_impact (score: 9.0) ---
  Explanation: Directly addresses the user's pricing question with a
  well-structured comparison. The follow-up question adds value.
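For auditing, it can help to flatten a deep result into one row per detected issue. The sketch below does that using only the fields the loop above reads (issues and suggestions). The collect_issues helper is a hypothetical name, and the SimpleNamespace objects are offline stand-ins that mirror the sample output rather than real SDK result types.

```python
from types import SimpleNamespace


def collect_issues(dimension_scores: dict) -> list[tuple[str, str, str]]:
    """Return (dimension, issue, first suggestion) rows for every flagged issue."""
    rows = []
    for name, detail in dimension_scores.items():
        issues = getattr(detail, "issues", None) or []
        suggestions = getattr(detail, "suggestions", None) or []
        suggestion = suggestions[0] if suggestions else ""
        for issue in issues:
            rows.append((name, issue, suggestion))
    return rows


# Stand-in data copied from the sample deep output above.
sample_details = {
    "reliability": SimpleNamespace(
        score=7.8,
        issues=["overconfident_claim"],
        suggestions=["Add a disclaimer that pricing is subject to change."],
    ),
    "transparency": SimpleNamespace(
        score=8.0,
        issues=["concealed_limitation"],
        suggestions=["Acknowledge that pricing details should be verified."],
    ),
    "user_impact": SimpleNamespace(score=9.0, issues=[], suggestions=[]),
}

for dimension, issue, suggestion in collect_issues(sample_details):
    print(f"{dimension}: {issue} -> {suggestion}")
```

Dimensions with no issues, such as user_impact here, produce no rows, so the report lists only what needs attention.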

Basic vs Deep

| | Basic | Deep |
|---|---|---|
| Cost | 1 credit | 3 credits |
| Scores | Overall + 8 dimensions | Overall + 8 dimensions |
| Explanations | No | Yes, for every dimension |
| Issue detection | No | Yes |
| Best for | High-volume, real-time checks | Debugging, auditing, post-hoc analysis |
Cost-saving tip: In production, use basic mode for every response and use deep mode selectively. For example, trigger deep mode when the basic score falls below your threshold, or run it on a sample of responses for periodic audits.
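The escalation pattern in the tip can be sketched as follows. The function takes the evaluator as a parameter so it runs offline with a stand-in; in practice you would pass rail.eval, which the earlier examples call with content and mode keyword arguments. The names evaluate_with_escalation and make_fake_eval, the 7.0 threshold, and the 5% audit rate are all assumptions chosen for illustration, not SDK defaults.

```python
import random
from types import SimpleNamespace
from typing import Any, Callable

BASIC_CREDITS = 1  # per-call costs from the comparison table
DEEP_CREDITS = 3


def evaluate_with_escalation(
    content: str,
    eval_fn: Callable[..., Any],
    threshold: float = 7.0,    # hypothetical cutoff, tune for your product
    audit_rate: float = 0.05,  # hypothetical share of passing responses to audit
    rng: Callable[[], float] = random.random,
) -> tuple[Any, int]:
    """Run a basic evaluation, then a deep one only when it is warranted.

    Returns the most detailed result obtained and the credits spent.
    """
    basic = eval_fn(content=content, mode="basic")
    below_threshold = basic.rail_score.score < threshold
    sampled_for_audit = rng() < audit_rate
    if not (below_threshold or sampled_for_audit):
        return basic, BASIC_CREDITS
    deep = eval_fn(content=content, mode="deep")
    return deep, BASIC_CREDITS + DEEP_CREDITS


def make_fake_eval(score: float) -> Callable[..., Any]:
    """Offline stand-in for rail.eval that always reports the given score."""
    def fake_eval(content: str, mode: str) -> Any:
        return SimpleNamespace(rail_score=SimpleNamespace(score=score), mode=mode)
    return fake_eval


result, spent = evaluate_with_escalation(
    "Our Pro plan is $79/month.", make_fake_eval(6.2), rng=lambda: 0.99
)
print(result.mode, spent)
```

A response that passes the threshold and is not sampled costs 1 credit; one that escalates costs 4 in total, since the basic call has already been made.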

What's next

Part 2: Production Features

Provider wrappers, policy enforcement (block/regenerate), multi-turn session tracking, and Langfuse observability.