
The setup

We are building a customer support chatbot for a fictional SaaS product called “CloudDash”, a cloud monitoring dashboard. The chatbot answers questions about pricing, features, and troubleshooting. Along the way, we add RAIL Score evaluation at every layer to ensure the chatbot’s responses are safe, accurate, fair, and helpful.

Install dependencies

pip install "rail-score-sdk[openai,google,langfuse]" openai google-genai

Environment variables

Create a .env file:
RAIL_API_KEY=YOUR_RAIL_API_KEY
OPENAI_API_KEY=sk-your_openai_key
GEMINI_API_KEY=your_gemini_key

# Optional: for Part 2 (Langfuse observability)
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com
Get your RAIL API key: Sign up at responsibleailabs.ai/dashboard. The free tier includes 100 credits to follow this entire tutorial.
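To make these variables available to the scripts below, load the .env file at startup. The python-dotenv package is the usual choice; as a dependency-free alternative, here is a minimal sketch (the load_env helper is our own, not part of any SDK) that handles simple KEY=VALUE files:

```python
import os


def load_env(path: str = ".env") -> None:
    """Read simple KEY=VALUE lines from a .env file into os.environ.
    Variables already set in the environment take precedence."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and lines without an '='
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())


if os.path.exists(".env"):
    load_env()
```

If you prefer python-dotenv, `from dotenv import load_dotenv; load_dotenv()` achieves the same thing.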

Build the basic chatbot

Start with a basic chatbot using OpenAI directly, with no RAIL integration yet. This is the foundation we will layer scoring onto.
chatbot.py
import openai
import os

openai_client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

SYSTEM_PROMPT = """You are CloudDash Support, a helpful assistant for
CloudDash — a cloud monitoring dashboard. Answer questions about
pricing, features, setup, and troubleshooting. Be concise and accurate.
If you don't know something, say so."""


def chat(user_message: str, history: list[dict] | None = None) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    if history:
        messages.extend(history)
    messages.append({"role": "user", "content": user_message})

    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.3,
    )
    return response.choices[0].message.content


reply = chat("What pricing plans do you offer?")
print(reply)
This works, but we have zero visibility into response quality. Is the response safe? Factually accurate? Free of bias? We have no way to know until we add RAIL Score.
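The `history` parameter lets callers carry context across turns. A small helper keeps that bookkeeping in one place (record_turn is our own convenience function, not part of the tutorial's SDK):

```python
def record_turn(history: list[dict], user_message: str, assistant_reply: str) -> list[dict]:
    """Append a completed user/assistant exchange to the running
    conversation history, in the message format chat() expects."""
    history.append({"role": "user", "content": user_message})
    history.append({"role": "assistant", "content": assistant_reply})
    return history


# Intended usage with chat() from chatbot.py:
#   history = []
#   reply = chat("What pricing plans do you offer?")
#   record_turn(history, "What pricing plans do you offer?", reply)
#   followup = chat("Which plan includes alerting?", history=history)
```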

Add RAIL Score evaluation

The simplest way to add RAIL evaluation is with RailScoreClient. One call gives us scores across all 8 RAIL dimensions.
chatbot_with_eval.py
from rail_score_sdk import RailScoreClient
import os

rail = RailScoreClient(api_key=os.getenv("RAIL_API_KEY"))

reply = chat("What pricing plans do you offer?")

result = rail.eval(content=reply, mode="basic")

print(f"Overall Score: {result.rail_score.score}")
print(f"Confidence:    {result.rail_score.confidence}")
print()
for dim_name, dim_score in result.dimension_scores.items():
    print(f"  {dim_name:15s} {dim_score.score}")
Overall Score: 8.4
Confidence:    0.91

  fairness        8.5
  safety          9.2
  reliability     7.8
  transparency    8.0
  privacy         5.0
  accountability  8.1
  inclusivity     8.7
  user_impact     9.0

Interpreting the results

Dimension       Score  What it means
Safety          9.2    No harmful content, appropriate for all users
User Impact     9.0    Directly answers the question at the right detail level
Inclusivity     8.7    Accessible language, no exclusionary terms
Fairness        8.5    Equitable treatment, no demographic bias
Accountability  8.1    Clear reasoning, traceable claims
Transparency    8.0    Honest representation of knowledge
Reliability     7.8    Mostly accurate, but pricing details are synthetic
Privacy         5.0    Not applicable: no PII involved
Privacy = 5.0 means “not applicable.” RAIL returns 5.0 (neutral) when privacy is irrelevant to the content being evaluated.
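When acting on these scores programmatically, treat the neutral 5.0 privacy score as "not applicable" rather than a failure, or a naive threshold check will flag it. A small sketch (flag_low_dimensions and the 8.0 threshold are our own choices, not SDK API):

```python
def flag_low_dimensions(scores: dict[str, float], threshold: float = 8.0) -> list[str]:
    """Return the names of dimensions scoring below the threshold,
    skipping a neutral 5.0 privacy score, which RAIL uses to mean
    'not applicable' rather than 'poor'."""
    flagged = []
    for name, score in scores.items():
        if name == "privacy" and score == 5.0:
            continue  # neutral: privacy was irrelevant to this content
        if score < threshold:
            flagged.append(name)
    return flagged
```

With the scores above and a threshold of 8.0, only `reliability` (7.8) is flagged, which matches the deep-mode findings in the next section.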

Deep evaluation

Basic mode gives you scores. Deep mode gives you the why: per-dimension explanations, detected issues, and improvement suggestions.
deep_eval.py
result = rail.eval(content=reply, mode="deep")

print(f"Overall: {result.rail_score.score}")
print()

for dim_name, detail in result.dimension_scores.items():
    print(f"--- {dim_name} (score: {detail.score}) ---")
    print(f"  Explanation: {detail.explanation}")
    if detail.issues:
        print(f"  Issues: {', '.join(detail.issues)}")
    if detail.suggestions:
        print(f"  Suggestion: {detail.suggestions[0]}")
    print()
Overall: 8.4

--- reliability (score: 7.8) ---
  Explanation: The response provides specific pricing figures ($29, $79)
  that appear reasonable but cannot be verified against actual CloudDash
  pricing. The feature breakdown is plausible but unconfirmed.
  Issues: overconfident_claim
  Suggestion: Add a disclaimer that pricing is subject to change, or
  link to the official pricing page for the most current information.

--- transparency (score: 8.0) ---
  Explanation: The response clearly presents the three tiers with
  distinct features. However, it doesn't disclose that it may not have
  the latest pricing information.
  Issues: concealed_limitation
  Suggestion: Acknowledge that pricing details should be verified on
  the official website.

--- user_impact (score: 9.0) ---
  Explanation: Directly addresses the user's pricing question with a
  well-structured comparison. The follow-up question adds value.

Basic vs Deep

                 Basic                          Deep
Cost             1 credit                       3 credits
Scores           Overall + 8 dimensions         Overall + 8 dimensions
Explanations     No                             Yes, per dimension
Issue detection  No                             Yes
Best for         High-volume, real-time checks  Debugging, auditing, post-hoc analysis
Cost-saving tip: Use basic mode for every response in production, and deep mode selectively. For example, trigger deep mode when a basic score drops below your threshold, or as a periodic audit on a sample of responses.
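That escalation pattern is easy to factor out. A minimal sketch, written against a generic `evaluate(content, mode) -> score` callable so it runs without network access (the function name and 7.0 threshold are our own choices):

```python
from typing import Callable

DEEP_THRESHOLD = 7.0  # our own cutoff; tune to your quality bar


def evaluate_with_escalation(
    evaluate: Callable[[str, str], float],
    content: str,
    threshold: float = DEEP_THRESHOLD,
) -> dict:
    """Score with cheap basic mode first; re-score in deep mode only
    when the basic score falls below the threshold."""
    basic = evaluate(content, "basic")
    if basic >= threshold:
        return {"mode": "basic", "score": basic}
    # Low score: pay the extra credits for explanations and issues
    return {"mode": "deep", "score": evaluate(content, "deep")}
```

In production, `evaluate` would wrap the client from earlier, e.g. `lambda text, mode: rail.eval(content=text, mode=mode).rail_score.score`.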

What’s next

Part 2: Production Features

Provider wrappers, policy enforcement (block/regenerate), multi-turn session tracking, and Langfuse observability.