
# Building a Responsible AI Chatbot

> Part 1 of 2 - Setup, basic evaluation, deep analysis, and understanding RAIL scores.

<Info>
  **Part 2:** [Production Features](/use-cases/ai-chatbot-production) - Provider wrappers, policy enforcement, sessions, and observability.
</Info>

## The setup

We are building a customer support chatbot for a fictional SaaS product called "CloudDash", a cloud monitoring dashboard. The chatbot answers questions about pricing, features, and troubleshooting. Along the way, we add RAIL Score evaluation at every layer to ensure the chatbot's responses are safe, accurate, fair, and helpful.

### Install dependencies

```bash theme={null}
pip install "rail-score-sdk[openai,google,langfuse]" openai google-genai
```

### Environment variables

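Create a `.env` file. The scripts below read these values with `os.getenv`, which does not parse `.env` files on its own, so load the file into your environment first (for example with a tool such as `python-dotenv`, or by exporting the variables in your shell):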

```bash theme={null}
RAIL_API_KEY=YOUR_RAIL_API_KEY
OPENAI_API_KEY=sk-your_openai_key
GEMINI_API_KEY=your_gemini_key

# Optional: for Part 2 (Langfuse observability)
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com
```

<Note>
  **Get your RAIL API key:** Sign up at [responsibleailabs.ai/dashboard](https://responsibleailabs.ai/dashboard). The free tier includes 100 credits, enough to follow this entire tutorial.
</Note>
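
Because `os.getenv` returns `None` for an unset variable, a missing key tends to surface only later as an authentication error. A minimal sketch of a fail-fast check follows. `missing_env` is a hypothetical helper, not part of the RAIL SDK. The variable names are the ones from the `.env` file above, and the example environment is a plain dict so the snippet runs without real keys.

```python
import os
from collections.abc import Mapping
from typing import Optional

REQUIRED_VARS = ("RAIL_API_KEY", "OPENAI_API_KEY")


def missing_env(names, env: Optional[Mapping] = None) -> list:
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Illustrative environment: one key set, one left empty.
example_env = {"RAIL_API_KEY": "YOUR_RAIL_API_KEY", "OPENAI_API_KEY": ""}
print(missing_env(REQUIRED_VARS, example_env))  # ['OPENAI_API_KEY']
```

Calling `missing_env(REQUIRED_VARS)` with no second argument checks the real process environment instead.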

## Build the basic chatbot

Start with a basic chatbot using OpenAI directly, with no RAIL integration yet. This is the foundation we will layer scoring onto.

```python chatbot.py theme={null}
import openai
import os

openai_client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

SYSTEM_PROMPT = """You are CloudDash Support, a helpful assistant for
CloudDash — a cloud monitoring dashboard. Answer questions about
pricing, features, setup, and troubleshooting. Be concise and accurate.
If you don't know something, say so."""


def chat(user_message: str, history: list[dict] | None = None) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    if history:
        messages.extend(history)
    messages.append({"role": "user", "content": user_message})

    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.3,
    )
    return response.choices[0].message.content


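# Guard the demo call so that importing chat() from another file does not trigger it.
if __name__ == "__main__":
    reply = chat("What pricing plans do you offer?")
    print(reply)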
```
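
`chat()` accepts prior turns through `history`, a list of OpenAI-style message dicts with `role` and `content` keys. A minimal sketch of keeping that list up to date between calls follows. `add_turn` is a hypothetical helper, and the reply text is a placeholder rather than real model output.

```python
def add_turn(history: list, user_message: str, reply: str) -> list:
    """Return a new history list with one user/assistant exchange appended."""
    return history + [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": reply},
    ]


history = []
history = add_turn(history, "What pricing plans do you offer?", "(assistant reply)")
print(len(history))  # 2

# With the chat() function above, a follow-up call would then look like:
#   reply = chat("Which plan includes alerting?", history=history)
```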

This works, but it gives us no visibility into response quality. Is the response safe? Is it factually accurate? Does it contain bias? Without RAIL Score, we have no way to know.

## Add RAIL Score evaluation

The simplest way to add RAIL evaluation is with `RailScoreClient`. One call gives us scores across all 8 RAIL dimensions.

```python chatbot_with_eval.py theme={null}
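import os

from chatbot import chat  # chat() as defined in chatbot.py above
from rail_score_sdk import RailScoreClient

rail = RailScoreClient(api_key=os.getenv("RAIL_API_KEY"))

reply = chat("What pricing plans do you offer?")

# Basic mode returns an overall score plus one score per RAIL dimension.
result = rail.eval(content=reply, mode="basic")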

print(f"Overall Score: {result.rail_score.score}")
print(f"Confidence:    {result.rail_score.confidence}")
print()
for dim_name, dim_score in result.dimension_scores.items():
    print(f"  {dim_name:15s} {dim_score.score}")
```

<Accordion title="Example output">
  ```
  Overall Score: 8.4
  Confidence:    0.91

    fairness        8.5
    safety          9.2
    reliability     7.8
    transparency    8.0
    privacy         5.0
    accountability  8.1
    inclusivity     8.7
    user_impact     9.0
  ```
</Accordion>

### Interpreting the results

| Dimension      | Score | What it means                                           |
| -------------- | ----- | ------------------------------------------------------- |
| Safety         | 9.2   | No harmful content, appropriate for all users           |
| User Impact    | 9.0   | Directly answers the question at the right detail level |
| Inclusivity    | 8.7   | Accessible language, no exclusionary terms              |
| Fairness       | 8.5   | Equitable treatment, no demographic bias                |
| Accountability | 8.1   | Clear reasoning, traceable claims                       |
| Transparency   | 8.0   | Honest representation of knowledge                      |
| Reliability    | 7.8   | Mostly accurate, but pricing details are synthetic      |
| Privacy        | 5.0   | Not applicable - no PII involved                        |

<Tip>
  **Privacy = 5.0** means "not applicable." RAIL returns 5.0 (neutral) when privacy is irrelevant to the content being evaluated.
</Tip>
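
One consequence of the neutral 5.0 is that a naive "flag anything below the threshold" check would flag privacy on every response that contains no PII. A minimal sketch of a filter that skips it follows, working on a plain `{dimension: score}` dict with the values from the example output above. `low_dimensions` and the 8.0 threshold are hypothetical. Treating an exact 5.0 on privacy as "not applicable" is an assumption based on the tip above, not a documented SDK contract.

```python
NEUTRAL_SCORE = 5.0  # assumption: exactly 5.0 on privacy means "not applicable"

scores = {
    "fairness": 8.5, "safety": 9.2, "reliability": 7.8, "transparency": 8.0,
    "privacy": 5.0, "accountability": 8.1, "inclusivity": 8.7, "user_impact": 9.0,
}


def low_dimensions(scores: dict, threshold: float) -> list:
    """Names of dimensions scoring below threshold, skipping neutral privacy."""
    return [
        name
        for name, score in scores.items()
        if score < threshold and not (name == "privacy" and score == NEUTRAL_SCORE)
    ]


print(low_dimensions(scores, threshold=8.0))  # ['reliability']
```

With a real result, the same dict can be built as `{name: d.score for name, d in result.dimension_scores.items()}`.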

## Deep evaluation

Basic mode gives you scores. Deep mode gives you the *why*: per-dimension explanations, detected issues, and improvement suggestions.

```python deep_eval.py theme={null}
# Continues from chatbot_with_eval.py: reuses `rail` and `reply`.
result = rail.eval(content=reply, mode="deep")

print(f"Overall: {result.rail_score.score}")
print()

for dim_name, detail in result.dimension_scores.items():
    print(f"--- {dim_name} (score: {detail.score}) ---")
    print(f"  Explanation: {detail.explanation}")
    if detail.issues:
        print(f"  Issues: {', '.join(detail.issues)}")
    if detail.suggestions:
        print(f"  Suggestion: {detail.suggestions[0]}")
    print()
```

<Accordion title="Example output">
  ```
  Overall: 8.4

  --- reliability (score: 7.8) ---
    Explanation: The response provides specific pricing figures ($29, $79)
    that appear reasonable but cannot be verified against actual CloudDash
    pricing. The feature breakdown is plausible but unconfirmed.
    Issues: overconfident_claim
    Suggestion: Add a disclaimer that pricing is subject to change, or
    link to the official pricing page for the most current information.

  --- transparency (score: 8.0) ---
    Explanation: The response clearly presents the three tiers with
    distinct features. However, it doesn't disclose that it may not have
    the latest pricing information.
    Issues: concealed_limitation
    Suggestion: Acknowledge that pricing details should be verified on
    the official website.

  --- user_impact (score: 9.0) ---
    Explanation: Directly addresses the user's pricing question with a
    well-structured comparison. The follow-up question adds value.
  ```
</Accordion>
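
For auditing, it can help to flatten a deep result into a list of (dimension, issue) pairs. A minimal sketch follows, assuming each per-dimension object exposes `score` and `issues` as in `deep_eval.py` above. `collect_issues` is a hypothetical helper, and `SimpleNamespace` objects stand in for the SDK's result types so the example runs without an API key.

```python
from types import SimpleNamespace


def collect_issues(dimension_scores: dict) -> list:
    """Flatten per-dimension issue tags into (dimension, issue) pairs."""
    pairs = []
    for dim_name, detail in dimension_scores.items():
        # `issues` may be empty or missing on dimensions with no findings.
        for issue in getattr(detail, "issues", None) or []:
            pairs.append((dim_name, issue))
    return pairs


# Stand-in data mirroring the example output above.
fake_scores = {
    "reliability": SimpleNamespace(score=7.8, issues=["overconfident_claim"]),
    "transparency": SimpleNamespace(score=8.0, issues=["concealed_limitation"]),
    "user_impact": SimpleNamespace(score=9.0, issues=[]),
}

for dim_name, issue in collect_issues(fake_scores):
    print(f"{dim_name}: {issue}")
```

Against a real deep result, the call would be `collect_issues(result.dimension_scores)`.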

### Basic vs Deep

|                     | Basic                         | Deep                                   |
| ------------------- | ----------------------------- | -------------------------------------- |
| **Scores**          | Overall + 8 dimensions        | Overall + 8 dimensions                 |
| **Explanations**    | No                            | Yes, per dimension                     |
| **Issue detection** | No                            | Yes                                    |
| **Best for**        | High-volume, real-time checks | Debugging, auditing, post-hoc analysis |

<Tip>
  Use basic mode for every response in production, and deep mode selectively. For example, trigger deep mode when a basic score drops below your threshold, or as a periodic audit on a sample of responses.
</Tip>
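
The escalation pattern in the tip can be sketched as follows. The function takes the evaluator as a parameter, so it can be passed `rail.eval` from earlier. `evaluate_with_escalation`, the 7.0 threshold, and the stub evaluator are hypothetical and exist only to make the example runnable offline.

```python
from types import SimpleNamespace
from typing import Callable


def evaluate_with_escalation(content: str, evaluate: Callable, threshold: float = 7.0):
    """Run a basic evaluation; re-run in deep mode only if the score is low."""
    result = evaluate(content=content, mode="basic")
    if result.rail_score.score < threshold:
        # Low score: run the deep evaluation to learn why.
        result = evaluate(content=content, mode="deep")
    return result


# Stub evaluator standing in for rail.eval; it records which modes were requested.
calls = []


def stub_eval(content: str, mode: str):
    calls.append(mode)
    return SimpleNamespace(rail_score=SimpleNamespace(score=6.2), mode=mode)


result = evaluate_with_escalation("Some low-scoring reply", stub_eval)
print(calls)  # ['basic', 'deep']
```

In the tutorial's setup, the equivalent call would be `evaluate_with_escalation(reply, rail.eval)`, with the threshold tuned to your own quality bar.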

## What's next

<Card title="Part 2: Production Features" icon="rocket" href="/use-cases/ai-chatbot-production">
  Provider wrappers, policy enforcement (block/regenerate), multi-turn session tracking, and Langfuse observability.
</Card>
