
How to Catch an LLM by the Tail: Effective Testing Strategies for AI Models

  • Writer: MinglesAI
  • Nov 26, 2024
  • 12 min read

How do you ensure your LLM won't spout complete nonsense at the worst possible moment? How can you verify that it truly understands context, rather than just generating beautiful but meaningless phrases? And most importantly, how can you do this efficiently without spending weeks manually checking thousands of responses?



Introduction

Technology is evolving at a breakneck pace. Traditional development methods are taking a back seat, while Large Language Models (LLMs) and their applications are stepping into the spotlight.

LLMs are already transforming many areas of life: medicine, education, business. These models are becoming more powerful and smarter every day.

But we face new challenges. How do we properly develop such systems? How do we test them to ensure they work reliably? To tackle this, we need new approaches and a deep understanding of LLMs.

Let's explore how these models work and what the future holds for us.

There are three main pillars of LLM testing:

  1. User feedback: The most obvious, but also the most delayed, way to find out whether your application works. After all, feedback only arrives after release. And what if the errors stay hidden until production?

  2. Manual testing: A labor-intensive task that can exhaust even the most energetic team. You probably already know its downsides in practice: subjectivity and, of course, human error.

  3. Automated testing: A safe harbor for all LLM developers. However, there's a catch — it's still evolving, just like LLMs themselves, and always requires fresh approaches.

To test LLMs automatically, you can use different approaches:

  1. Classic tests

  2. Ready-made metrics

  3. Custom metrics

Classic Tests

When we first dived into developing LLM applications, our adventure began with exploring Pydantic classes in Python. Yes, Pydantic is like a magic pill for turning the chaos of model responses into something more predictable and deterministic. With a Pydantic class, you can define the full set of attributes you expect to see in a response:

  • Name

  • Type

  • Required or not

  • Description: a brief note on why this attribute is needed

  • Default value: if the attribute is optional

Here's what one of our classic examples looks like:

from pydantic import BaseModel, Field

class Appointment(BaseModel):
    time: str = Field(..., description="Date and time of meeting in format YYYY-MM-DDTHH:MM:SS")
    duration: int = Field(description="Duration of the meeting in minutes", default=60)
    title: str = Field(..., description="Meeting title in format: Interview [Name Surname] (of candidate)")
    description: str = Field(..., description="Link to the meeting in format: meeting/[name_surname] (of candidate)")

What do we see here? Everything inside this class acts as instructions to the model about what it should aim for. Moreover, you can attach a validator that will double-check whether the model's response meets all the requirements for keys and types.
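
For example, here is a minimal sketch (assuming Pydantic v2, and shortened to a single field for brevity) of what such a validator might look like for the time field of the Appointment class above:

from datetime import datetime

from pydantic import BaseModel, Field, field_validator


class Appointment(BaseModel):
    time: str = Field(..., description="Date and time of meeting in format YYYY-MM-DDTHH:MM:SS")

    @field_validator("time")
    @classmethod
    def check_time_format(cls, value: str) -> str:
        # Raises a ValidationError if the model returned a timestamp in the wrong format
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
        return value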

Now about tests. For regular assert tests to confirm that the model's response is correct, you need to know in advance exactly what response is expected. For example:

Imagine you want the model to figure out how many animals grandma has in the tale of three geese. Then you'd describe the model's response like this:

class ModelOutput(BaseModel):
    pets: int = Field(..., description='Number of pets')

Let's say you give the model the input 'Grandma had three merry geese', then your assert would look like this:

assert actual_output == ModelOutput(pets=3)

Sounds simple, right? This way you can evaluate not only numbers but also strings, as long as you have an idea of what the pattern should be. Regular expressions are a great way to handle checks for specific patterns (see the sketch below).
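
A quick illustration, using a hypothetical model output for the title field of the Appointment class from earlier:

import re

# Hypothetical title produced by the model for the Appointment class above
actual_title = "Interview John Smith"

# The title must follow the pattern "Interview [Name Surname]"
assert re.fullmatch(r"Interview [A-Z][a-z]+ [A-Z][a-z]+", actual_title)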

This type of testing is a real lifesaver when starting to work with LLMs. But remember: success here depends on your skill in writing clear and detailed instructions for your tasks.



Testing with Basic Metrics

Basic Metrics

Although LLM testing is still essentially a new field, there's already a whole arsenal of metrics that help figure out how well your model handles tasks. Why text metrics specifically? Because text is the foundation of LLM responses. Here are the most basic metrics we've dealt with:

  • Summarization

  • Response relevance

  • Truthfulness

  • Hallucination

In fact, there's a huge variety of metrics for evaluating text, and if desired, you can find something for any task using ready-made resources. But here's the catch: how do you incorporate them into tests? For this, we found two main frameworks:

  • DeepEval

  • LangSmith Testing

These frameworks are your faithful assistants in LLM testing. Each has its own features and is suitable for different needs.

LangSmith Testing

LangChain, with its huge community, is perhaps the most popular framework for building LLM applications. It became our team's main tool for creating language chains, so we started testing with it as well. But we must admit that things turned out to be less easy and convenient than we had hoped.

First, let's talk about the advantages we discovered:

  • Ability to create datasets for specific tasks

  • Ease of writing experiments for specific datasets

  • All experiments and datasets are saved, allowing you to track changes over time

  • If you register your chains and agents in LangSmith, you can link a dataset and experiment to a specific chain

  • Ability to version datasets - allows testing all saved versions of your application

Overall, it's a decent testing tool if your tasks aren't too specific. However, we encountered a number of difficulties that we decided not to sidestep.

  1. The first problem is the inability to fully manage datasets from code. You can create datasets and add examples to them, but you can't programmatically delete them or individual elements. This means that if your team works from several accounts, each member has to manually create or delete datasets before running an experiment. If the dataset already exists, all its elements are created anew when the experiment is run, and the statistics get skewed.

  2. The next problem is lack of asynchronous support. When creating an experiment:

from langsmith.evaluation import LangChainStringEvaluator, evaluate

evaluators = [
    LangChainStringEvaluator("cot_qa")
]

results = evaluate(
    chain.invoke,
    data=data,
    evaluators=evaluators,
    experiment_prefix=experiment_prefix,
)

We found that in the evaluate method, you need to explicitly call the chain via invoke. But what if your chain is asynchronous? Changing asynchronicity to synchronicity for the sake of tests is not ideal. We had to write a special wrapper to wrap ainvoke in a synchronous function. But even here we ran into adventures. LangChainStringEvaluator creates its own event loop, which needs to be intercepted to launch the chain.

Here's how we managed to implement our asynchronous chain:

import asyncio

import nest_asyncio

nest_asyncio.apply()
event_loop = asyncio.get_event_loop()

def sync_invoke_wrapper(inputs):
    try:
        # Run the async chain to completion inside the intercepted event loop
        result = event_loop.run_until_complete(chain.ainvoke(inputs))
        return result
    except Exception as error:
        raise error

# sync_invoke_wrapper can then be passed to evaluate() in place of chain.invoke

  3. In LangSmith, the set of metrics is quite limited. The full list can be found in the documentation, but in our opinion, these metrics are not enough to test applications from a business perspective. You can test the quality of the text, but it looks more like a linguistic assessment and nothing more.

If the presented metrics are enough for you, then this method is quite workable. However, our team decided not to use this product at this stage of its development.

DeepEval Testing

Although DeepEval is not as popular as LangSmith, it became our choice in LLM testing.

Here are the advantages we found in this framework:

  1. Dataset support. DeepEval, like LangSmith, allows you to create datasets. But most importantly, in DeepEval you can fully manage them right from the code. All the problems we talked about in the context of LangSmith simply disappear here.

  2. Abundance of ready-made metrics. DeepEval has many ready-made metrics, which sets it apart favorably. In LangSmith, metrics mainly evaluate text from a linguistic point of view, which is not always suitable for business tasks. DeepEval metrics are more focused on assessing the usefulness of the response in specific cases.

    However, they work correctly only in classic scenarios. For example, a metric for summarization will adequately evaluate only traditional brief summaries. If analysis or something specific is added to the summary, the metric may behave unstably or not work at all.

  3. Ease of writing test cases. In DeepEval, it's easy to write test cases, as you can simply borrow them from the documentation for a specific metric. However, pay attention to which parameters each metric accepts. To understand exactly what to pass, it's worth studying the description of how the metric works. To achieve the most accurate result, you need to provide the test case with full context: if your model answers a question using information from somewhere other than the user prompt, that information must also be passed to the test case. Otherwise, the answer will be evaluated as containing unnecessary information (see the sketch below).
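
A minimal sketch of what that looks like, with a hypothetical question-answer pair and DeepEval's hallucination metric, which requires the context the model relied on:

from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# Hypothetical question, answer, and the context the model relied on
test_case = LLMTestCase(
    input="When is the interview with John Smith?",
    actual_output="The interview is scheduled for 2024-11-26 at 10:00.",
    # Without this context, the date would look like hallucinated or unnecessary information
    context=["Interview John Smith: 2024-11-26T10:00:00, duration 60 minutes"],
)

assert_test(test_case, [HallucinationMetric()])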

If you're getting unexpectedly low scores when using DeepEval metrics, try the following:

  • Carefully study the reason that the metric gives along with the score.

  • If the reason indicates a lack of context, try changing or adding parameters.

  • If the context is complete, but the score still doesn't meet expectations, try using a custom GEval metric.



A More Technical Comparison of the Two Frameworks

(we've done all the Googling for you)

Feature: Production Monitoring and Debugging

  • LangSmith: Comprehensive support for production monitoring and data visualization for LLM evaluation; lets you track logs and add test data

  • DeepEval: Doesn't focus on production monitoring, but integrates with existing frameworks for local unit testing and model analysis

Feature: Dataset Creation and Storage

  • LangSmith: Supports creating and storing test datasets, allowing examples to be added from real application usage

  • DeepEval: Works with test datasets for unit testing and hyperparameter optimization, but focuses more on checking specific cases than on storage

Feature: Evaluation Metrics

  • LangSmith: Integrates with Ragas for faithfulness, answer relevance, and context metrics; helps explain the results of these metrics and makes them reproducible

  • DeepEval: Provides a set of metrics for evaluating LLM responses, with a focus on flexibility and integration with popular ML frameworks for precise evaluation

Feature: Integration with Other Systems

  • LangSmith: Integrates well with LangChain for debugging and monitoring chains, including support for multimodal tasks and complex QA pipelines

  • DeepEval: Open integration with popular LLM and ML frameworks, allowing flexible connection of various sources for running evaluations and tests

Feature: Automated Testing

  • LangSmith: Focuses on continuous automated testing in production using the platform

  • DeepEval: Supports automated testing of models and their configurations, but is more geared towards offline checks and optimization during development

The choice is yours!

Testing with Custom Metrics

GEval Metrics

While working with LLMs, we realized that testing models from a business value perspective is not such an obvious thing. Classic tests and ready-made metrics don't always fit.

Here's a simple scenario, for example: checking the language of the model's response. If the model is given a system prompt and context in English, and the user's question can be in any language, there's a chance the model will respond in English, even with explicit instructions to "respond in the user's language". Visually, this is easy to check, but what if there are thousands of requests? This is where custom GEval metrics come to the rescue.

What exactly is GEval?

GEval is a metric from the DeepEval set.

It evaluates the model's response according to criteria you set, giving a score from 0 to 1. Unlike ready-made metrics where criteria are already written, here you can set your own evaluation logic. Want to check the language of the response? Or the number of sentences? Maybe the style and tone of communication? GEval will give you that opportunity.

How to create such a metric?

To create your metric, implement it through the GEval class:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

GEval(
    name="Your metric name",
    criteria="Evaluation criteria",
    evaluation_steps=[
        "Step 1",
        "Step 2",
        "Step 3"
    ],
    verbose_mode=True,
    threshold=0.7,
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)

You can use either criteria or steps:

  • Criteria: the model itself will outline evaluation steps based on the criteria.

  • Steps: you specify sequential actions that the model will perform to give a score. This is a more detailed and sequential approach.



What to choose?

When we first started using GEval, we used steps. We thought it was a cleaner and more sequential approach. BUT! We weren't quite right. After studying the official GEval article in detail, we realized that steps aren't as simple to write as they seem. Steps are not just a sequence of actions, but clear instructions for the model that force it to act within the Chain of Thought framework.

Therefore, here's how we recommend working with GEval:

  • Start with criteria

  • Look at how the model interprets your requirements in the logs, and adjust the criteria if needed

  • When you see steps that perfectly fit your case, simply copy them and paste them into evaluation_steps

This approach will give you the greatest stability for your metric, as steps won't be generated anew with each run.

How to write good criteria?

  1. Define the boundaries of the metric: What exactly should be evaluated? Don't mix checking several parameters in one metric.

  2. Define the ideal answer: List point by point what answer you expect. For example, for a recipe: list of ingredients, preparation, cooking steps, serving options.

  3. Write the criteria. The criteria should describe what the reference answer should be like. For example: "The answer is written in English. The answer is coherent, sequential, and contains no unnecessary sentences."
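
Putting these rules together, here is a rough sketch of a single-purpose, criteria-based metric for the "respond in the user's language" scenario mentioned earlier (the wording of the criteria is just an illustration):

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Hypothetical custom metric: checks only one thing - the language of the response
language_metric = GEval(
    name="Response language",
    criteria=(
        "Determine the language of the user's input and check that the actual output "
        "is written in the same language. Ignore code snippets and proper nouns."
    ),
    threshold=0.7,
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)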

What parameters to pass to the metric?

Every metric necessarily takes an input and an output to establish the question-answer relationship. Additional parameters may include context, expected response, and so on, depending on your case. This is needed to give the metric full evaluation context.

What is Threshold and how to set it

Threshold is the minimum metric score that is considered successful when passing the test. Simply put, it's a threshold value from 0 to 1, where closer to 1 means the model's answer better meets the given criteria.

To be honest, we're still figuring out these values ourselves. The thing is, one unstable model is evaluating another unstable model. So even if your answer is perfect, don't expect it to always get that coveted 1.0.

So how do we deal with this threshold?

The model, in addition to the score, always gives an explanation (reason) where the pros and cons of the answer are listed. If there are no indications of shortcomings in the reason, then the score can be considered positive.

Here's how we determine the threshold:

  1. Create a deliberately bad answer

  2. Pass it through the metric and analyze the score and reason

  3. Gradually improve the bad answer, watching the score increase

  4. Once the answer reaches the minimum acceptable level for your case, record its score

This score will become your approximate threshold.
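
Here's a minimal sketch of that calibration loop, with a hypothetical recipe metric and hand-written answers purely for illustration:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Hypothetical metric used only to illustrate threshold calibration
metric = GEval(
    name="Recipe completeness",
    criteria="The answer is a complete recipe: ingredients, preparation, cooking steps, serving options.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

# Start with a deliberately bad answer, then gradually improve it
answers = [
    "Just boil something.",
    "Boil pasta for 10 minutes and add tomato sauce.",
    "Ingredients: pasta, tomato sauce, basil. Boil the pasta for 10 minutes, mix with the sauce, serve topped with basil.",
]

for answer in answers:
    metric.measure(LLMTestCase(input="Give me a simple pasta recipe", actual_output=answer))
    # The score of the minimally acceptable answer becomes your approximate threshold
    print(metric.score, metric.reason)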

We haven't yet managed to develop a clear understanding of how exactly the framework creators intended to work with the threshold. Perhaps this information will appear in the documentation over time. For now, we're going through trial and error and learning from our own experience.



How to Choose an Approach

Choosing an approach is perhaps the most difficult and important task when testing LLMs. Here's what we usually do to decide:

  1. Review the acceptance criteria: We go back to the main criteria developed at the product or story planning stage.

  2. Write out specific requirements: We determine what standards the model's response, chain, or agent should meet.

  3. Write out ways to check each requirement: For example, if the requirements for the answer are:

    • json with keys name, age, hobbies

    • name should be in nominative case with a capital letter

    • age should be a positive integer

    • hobbies should be a list of strings

    • all items should be truthful and correspond to the input text

    • hobbies should include all those listed in the text

    • key contents should be in the language of the request

    Then the ways to test each point will be:

    • json with keys name, age, hobbies - simple object matching test

    • name should be in nominative case with a capital letter - custom metric

    • age should be a positive integer - simple type test

    • hobbies should be a list of strings - simple type test

    • all items should be truthful and correspond to the input text - hallucination

    • hobbies should include all those listed in the text - custom metric

    • key contents should be in the language of the request - custom metric

  4. After this, we proceed to implement all necessary tests, following the rules described in the previous sections.
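
For the simpler requirements from the list above, a hedged sketch of the classic tests might look like this (the Person model, the chain, and the input text are hypothetical):

from pydantic import BaseModel, Field

# Hypothetical response model for the name / age / hobbies requirements above
class Person(BaseModel):
    name: str = Field(..., description="Name in nominative case, capitalized")
    age: int = Field(..., description="Age as a positive integer")
    hobbies: list[str] = Field(..., description="List of hobbies as strings")

def test_person_simple_requirements():
    # `chain` is assumed to be an LLM chain returning a Person object
    result: Person = chain.invoke({"text": "My name is Anna, I am 30, and I love hiking and chess."})
    # json with keys name, age, hobbies - covered by Pydantic parsing itself
    assert isinstance(result.age, int) and result.age > 0
    assert isinstance(result.hobbies, list) and all(isinstance(h, str) for h in result.hobbies)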

Analysis of a Real Example

We want to develop a chain that will:

  • Take a children's fairy tale as input

  • Analyze the fairy tale

  • Return to us the name of the main character and their description

So, let's start with the criteria we expect from the answer:

  1. The answer includes a description of the character's personality

  2. The personality description is written by the model based on analysis of the fairy tale, it's not necessarily present in the fairy tale itself verbatim

  3. The answer includes the character's name

  4. The character's name is written with a capital letter in the nominative case

We wrote a prompt for the model and composed a chain. First, let's write a class for the model's response format:

class Character(BaseModel):
    name: str = Field(..., description="Character's name in nom. case and capitalized")
    description: str = Field(..., description="Character description")

Now let's decide which methods we'll use to evaluate each response criterion:

  1. classic test for type and non-empty value

  2. custom metric

  3. classic test for type and non-empty value

  4. classic test for value matching

Let's write tests for each criterion:

from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

fairy_tale = "There is a goose named Ivan. Ivan is a very communicative goose"


#Check that NAME and DESCRIPTION are present and not empty
def test_output():
    result: Character = chain.invoke({'fairytale': fairy_tale})
    assert result.name
    assert result.description


#Check that NAME and DESCRIPTION are strings
def test_output_types():
    result: Character = chain.invoke({'fairytale': fairy_tale})
    assert isinstance(result.name, str)
    assert isinstance(result.description, str)


#Check that the character description is consistent with the fairy tale
def test_character_description():
    result: Character = chain.invoke({'fairytale': fairy_tale})
    description = result.description

    metric = GEval(
        name="Character description corectness",
        criteria="Check that the character description is correct and matches the text of the fairy tale",
        verbose_mode=True,
        threshold=0.7,
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
        ],
    )

    assert_test(
        test_case=LLMTestCase(
            input=f"Fairy tail: {fairy_tale}",
            actual_output=description
        ),
        metrics=[metric]
    )

#Check that name is correct and in nom. case and capitalized
def test_name():
    result: Character = chain.invoke({'fairytale': fairy_tale})
    name = result.name
    assert name == 'Ivan'

The threshold here is set to 0.7 just as an example. In real tests, use the logic we described above to set it, or try your own approach!

That's it! Now your applications are reliably protected by tests that can be updated and extended, and you can always keep an eye on them staying green!



If You Want to Dive Deeper

Official framework documentation: LangSmith testing, DeepEval

GEval documentation: GEval


 
 
 


