Behavioural Testing of ML Models Using ‘Behave’

Michael C. J. Kao
4 min read · Jan 4, 2023


Machine learning models make unexpected predictions or silly mistakes all the time.

Intro

Stakeholders often question why I think our model is working, given the number of AI failures floating around the internet today.

This often leads to a crash course in machine learning: explaining why our model is different, how we measure success, and how we create guard rails. Yet I often reflect and think deeply about the question:

Does the model really behave the way I expect it to? Are there exceptions?

In addition, since we retrain our model regularly, manually inspecting it before each deployment is very time-consuming. If we could automate testing our expectations of the model, we would free up time for more value-adding tasks.

I struggled to find an answer until I read the innovative paper Beyond Accuracy: Behavioral Testing of NLP Models with CheckList, which introduced me to behavioural tests and offered a solution to my question.

What is a Behavioural Test?

Standard software tests such as unit or integration tests are forms of white-box testing: we know the internal workings of the code, and the tests ensure the logic has been implemented correctly.

Behavioural testing, on the other hand, is a form of black-box testing: the internal workings of the system (the model) are not entirely clear to us, but we still want to ensure that it behaves as expected.

The Gherkin language

Behavioural tests are often written in a domain-specific language known as Gherkin. A test case usually consists of the Given, When and Then keywords, as in the example below:

Scenario: Make sure no salmon steak in the river
Given An image generation model
When A prompt for "salmon in the river" is requested
Then The result should not contain "salmon steak"

The design of the Gherkin language allows test cases to be written by a broad range of stakeholders, such as QA or product, while the developers implement the corresponding logic.

Behavioural Testing of an ML Model Using Behave

Behave is a framework for implementing behavioural tests in Python.
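
Behave is published on PyPI, so installing it is a one-liner:

pip install behave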

Using the Boston housing data as an example, a model is trained to predict the median housing price given the features provided.
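
For context, here is a minimal sketch of how such a model and its artifacts might be produced. The constant module and the ARTIFACT_DIR, TRAIN_DATA_OUTPUT_NAME and MODEL_OUTPUT_NAME names mirror the test code later in this post; the boston.csv file name and the medv target column are assumptions you would adapt to your own setup.

import os
import pandas as pd
from catboost import CatBoostRegressor
from constant import (
    ARTIFACT_DIR,
    TRAIN_DATA_OUTPUT_NAME,
    MODEL_OUTPUT_NAME
)

# Load the Boston housing data; the file name and the 'medv' target
# column are assumptions.
df = pd.read_csv("boston.csv")
X = df.drop(columns=["medv"])
y = df["medv"]

# Fit a simple CatBoost regressor on the features.
model = CatBoostRegressor(verbose=False)
model.fit(X, y)

# Persist the feature frame and the model so the behavioural test can
# load them later.
os.makedirs(ARTIFACT_DIR, exist_ok=True)
X.to_csv(os.path.join(ARTIFACT_DIR, TRAIN_DATA_OUTPUT_NAME), index=False)
model.save_model(os.path.join(ARTIFACT_DIR, MODEL_OUTPUT_NAME))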

Based on domain expertise or experience, we know that the greater the “number of rooms per dwelling”, the greater the median price.

To ensure this relationship holds, we create a specific type of behavioural test known as a “Directional Expectation Test”, which specifies the expected relationship between a feature and the predicted value.

Two files are required in Behave to create a test (a typical project layout is sketched after this list):

  • A .feature file containing the test scenario written in Gherkin.
  • A .py file implementing the logic.
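
By default, Behave looks for .feature files in a features/ directory and for step implementations in features/steps/. A minimal layout for this example could therefore look like the following (the individual file names are just illustrative):

features/
    room_number.feature      # the Gherkin scenario (step 1 below)
    steps/
        room_number_steps.py # the step implementations (step 2 below)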

1. Add the feature file

Feature: Test the expected relationship between room number and predicted median value.

Scenario: Test the relationship is positive
Given A trained model with training data
When We perturbate the room number per dwelling
Then The change in predicted median value should be positively correlated with the change.

2. Add the corresponding test logic

import os
from behave import *
import pandas as pd
import numpy as np
from catboost import CatBoostRegressor
from constant import (
    ARTIFACT_DIR,
    TRAIN_DATA_OUTPUT_NAME,
    MODEL_OUTPUT_NAME
)


@given('A trained model with training data')
def step_impl(context):
    """Load the trained model and dataset and add them to the context
    for later use.
    """
    # load the data
    df = pd.read_csv(os.path.join(ARTIFACT_DIR, TRAIN_DATA_OUTPUT_NAME))

    # load the model
    model = CatBoostRegressor()
    model.load_model(os.path.join(ARTIFACT_DIR, MODEL_OUTPUT_NAME))

    # add to context
    context.model = model
    context.df = df


@when('We perturbate the room number per dwelling')
def step_impl(context):
    """Add one to 'rm'; the perturbation can be random, but the test
    would then have to be changed.
    """
    perturbated_df = context.df.copy()
    perturbated_df['rm'] += 1

    context.perturbated_df = perturbated_df


@then('The change in predicted median value should be positively correlated with the change.')
def step_impl(context):
    """Since we have added 1 to 'rm', by and large most of the
    predicted median values should have increased.
    Given there may be cases where an increase in 'rm' does not
    increase the median value, we set the test to pass if 90% of the
    predicted values have increased.
    """
    original_prediction = context.model.predict(context.df)
    perturbated_prediction = context.model.predict(context.perturbated_df)
    pct_median_value_increased = np.mean(perturbated_prediction >= original_prediction)
    assert pct_median_value_increased >= 0.9

Note that the text in each decorator must exactly match the step text in the .feature file.
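
If the two drift apart, Behave reports the step as undefined rather than failing the assertion. One way to keep them aligned, and to make the size of the perturbation explicit in the scenario, is to use Behave's default parse-style placeholders; the step below is a hypothetical variant of the @when step above:

@when('We add {delta:d} to the room number per dwelling')
def step_impl(context, delta):
    """Perturb 'rm' by the amount written in the .feature file,
    e.g. "When We add 1 to the room number per dwelling".
    """
    perturbated_df = context.df.copy()
    perturbated_df['rm'] += delta
    context.perturbated_df = perturbated_df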

3. Run the test

To run the test, simply type behave in the CLI.

The process is similar to running unit tests: individual scenarios are executed, and failures are reported for any test that does not pass.
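
A few standard Behave invocations cover most day-to-day use:

behave                                          # run everything under ./features
behave --stop                                   # stop at the first failing scenario
behave -n "Test the relationship is positive"   # run a single scenario by name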

Please refer to the repository for the full code.

Aftermath

Since we started implementing behavioural tests for our models, we have been able to fully automate our deployments with high confidence and without manual verification. In addition, we can simply hand the Gherkin files to the business so they have a clear understanding of the model's behaviour and can propose new scenarios where suited.

Confidence in our model improved, time spent on mundane deployment and debugging tasks went down, and collaboration with the business got better.
