Adversarial Validation for Data Comparison

Michael C. J. Kao
Feb 6, 2022

Data Comparison

As part of a data scientist’s role, data comparison is a task that we perform on a regular basis, whether it is to detect drift in your data, to ensure that your training data has the same distribution as the test set, or to validate a data migration. It is not only time consuming and tedious; it can also lower morale and sap the energy out of your team.

Our team has recently started experimenting with Flyte as an alternative to Airflow for scheduling our pipelines. To ensure that the data migration was successful, validation had to be performed on all of the datasets and features at every single stage of the implementation.

Even for an MVP project, we had more than a dozen datasets, often with more than 100 features each. It didn’t take us long to figure out that this would be the most time consuming and soul-grinding aspect of the project, and that we had to find a more efficient approach to speed up our delivery.

Problem with Standard Approach

We initially proceeded with simple validation methods, such as comparing the percentage difference between the five-number summaries of each feature, and we quickly ran into multiple obstacles.

First, the number of comparisons was large (number of features × number of datasets × 5); going through hundreds of statistics is certainly not the best use of anyone’s time.

Secondly, it was difficult to determine a cutoff percentage: should we use 1%, 3%, 5%, or more? Furthermore, percentages are heavily influenced by their base: the difference between 0.01 and 0.012 is 20%, but depending on the context, this difference may be negligible.
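To make the scale of the problem concrete, here is a hypothetical sketch of this naive check (the helper names, the 5% cutoff, and the simulated data are illustrative assumptions, not our actual pipeline): even when two datasets are drawn from the same distribution, a large fraction of the features × 5 comparisons ends up flagged, largely because of near-zero bases such as the median.

    # Hypothetical sketch of the naive five-number-summary comparison.
    import numpy as np
    import pandas as pd

    def five_number_summary(df: pd.DataFrame) -> pd.DataFrame:
        # min, Q1, median, Q3, max for every feature
        return df.quantile([0.0, 0.25, 0.5, 0.75, 1.0])

    def flag_differences(a: pd.DataFrame, b: pd.DataFrame, cutoff: float = 0.05) -> pd.DataFrame:
        base = five_number_summary(a)
        pct_diff = (base - five_number_summary(b)).abs() / base.abs()
        return pct_diff > cutoff  # one boolean per (statistic, feature) pair

    rng = np.random.default_rng(0)
    a = pd.DataFrame(rng.normal(size=(1_000, 100)))  # 100 features, same distribution
    b = pd.DataFrame(rng.normal(size=(1_000, 100)))
    flags = flag_differences(a, b)
    print(flags.values.sum(), "of", flags.size, "comparisons flagged")  # many false alarms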

We also tested other metrics, such as the Kullback-Leibler divergence, and other approaches, such as the Kolmogorov-Smirnov test, but the recurring problem with all of them was that additional investigation was required to determine whether the difference was real.
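As a toy illustration of that recurring problem (an assumed example using SciPy’s ks_2samp, not code from our pipeline): with enough rows, even a practically negligible shift comes back as overwhelmingly significant, so someone still has to investigate whether the difference actually matters.

    # Toy example: a tiny shift on a unit-variance feature is flagged as highly
    # significant by the two-sample Kolmogorov-Smirnov test at large sample sizes.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    a = rng.normal(loc=0.00, scale=1.0, size=200_000)
    b = rng.normal(loc=0.05, scale=1.0, size=200_000)

    result = ks_2samp(a, b)
    print(f"KS statistic: {result.statistic:.4f}, p-value: {result.pvalue:.2e}")
    # The p-value is effectively zero, so the test says "different" -- but whether
    # a 0.05 shift is worth worrying about still requires manual investigation.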

This additional step was the swamp that dragged the project down.

Enter Adversarial Validation

The idea of adversarial validation is simple:

If a model can predict the origin of the dataset, then the distributions of the two datasets are different.

Assume we have two datasets, A and B, that we want to compare:

  1. [Sample] Create a training set by taking a sample from A and a sample from B; set the target to 1 if the record came from A and 0 if it came from B. Concatenate the remaining data from A and B to form the test set, setting the respective targets the same way.
  2. [Train] Train a binary classifier on the training data and predict the outcome on the test set.
  3. [Decide] Calculate a metric such as AUC or accuracy on the predictions. If the AUC or accuracy is above a pre-determined threshold (>> 0.5), then the two datasets are deemed different.

Instead of going over hundreds of statistics and then figuring out whether the difference was real for each feature, we boil the comparison down to a single number, and if the AUC is >> 0.5, we can use the model’s feature importances to quickly identify the violating features.

Here is a code snippet to perform adversarial validation on everyone’s favourite dataset.
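The snippet below is a minimal sketch of the three steps using scikit-learn; the choice of the Iris dataset as a stand-in for datasets A and B, the injected drift, and the depth-3 gradient boosting classifier are illustrative assumptions rather than the exact code from our pipeline.

    # Minimal adversarial validation sketch (illustrative assumptions: Iris as a
    # stand-in for datasets A and B, an injected drift, and a depth-3 classifier).
    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_iris
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Pretend A is the original data and B is a migrated copy with a drifted feature.
    A = load_iris(as_frame=True).data
    B = A.copy()
    B["petal length (cm)"] *= 1.2  # inject a drift so the two datasets differ

    # [Sample] Label the origin of each record: 1 if it came from A, 0 if from B.
    data = pd.concat([A, B], ignore_index=True)
    target = np.concatenate([np.ones(len(A)), np.zeros(len(B))])
    X_train, X_test, y_train, y_test = train_test_split(
        data, target, test_size=0.5, random_state=42, stratify=target
    )

    # [Train] Fit a simple binary classifier to predict the origin.
    model = GradientBoostingClassifier(max_depth=3, random_state=42)
    model.fit(X_train, y_train)

    # [Decide] An AUC well above 0.5 means the two datasets are distinguishable.
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"Adversarial AUC: {auc:.3f}")

    # Feature importances point at the violating feature(s).
    importances = pd.Series(model.feature_importances_, index=data.columns)
    print(importances.sort_values(ascending=False))

When the two datasets truly come from the same distribution, the AUC hovers around 0.5; removing the injected drift above is an easy way to sanity-check that.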

Conclusion

Our validation was significantly sped up after the adoption of adversarial validation.

The technique is simple and robust: no additional effort was required to investigate whether a difference was real, and identifying the key violators was a breeze.

Nothing is free, of course; the main drawback is that you have to train a model. However, this model doesn’t need to be fine-tuned for good performance, since it suffices for the AUC to be >> 0.5. This also means the model can be kept simple (a tree depth of 3 in our case), which makes it extremely fast to train.

Reference:

https://arxiv.org/abs/2004.03045
