Published May 9th, 2023
It’s time to be training robots to detect suspected fraud for us.
Many have explored the possible applications of generative AI in recent weeks. One major topic that has not been fully explored, however, is how data created by generative AI could augment and improve fraud detection strategies, and what the implications are of using synthetic data to train fraud models and improve detection rates.
It is well known in data science circles that the quality of data presented to a machine learning model makes or breaks the end result, and this is particularly true for fraud detection. Many machine learning tools applied to fraud detection rely on a strong fraud signal, yet fraud typically makes up less than 0.5% of the data, which makes any model difficult to train effectively. In an ideal data science exercise, the data used to train any AI model would contain a 50/50 mix of fraud and non-fraud samples, but this is tricky to achieve and may be unrealistic for many. Whilst there are many methods for dealing with this class imbalance, such as clustering, filtering or over-sampling, they don’t fully make up for an extreme imbalance between genuine and fraudulent records.
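To make the imbalance concrete, here is a minimal sketch of random over-sampling, one of the methods mentioned above. The function name `random_oversample`, the toy records and the 0.5% fraud rate are illustrative assumptions, not a production recipe.

```python
import random
from collections import Counter

def random_oversample(records, labels, target_fraud_share=0.5, seed=0):
    """Duplicate fraud records (label 1) at random until they make up
    roughly `target_fraud_share` of the returned data set."""
    rng = random.Random(seed)
    fraud = [r for r, y in zip(records, labels) if y == 1]
    genuine = [r for r, y in zip(records, labels) if y == 0]
    if not fraud:
        raise ValueError("no fraud records to over-sample")
    # Fraud rows needed so that fraud / (fraud + genuine) equals the target share.
    needed = round(target_fraud_share * len(genuine) / (1 - target_fraud_share))
    extra = [rng.choice(fraud) for _ in range(max(0, needed - len(fraud)))]
    out_records = genuine + fraud + extra
    out_labels = [0] * len(genuine) + [1] * (len(fraud) + len(extra))
    return out_records, out_labels

# Toy data (hypothetical): 1,000 payments, of which 5 (0.5%) are fraud.
records = [{"id": i} for i in range(1000)]
labels = [1 if i < 5 else 0 for i in range(1000)]

balanced_records, balanced_labels = random_oversample(records, labels)
print(Counter(balanced_labels))
```

The classes are now balanced, but every one of the 995 fraud rows is a copy of one of the same five records. The model sees more fraud, not more kinds of fraud, which is why over-sampling alone does not make up for an extreme imbalance.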
What is generative AI and how is it used?
Generative AI models, built on transformer deep neural networks such as those behind OpenAI’s ChatGPT, are designed to produce sequences of data as output and must be trained using sequential data, such as sentences or payment histories. This differs from many other methods, which produce a single ‘classification’ (fraud/not fraud) from the presented input and whose training data can be presented to the model in any order. A generative AI’s output can continue indefinitely, whilst classification methods tend to produce single outputs.
Generative AI is therefore the ideal tool for synthetically generating data that is based on real data, and the evolution of this technology will have important applications in the fraud detection domain, where, as previously highlighted, the number of viable fraud samples is very low and difficult for a machine learning model to learn from effectively. With generative AI, a model can use existing patterns to generate new, synthetic samples that resemble ‘real’ fraud samples, boosting the fraud signal for core fraud detection machine learning tools.
A typical fraud signal is a combination of genuine and fraudulent data. The genuine data will (usually) come first in the sequence of events and contain the real behavioural activity of, for example, a cardholder, with fraudulent payments mixed in once the card or other payment method is compromised. Generative AI can produce similar payment sequences, simulating a fraud attack on a card, which can then be added to training data to help the fraud detection machine learning tools perform better.
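The shape of such a sequence can be sketched in a few lines of Python. To be clear, this is a hand-written random simulator standing in for a trained generative model, not a transformer. The function `simulate_compromised_card`, the merchant categories and the amount ranges are all invented for illustration.

```python
import random

def simulate_compromised_card(n_genuine=20, n_fraud=5, seed=0):
    """Return a payment sequence: genuine history first, then a mix of
    genuine and fraudulent payments after the (simulated) compromise."""
    rng = random.Random(seed)

    def genuine_payment():
        # Hypothetical genuine behaviour: modest amounts, familiar merchants.
        return {"amount": round(rng.uniform(5, 80), 2),
                "merchant": rng.choice(["grocery", "fuel", "cafe"]),
                "is_fraud": 0}

    def fraud_payment():
        # Hypothetical attack behaviour: larger amounts, resellable goods.
        return {"amount": round(rng.uniform(200, 900), 2),
                "merchant": rng.choice(["electronics", "gift_cards"]),
                "is_fraud": 1}

    sequence = [genuine_payment() for _ in range(n_genuine)]
    # After the compromise, fraud is interleaved with ongoing genuine use.
    tail = [1] * n_fraud + [0] * (n_fraud // 2)
    rng.shuffle(tail)
    sequence += [fraud_payment() if flag else genuine_payment() for flag in tail]
    return sequence

# Three synthetic card histories that could be appended to a training set.
synthetic_sequences = [simulate_compromised_card(seed=s) for s in range(3)]
```

A real generative model would learn these patterns from historical payment sequences rather than from hard-coded ranges, but its output would have the same structure: labelled, ordered payments ready to be added to the training data.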
How can generative AI aid fraud detection?
One of the biggest criticisms of OpenAI’s ChatGPT is that today’s models can produce inaccurate or ‘hallucinated’ outputs. This is a flaw many in the payments and fraud space are rightly concerned about, as they do not want their public tools, such as customer service chatbots, presenting false or made-up information. However, we can take advantage of this ‘flaw’ when generating synthetic fraud data, as variation in the synthesised output can produce entirely unique fraud patterns, bolstering the detection performance of the end fraud defence model.
As many will know, repeated examples of the same fraud signal do not effectively improve detection, as most machine learning methods require very few instances of each pattern to learn from. The variation in the outputs of the generative model adds robustness to the end fraud model, enabling it not only to detect the fraud patterns present in the data, but also to spot similar attacks that can easily be missed using a traditional process.
This may be slightly alarming for cardholders and fraud managers, who are right to ask how a fraud model trained on made-up data can help to improve fraud detection, and what the benefits of doing so may be. What they might not realise is that before any model is used on live payments, it goes through rigorous evaluation exercises to ensure it performs as expected. If the model doesn’t live up to the extremely high standards expected, it is discarded, and replacements are trained until a suitable model is found. This is a standard process, followed for every machine learning model produced, as even models trained on authentic data can deliver sub-standard results at the evaluation stage.
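A minimal sketch of such an evaluation gate follows. It assumes the candidate models are scored on held-out, labelled real payments, and that precision and recall on the fraud class are the metrics of interest. The thresholds, the function names and the toy predictions are hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return (precision, recall) for the fraud class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def passes_gate(y_true, y_pred, min_precision=0.8, min_recall=0.6):
    """Accept a model only if it meets both (assumed) minimum standards."""
    precision, recall = evaluate(y_true, y_pred)
    return precision >= min_precision and recall >= min_recall

# Held-out labels and the predictions of two candidate models (toy data).
y_true      = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
candidate_a = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
candidate_b = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]

for name, preds in [("A", candidate_a), ("B", candidate_b)]:
    verdict = "deploy" if passes_gate(y_true, preds) else "discard and retrain"
    print(name, evaluate(y_true, preds), verdict)
```

The gate only looks at performance on the held-out payments. It does not care whether a model was trained on authentic data, synthetic data, or a mix of the two.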
Generative AI is a fascinating tool with many applications across a range of industries, but today’s iterations, however clever, have their issues. Fortunately, the traits that are viewed as very serious issues in some industries are important features in others, but the requirement for strict regulation and governance remains. Future usage of generative AI requires a complete review of how models trained on partially generated data are used, and governance processes should be tightened accordingly to ensure the required behaviour and performance of the tools are consistently met.