2023 Year End Special Edition Magazine
Fahreen Kurji
Chief Customer Intelligence Officer
Behavox
AI Showdown: Behavox AI outperforms ChatGPT in compliance
Just the other day, an Uber driver enthusiastically recommended ChatGPT to me, swearing by its impressive results. It is evident that increasingly powerful AI systems are on the horizon, even as some of the world’s most renowned scientists and business leaders have called for a temporary slowdown in the development of powerful AI models.
This raises the question of how ChatGPT can be utilized within the banking sector, and by extension, in compliance. We’ve already observed in the media that some banks are taking measures to ban ChatGPT, while others are embracing it on an enterprise-wide scale. With the tantalizing promise of AI, numerous compliance professionals are eager to harness its potential to enhance productivity and minimize false positives.
Employing AI to produce compliance alerts or to sort through such alerts is a multifaceted challenge. It is not as straightforward as linking one’s archiving software to the OpenAI API – an action that, by the way, would infringe upon several data governance and data privacy regulations. Italy’s data protection authority even went so far as to temporarily ban ChatGPT outright.
Surveillance serves as a detective control, which necessitates understanding the context of the policy, detecting as many policy violations as possible (measured by recall), and generating a minimal number of false positives (measured by precision). Furthermore, detective controls must be auditable, and compliance teams should be able to explain the setup and functioning of these controls.
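As a minimal illustration of these two metrics, the sketch below reduces them to simple ratios over reviewed alerts; the counts and function name are hypothetical, not drawn from any Behavox system.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision: share of generated alerts that are genuine violations.
    Recall: share of genuine violations that actually generated an alert."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Hypothetical review outcome: 40 true alerts, 160 false alerts,
# and 10 violations the control missed entirely.
p, r = precision_recall(40, 160, 10)
print(f"precision={p:.0%}, recall={r:.0%}")  # precision=20%, recall=80%
```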
To be considered robust and accepted for use in a regulated environment, any AI model or lexicon scenario implemented by compliance teams must adhere to three pillars of model risk management. Behavox uses, and advises all of its clients to use, a framework pioneered by the Federal Reserve in its SR 11-7 paper, widely considered the gold standard in model risk validation. The three pillars are:
- Conceptual soundness – understanding the model’s functioning and being able to explain its fundamental architecture.
- Outcomes analysis – ensuring that the model’s results align with our expectations.
- Ongoing monitoring and change management – regularly evaluating models and subjecting any changes to a rigorous evaluation process before implementing them in production (a minimal sketch of such a gate follows this list).
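To make the second and third pillars concrete, here is a minimal sketch of the kind of outcomes-analysis gate a compliance team might apply before promoting a model change. The function names, test-set format, and tolerance threshold are illustrative assumptions, not Behavox’s actual process.

```python
def evaluate(model, test_set):
    """Run a model (any callable returning True to alert on a message) over a
    labeled test set of (text, is_violation) pairs; return (precision, recall)."""
    tp = fp = fn = 0
    for text, is_violation in test_set:
        alerted = model(text)
        if alerted and is_violation:
            tp += 1
        elif alerted:
            fp += 1
        elif is_violation:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def approve_model_change(candidate, current, test_set, max_recall_drop=0.02):
    """Change-management gate: promote the candidate model only if its recall
    on the approved test dataset does not regress beyond the agreed tolerance."""
    _, cand_recall = evaluate(candidate, test_set)
    _, curr_recall = evaluate(current, test_set)
    if cand_recall < curr_recall - max_recall_drop:
        raise RuntimeError(f"Recall regression: {curr_recall:.1%} -> {cand_recall:.1%}")
    return True
```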
Many compliance professionals mistakenly believe that the Fed’s SR 11-7 paper on model risk validation doesn’t apply to them because they are not a bank. However, this is a misunderstanding of the paper’s purpose. Rather than being exclusive to banks, SR 11-7 should be seen as a thought leadership piece based on lessons learned from the 2008 subprime crisis. It serves as a warning against blindly relying on AI models without proper validation and understanding of their workings.
Whether it’s a mortgage risk underwriting model, AI, or a lexicon scenario for generating compliance alerts, all models must be evaluated using the principles introduced by the Fed in 2011. This brings us to the implementation of AI in compliance, which is a critical area where the principles of SR 11-7 must be applied to ensure that AI models are validated and understood before they are used.
However, using ChatGPT for compliance purposes fails on all three pillars of model risk validation. ChatGPT lacks explainability and conceptual soundness: its training dataset is closed, not transparent, and cannot be audited, and it is unclear how and why the model reaches its decisions. Even when it offers an explanation for an answer, the underlying mathematics cannot be evaluated.
When performing outcomes analysis, ChatGPT’s decisions are not always consistent: it might generate an alert for a specific phrase in one instance but fail to do so for the same phrase in another. When you click “regenerate response,” you do not receive the exact same response; you get a variation. Likewise, a slight change in the phrase might lead compliance teams to overlook an obvious true positive. According to a University of Oxford study on the consistency of ChatGPT, “Although these mistakes may seem insignificant in routine daily tasks, such as drafting an email, they raise significant concerns in conservative and risk-sensitive domains, such as law, medicine, and finance.”
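One simple way to quantify this instability is a repeat-query test: submit the identical message several times and measure how often the model agrees with its own majority verdict. The sketch below assumes a hypothetical classify(text) callable wrapping whatever model is under test.

```python
from collections import Counter

def consistency_rate(classify, text, runs=10):
    """Submit the same message `runs` times and return the share of verdicts
    that match the most common answer. A deterministic, auditable control
    scores 1.0; a sampled LLM will often score lower."""
    verdicts = [classify(text) for _ in range(runs)]
    _, count = Counter(verdicts).most_common(1)[0]
    return count / runs

# Hypothetical usage, where classify(text) returns "alert" or "no alert":
# rate = consistency_rate(classify, "Let's keep this off the recorded line.")
```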
Behavox conducted a benchmarking test using planted content to evaluate ChatGPT’s performance. ChatGPT detected only 18% of the intentionally planted phrases (a recall of 18%), disappointingly low even compared to the 22% recall achieved by lexicon scenarios (Behavox Advanced Scenarios), and far short of the domain- and task-specific Behavox Quantum AI, which caught 84% of the planted phrases. Most importantly, ChatGPT’s results were not always consistent and tended to change unpredictably when users requested regenerated results.
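For context, a planted-content benchmark of this kind reduces to set arithmetic: seed a corpus with known violations, run each control, and count how many of the plants it flags. A minimal sketch with hypothetical message identifiers:

```python
def planted_recall(planted_ids, flagged_ids):
    """Recall on planted content: the fraction of intentionally planted
    violations that the control alerted on."""
    return len(planted_ids & flagged_ids) / len(planted_ids)

# Hypothetical run: 100 planted phrases; a control that catches 18 of them
# (whatever else it also flags) scores 18% recall on the benchmark.
planted = {f"msg-{i}" for i in range(100)}
flagged = {f"msg-{i}" for i in range(18)} | {"msg-900", "msg-901"}
print(f"recall={planted_recall(planted, flagged):.0%}")  # recall=18%
```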
Our experience with clients under monitorship as part of a consent order has shown that regulators and monitors are increasingly focused on recall and outcomes analysis. In these cases, the monitor’s primary concern is the effectiveness of the compliance solution in detecting planted content, emphasizing the importance of high recall rates.
Lastly, organizations do not have control over change management and cannot conduct ongoing monitoring of model performance. The ChatGPT model is fine-tuned and updated by OpenAI according to its own schedule. Consequently, you would have no control over when to upgrade and it would be challenging to evaluate how these updates might impact the effectiveness of your compliance program.
Unfortunately, there are no shortcuts when adopting AI. All companies that want to offer AI-driven solutions to their customers in financial services must do the heavy lifting and conduct thorough research.
As the market leader in applying AI to compliance, Behavox worked through numerous challenges to deliver AI to our customers in 2022. Having deployed AI in production and passed regulatory inspections, internal audits, and model risk validations, we know the quality bar for AI acceptance in financial services is set exceptionally high. Behavox, as a trailblazer, is not only meeting this quality bar but also pushing it higher.
Here is a checklist of essential items for a successful AI roll-out in compliance:
- Documentation: This is critical. It outlines the applicable regulations and maps each detective control (surveillance) to the specific regulations it addresses, a mapping we call the risk taxonomy, and it allows customers to align AI with their needs.
- Training dataset: This is essential for explainability. The dataset must be domain-specific, high quality, and reviewed and prepared by multiple compliance professionals to ensure trustworthiness.
- Audit of training datasets: These must be available to customers, regulators, and auditors. Datasets owned by Behavox are available for inspection and review with appropriate security measures in place. Full datasets can be accessed in secure data rooms set up by Behavox, which are available in major cities worldwide.
- Outcomes analysis and model risk validation: These must be performed for all models. All Behavox AI models come with precision and recall calculated on a test dataset that has been reviewed and approved by the customer.
- Feedback loop and model retraining: Behavox operates a feedback loop to incorporate model improvements suggested by customers and continues to enhance precision and recall.
- Change management process and model upgrade process: Behavox must handle these aspects to deliver a high-quality model compliant with the model risk framework.
By addressing all of the above, Behavox ensures the delivery of top-notch AI solutions that meet the stringent requirements of the financial services industry.
In conclusion, ChatGPT is not yet AGI and cannot be used out-of-the-box for compliance. Domain-specific models that are task-adapted are the preferred approach for compliance, as they can deliver significantly higher performance.
Even if ChatGPT were to achieve AGI, compliance teams would still require a robust testing and evaluation framework to accept or reject it. To help the industry objectively evaluate model performance in compliance, Behavox is making its Benchmark Datasets available to financial services firms free of charge.
These datasets have been meticulously assembled by over 200 compliance professionals and represent not only true positives but also the most common false positives. With these resources, the industry can not only evaluate their current lexicon-based controls but also objectively compare alternatives from any vendor and assess model performance, as demonstrated in this article.