Control AI for Smart Contracts

If AI is used to generate code for smart contracts, it can become very risky if that code goes unaudited or contains hidden backdoors. Smart contracts and blockchains are perhaps the most critical area of software where code is so sensitive that one cannot risk having bugs or backdoors in it. Any bug or backdoor can lead to an exploitable vulnerability, resulting in millions of dollars worth of losses in crypto.

The latest AI models, such as GPT-4o, are assumed to be untrustworthy because even though they are very powerful, they can - intentionally or unintentionally - generate much sneakier malicious smart contract code that can go unnoticed. The goal of this research is to test and design a protocol involving an Untrusted model (such as GPT-4o) and a Trusted model (such as GPT-3.5-turbo), and to determine whether we can control even the most malicious Untrusted model through such a protocol - catching backdoors it generates and fixing them before they become exploitable vulnerabilities that lead to potential loss of funds.

Background and Prior Work

This research builds on prior methodology from Ryan et al. at the Alignment Research Center, which explored AI control and scalable oversight in the context of code generation. That original work used a dataset of coding problems from CodeForces and focused on general-purpose programming. Zokyo adapted this framework, but with a critical difference: the dataset involves Solidity smart contract code, bringing the research directly into the Web3 security domain.

The concept of AI control revolves around a mechanism or protocol that can keep a check on code generated by the latest AI models - termed "Untrusted models." The idea is that if we can establish a reliable monitoring protocol, we can catch malicious or buggy code before it gets deployed on-chain, where it would be immutable and immediately exploitable.

Methodology: Red Team vs. Blue Team

The methodology involves creating two teams: a Red team and a Blue team. In almost all scenarios, the Red team is tasked with using the Untrusted AI model to solve coding problems while generating sneaky malicious code. The Blue team is tasked with uncovering this hidden malicious vulnerability.

If the Red team gets away with adding a sneaky bug undetected (but still passing the test cases), it means the Red team has won.
If the Blue team catches the Red team while adding the sneaky bug, it means the Blue team wins.

This adversarial setup mirrors real-world scenarios where AI-generated code might contain subtle vulnerabilities that pass basic testing but could be exploited in production.

Dataset Creation

For this research, the team created a dataset of Solidity problems and solutions by referring to the Dapp World coding platform. This dataset provides a set of smart contract challenges with known correct solutions and test cases, allowing the researchers to evaluate whether AI models can introduce backdoors that evade detection while still producing code that appears functionally correct.

Evaluation Metrics: Usefulness and Safety

There are two very important metrics used to evaluate the outcome and analyze the results: Usefulness and Safety.

Usefulness

To measure usefulness, the proportion of submitted code that is correct (i.e., passes all test cases) is measured. This metric reflects how well the control protocol solves the main task (i.e., the programming problems), assuming the model is behaving honestly. It is evaluated during an honest evaluation scenario - no attacks - of the trusted setup.

Safety

Safety is measured by the probability that the protocol submits backdoored code at least once during a sequence of problems. This metric reflects the risk of a successful attack - whether any backdoored output slips through undetected. A lower probability means the protocol is safer.

Monitoring Approaches

Trusted Monitoring

In trusted monitoring, the Red team uses an Untrusted AI model such as GPT-4o, and the Blue team uses a Trusted model such as GPT-3.5-turbo. The task of the Untrusted model is to solve coding problems while adding a sneaky backdoor to the code before submitting it to the Blue team. The Blue team then uses the Trusted model to monitor and check for the backdoor.

In this setup, the trusted monitor rates the suspiciousness of code submitted by the Untrusted model. Using upfront auditing, the trusted model audits only the top 2% most suspicious code by asking the Trusted model to rate whether the code looks suspicious to it.

Untrusted Monitoring

Untrusted monitoring is a more intriguing scenario. The setup is similar to the trusted setup, with the only difference that the Blue team also has to use the Untrusted model that the Red team uses. This can be gamed by the Red team if they train the Untrusted model to lie and say that the code does not look suspicious, which would lead to the Red team winning 100% of the time.

Blue Team Defense Strategies

The Blue team can counter attempts by the Red team using strategies such as:

Manual calibration: Constructing a set of 30 problems and solutions and then asking the Untrusted model whether the code looks suspicious or not, then shutting down the Untrusted model if the false negative rate is too high.
Upfront auditing: Having the Trusted model audit the top 2% of the most suspicious code submissions based on suspiciousness ratings.
Hybrid approaches: Combining automated monitoring from Trusted models with periodic human auditor review.

Results

Baseline Honest Mode

In the baseline honest mode of trusted monitoring, the Untrusted model (GPT-4o) failed to pass all test cases for most problems - failing in 4 out of 6 problems. This means the Untrusted model was not sophisticated enough to introduce a bug or vulnerability in the code without also failing the test cases. This is a positive finding: the model is not yet sneaky enough to inject hidden backdoors while maintaining functional correctness.

Suspiciousness Ratings

The trusted monitor rated the suspiciousness of code submitted by the Untrusted model at 8, 9, 8, 9, 8, 8 out of 10 for each of the six problems. However, the monitor was only able to catch and audit 2 out of 3 suspicious submissions because the setup allows only auditing of the top 2% of the most suspicious code.

Usefulness and Safety Scores

The usefulness of the trusted monitoring setup for Solidity code was 33.33% (2 out of 6 problems passed all test cases), and safety was 66.67% (4 out of 6 tasks were safe). These numbers illustrate the trade-off between the ability of the protocol to produce correct code and its ability to prevent backdoored code from slipping through.

Key Finding

The Untrusted model (GPT-4o) was not intelligent or sneaky enough to introduce a bug or vulnerability in the Solidity code without failing the test cases - suggesting current AI models struggle to generate both functional and malicious smart contract code simultaneously.

CVE Detection: AI vs. Reality

As a supplementary evaluation, the research also tested how well ChatGPT-4o performs at detecting known CVEs (Common Vulnerabilities and Exposures) in real-world smart contracts. The results were not encouraging: ChatGPT-4o was only able to detect 2 CVEs, while other CVEs were flagged incorrectly as being present in contracts where they did not actually exist.

This highlights a critical limitation. While general-purpose AI models are becoming more capable, they still produce a high rate of false positives and miss real vulnerabilities when applied to actual codebases. The need for more robust, custom-built, and trained AI models specifically designed for carrying out vulnerability detection in real-world smart contract codebases is clear.

Warning

General-purpose AI models like GPT-4o are not reliable for smart contract vulnerability detection. They produce false positives, miss real vulnerabilities, and should not be used as a sole audit mechanism.

Recommendations

Based on the findings of this research, the following recommendations emerge for teams working with AI-generated smart contract code:

Hybrid human-AI approach: Use a combination of human auditing along with the Trusted model. Although humans are more expensive than AI models to audit code, they are still the most trusted to find bugs - even if it can be more time-consuming compared to AI models.
Continuous monitoring workflow: Set up a workflow where Trusted models monitor and push audit reports at each iteration, while human auditors perform frequent manual checks - such as monthly reviews - to ensure the code is truly safe and free of backdoors.
Custom-trained models: Invest in building and training AI models specifically for Solidity vulnerability detection, rather than relying on general-purpose models like GPT-4o for security-critical tasks.
Never trust AI output blindly: Any AI-generated smart contract code must go through rigorous human review before deployment, regardless of how advanced the model is.

Conclusion

This research demonstrates that while AI models are becoming increasingly powerful at code generation, the current generation of models is not yet sophisticated enough to reliably introduce hidden backdoors in Solidity code without also failing functional test cases. This is both reassuring and a reminder that the situation will likely change as models grow more capable.

The trusted monitoring protocol - pairing an Untrusted model with a Trusted model - shows promise as a mechanism for catching malicious code, but it is far from foolproof. The safety metric of 66.67% means that roughly one in three attack attempts could potentially slip through, which is unacceptable for smart contracts managing real funds.

The bottom line is clear: although AI can accelerate smart contract development, human auditors remain essential. The most effective approach combines AI-assisted monitoring with regular, thorough human review - leveraging the speed of AI while relying on the judgment and expertise that only experienced security researchers can provide.

Control AI for Smart Contracts: Can We Stop Backdoors in Solidity Code Generated by AI?