TrojanPuzzle- Covertly Poisoning Code-Suggestion Models

The paper is about an attack side technique that can be utilized to poison the LLM models, which are widely being used nowadays as an assistant for programmers and software developers. The presentation started off with a demonstration of Github copilot, to provide context for where these attacks would have consequences. Copilot works using transformers and a brief description of transformers was provided showing the encoder blocks and the decoder blocks that form the core of a transformer. The major reason why an attack like poisoning would work on LLM models is because most of them use the large data available as part of the internet in their training. For example, the Github copilot is trained on programs that are available in the Github repo to which data can be added by anyone. The paper’s goal is to present two different types of attacks - COVERT and TROJANPUZZLE. The novelty of these attacks is their ability to bypass static detectors by planting the vulnerable data in out-of-context regions, hence causing organizations as well as individuals to be suggested vulnerable and insecure code that can be exploited by attackers. TROJANPUZZLE has an additional benefit as it never explicitly includes the vulnerable code in the poisoned data.

The assumption in the paper is that the attacker would not know the architecture of the code suggestion model, but the model is created via a pre-training and fine tuning pipeline. It also assumes that the data used for training is downloaded from untrusted sources such as GIthub and that the attacker would have the capability to poison this data. For these assumptions, the question was raised by the presenters whether this attack is truly a black box or a grey box model attack. Differing opinions arose in the class, with the lines being blurry.

The baseline attack strategy considered by the authors in this paper is the SIMPLE attack. In this attack, a trigger is associated with the bad payload/code mostly in the form of comments that would be used by the model as the prompt to suggest code. So whenever a user uses the same or similar prompts he/she would be suggested with the insecure poisoned code as opposed to the secure code. However SIMPLE makes no effort in trying to hide the malicious content, and hence can be easily detected by a static analysis tool such as Semgrep or CodeQL. So the defense against an attack like SIMPLE would be to write a query to locate instances like these and filter them out of the training data.

COVERT is the novel attack proposed by the authors to tackle the shortcomings of SIMPLE attack (from an attacker’s perspective). Here the main idea is to include the insecure code in the comments or docstrings instead of the actual code. The reason why COVERT works against a lot of the LLM code suggestion models is that the model is trained on all data, which also includes the comments and docstrings. However, when the code is statically analyzed by the tools before it is fed into the training stage, the tools normally only look through the code part and not the comments or docstrings. For this reason COVERT would be able to bypass the static detectors while still managing to poison the data. Moreover, this strategy can be employed on any programming language that utilizes docstrings. However, It still has the limitation that insecure/malicious code is present in the code (in the form of comments). So the straightforward defense against COVERT would be to search through the entire file while filtering the training data, including the comments and docstrings.

TROJANPUZZLE is an attack that attempts to overcome the limitation of COVERT, that is, not to include the malicious payload plainly in the code. The attack instead of including the entire payload reveals only a subsection of the payload in the training data but still manages to poison the LLM model. This is achieved mainly through the concept of masking the payload. Instead of creating a single payload and distributing it over the untrusted database using copies, TROJANPUZZLE uses a template and makes different versions of these templates as its copies. The above picture demonstrates different variants of the same template that will be included in the datasource, to poison the model. For an attack like this, a single rule would not be able to capture and filter out the malicious data, since it is spread out and built through a combination of probability across all the different versions. In the example given above, once the model is poisoned when a user queries with the prompt “Process the proper template using the secure method render”, the suggestion would be the insecure code return jinja2.Template(f.read()).render(email=email)

For the evaluation, the authors took a dataset consisting of 18,310 public repositories on Github which they split into poison samples (2.22 GB), fine tuning samples (2.35 GB), and samples for baseline testing (1.31 GB) in a 40% - 40% - 20% split. The 3 major vulnerabilities that the authors tried to poison the dataset with are - Cross-Site Scripting, Path traversal, and Deserialization of Untrusted Data. Brief descriptions were given for the vulnerabilities to provide context on the ease of introducing the vulnerability as well as its consequences. The code suggestion model focused on by the attacker to poison is the CodeGen LLM models released by Salesforce. Experiments were conducted to evaluate the percentage of instances where insecure code was suggested for the 3 different vulnerabilities. It was observed that after just one epoch of fine tuning SIMPLE outperformed both the attacks presented, with TROJANPUZZLE coming in last. However, after further training epochs, COVERT and TROJANPUZZLE started getting better and closer to SIMPLE. The presenters also suggested that while it is hard to filter out TROJANPUZZLE during the training phase it can be detected through binary analysis during the testing stage.

DISCUSSION

As a starting point for the discussions, the 3 vulnerabilities were discussed deeper. Deserializing data is becoming a higher point of vulnerable code with more third party services being used in a code base. The deserialization process is an easy stage for insecurities to be introduced into the codebase. Cross-site scripting was a legitimate method for a server to request data from sources outside its domain, but it led the malicious actors a medium to include scripts into form submissions to be executed on the client side. Path traversal is an insecurity where the user is able to traverse to directories outside of the directory from which the website it served. Originally access is only allowed to the locations below the home directory for the website in the directory hierarchy, but if not configured right the attacker would be able to traverse up and access information like ‘/etc/passwd’ which would contain the sensitive password hashes.
Is the poisoning attack presented by the authors practical? The point to be considered is that just adding the poison data to one repository would not be enough to influence the model to start suggesting insecure code. Even if the insecure code is added to the source, by creating hundreds of repositories that still would not be enough as mostly the priority would be given to the popular repositories. Moreover, these are high effort tasks that might cause suspicion. The conclusion was that the strategies presented are more practical in highly targeted attack more than a general code suggestion model. In targeted attacks like this, the prompts and triggers would be unique and specific to the situation.
The authors make certain assumptions in creating their attack strategy. They assume that the dataset is not going to be cleansed effectively before being used for training, that is, the static analysis tool would not check the comments during its search. They assume that the model would learn from the comments and docstrings. The assumption seems fair as most LLM takes in the whole files as input to the deep learning network which would also include the comments.
How practical is the open source review system to prevent insecurities? Even though there are a lot of reviewers in the open source world, there are only a few individuals who would be involved in a specific section of the software. For this reason, even though the vulnerabilities do get patched, it takes time, and instances like that do happen in the real world.
Another example of an area where such poisoning can happen, is in python packages. Almost all developers use python packages through pip but almost none of them would verify the sources to ensure the credibility of the package installed. Another anecdote where code from untrusted sources causes issues, was when Docker and Razor both used the same code from StackOverflow which caused an issue where software from both companies was conflicting with each other, hence crashing. Recently, the official python repo had a link inside it that was pointing towards a malicious website. This shows that malicious actors influencing the open source world are something that does happen in the real world, although it does get patched up soon.
This strategy is extremely practical if it is employed for zero day vulnerabilities, since analysis would not be able to identify and filter out the poison from the dataset. By the time, the vulnerability is discovered and announced, it would have spread across the world through the code suggestion model.
It is due to these reasons, that big companies normally have their own repositories and forums that the employees are allowed to refer.
Since comments cannot be excluded during the training phase of a code suggestion model, as it is the trigger that is used to suggest the right code, this problem is easier solved outside of machine learning models. This would be a stage in the pipeline, for example where the suggestions are filtered in a rule-based manner or using a different model before it is outputted. Another area where the problem can be tackled is in the CI/CD pipeline where rules can be made to make sure that the code is analyzed and insecure code is flagged and not allowed to be committed. However, it should be noted that for a zero day vulnerability none of these measures would work. Hence the attack strategy still remains effective in this domain from an attacker’s perspective.

Written on April 6, 2023