Research Papers on Machine Learning Based Malware Detection


DeepSign- Deep Learning For Automatic Malware Signature Generation And Classification

The paper presents a novel approach to generating a signature for a malware program that is representative of its behavior and robust to small changes. Unlike the other works discussed in this course, this paper uses data collected through dynamic analysis of the malware, and the signature is formed using unsupervised learning. To this end, the authors build their architecture using Deep Belief Networks (deep neural networks composed of unsupervised building blocks such as Restricted Boltzmann Machines) arranged as a deep stack of denoising autoencoders. Each malware file is first run in a sandbox to generate logs that capture the behavior of the program. The sandbox used for this purpose is the Cuckoo Sandbox, a standard tool for such tasks. The sandbox records native function and Windows API call traces, including details of files created and deleted, IP addresses, URLs, ports accessed, registry activity, etc. These logs are then converted into a binary bit-string using unigram (1-gram) extraction, which serves as the input to the neural network. Unigrams that appear in every file are removed, since they cannot be used to distinguish between files, and only the 20,000 unigrams with the highest frequency are kept, yielding a 20,000-bit string. This bit-string is fed into the Deep Belief Network-based deep denoising autoencoder, which consists of eight layers trained one layer at a time, and which produces the signature. At the end of the process, a signature of 30 floating-point values is formed. The specific hyperparameters of the models are provided in the paper; for the sake of brevity I will skip them here. The quality of these signatures is evaluated by training models in a supervised manner on the signatures and then testing how well they detect malware using conventional methods. An SVM and a KNN model trained on the signatures achieved accuracies of 96.4% and 95.3% respectively. A deep neural network, identical to the DBN except for six additional neurons in the output layer, was also trained on the signatures and achieved 98.6% accuracy on the test data.
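
As a rough illustration of the preprocessing step, here is a minimal sketch of the unigram-to-bit-string conversion described above (function and variable names are my own, not from the paper):

```python
from collections import Counter

def build_vocab(log_texts, vocab_size=20_000):
    """Select the most frequent unigrams, dropping ones present in every log."""
    doc_freq, total_freq = Counter(), Counter()
    for text in log_texts:
        tokens = text.split()
        total_freq.update(tokens)
        doc_freq.update(set(tokens))          # count each unigram once per log
    candidates = [t for t in total_freq if doc_freq[t] < len(log_texts)]
    top = sorted(candidates, key=lambda t: -total_freq[t])[:vocab_size]
    return {t: i for i, t in enumerate(top)}

def to_bit_vector(text, vocab):
    """Convert one sandbox log into a fixed-size binary presence vector."""
    bits = [0] * len(vocab)
    for token in set(text.split()):
        if token in vocab:
            bits[vocab[token]] = 1
    return bits
```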

Read More

Passphrase and keystroke dynamics authentication- Usable security

This research works along the same conceptual lines as the previous paper discussed. It deals with profiling a user based on metrics such as keystroke dynamics, so that the profile can act as an authentication layer in a multi-factor authentication system. The motivation for the paper is that conventional password-based systems are no longer sufficient given the proficiency and computing resources available to cyber criminals. Security policies were introduced to increase security, but because of the way humans cope with them, they often end up being counterproductive and decrease the overall level of security. The advantage of a two-factor authentication system over a basic one-layered password system is its resistance against attack techniques such as phishing and social engineering, dictionary attacks, brute force, keyloggers, shoulder surfing, database attacks, etc.
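
To make the keystroke-dynamics metric concrete, here is a minimal sketch of two features such systems commonly extract, dwell time and flight time (the event format is my assumption, not from the paper):

```python
def keystroke_features(events):
    """events: list of (key, press_time, release_time) tuples in seconds."""
    dwell = [release - press for _, press, release in events]   # hold time per key
    flight = [events[i + 1][1] - events[i][2]                   # gap between keys
              for i in range(len(events) - 1)]
    return dwell, flight

# Typing "ab": dwell ~ [0.08, 0.07], flight ~ [0.07]
print(keystroke_features([("a", 0.00, 0.08), ("b", 0.15, 0.22)]))
```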

Read More

Online Binary Models are Promising for Distinguishing Temporally Consistent Computer Usage Profiles

The paper investigates whether characteristics of a user such as keystrokes, network usage, and mouse movements can act as an identifier, and whether these patterns are enough for an online binary model to identify the user. To formalize this research topic, the authors focus on three questions. Firstly, are these usage profiles (habits) consistent over time? Secondly, do users have unique profiles, that is, are the profiles distinguishable from each other? Finally, which features are the most important for constructing a unique profile?
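
As a toy illustration of an online binary model in this setting (not the authors' exact models), a linear classifier can be updated incrementally as new usage windows stream in; this assumes a recent scikit-learn:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# One binary model per user: "does this activity window belong to user u?"
model = SGDClassifier(loss="log_loss")       # logistic regression trained online

rng = np.random.default_rng(0)
for day in range(30):                        # stream of daily usage windows
    X = rng.normal(size=(64, 10))            # 10 aggregated usage features (placeholder)
    y = rng.integers(0, 2, size=64)          # 1 = target user, 0 = someone else
    if day == 0:
        model.partial_fit(X, y, classes=[0, 1])   # declare labels on first call
    else:
        model.partial_fit(X, y)
```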

Read More

Lost at C- A User Study on the Security Implications of Large Language Model Code Assistants

The question under study in this paper is whether developers with access to an LLM-based code completion assistant produce less secure code than programmers without such access. The LLM completions themselves might contain security vulnerabilities, and some programmers could accept the buggy suggestions. To explore this area deeper, the authors focused on three research questions:

  1. Does an AI code assistant help novice users write better code in terms of functionality?
  2. Given functional benefits, does the code that users write with AI assistance have an acceptable incidence rate of security bugs vis-a-vis code written without assistance?
  3. How do AI assisted users interact with potentially vulnerable code suggestions—i.e., where do bugs originate in an LLM-assisted system?
Read More

Examining Zero-Shot Vulnerability Repair with Large Language Models

The paper examines the ability of Large Language Models to repair security vulnerabilities in code. The key point of interest is that the LLM operates in a zero-shot setting, that is, it is tasked with something it has not encountered or been trained for before. The authors then try to answer a few questions through their research. Can the model generate safe and functional code to replace the vulnerable code? The emphasis is on ensuring that the repaired code preserves the original intended functionality. Another question is whether changing the information provided through the prompts affects the model’s ability to suggest fixes. Finally, the authors examine the challenges this strategy of fixing vulnerabilities would face in the real world, and whether the approach is reliable.
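
The prompt-variation question can be pictured with a sketch like the one below; the wording is my own illustration of a zero-shot repair prompt, not the paper's exact templates:

```python
# Illustrative zero-shot repair prompt: the vulnerable line is shown as a
# comment and the model is asked to regenerate a fixed version.
vulnerable = "query = \"SELECT * FROM users WHERE name = '\" + username + \"'\""

prompt = (
    "# BUG: the following line is vulnerable to SQL injection (CWE-89):\n"
    f"# {vulnerable}\n"
    "# FIXED version with identical functionality:\n"
)
# `prompt` would then be sent to a code-generating LLM for completion.
print(prompt)
```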

Read More

TrojanPuzzle- Covertly Poisoning Code-Suggestion Models

The paper is about an attack-side technique that can be used to poison LLM-based code-suggestion models, which are widely used nowadays as assistants for programmers and software developers. The presentation started off with a demonstration of GitHub Copilot, to provide context for where these attacks would have consequences. Copilot is built on transformers, and a brief description of transformers was provided, showing the encoder and decoder blocks that form their core. The major reason a poisoning attack works on these models is that most of them are trained on the large amount of data available on the internet. For example, GitHub Copilot is trained on programs available in public GitHub repositories, to which anyone can add data. The paper presents two attacks, COVERT and TROJANPUZZLE. The novelty of these attacks is their ability to bypass static detectors by planting the poisoned data in out-of-context regions such as docstrings and comments, thereby causing organizations as well as individuals to be suggested vulnerable and insecure code that attackers can exploit. TROJANPUZZLE has the additional benefit that it never explicitly includes the vulnerable payload in the poisoned data.
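
As a deliberately harmless mock of the out-of-context-region idea (my own illustration, much simpler than the paper's actual payload construction): the insecure variant appears only inside a docstring, where static detectors rarely look, yet a model trained on raw text can still learn the association:

```python
# Mock "poisoned" training file: the bait suggestion lives only in the
# docstring example, not in executed code.
def render(template: str, autoescape: bool = True) -> str:
    """Render a template.

    Example:
        html = render(user_template, autoescape=False)   # <- insecure bait
    """
    # stub standing in for a real template engine
    return template if not autoescape else template.replace("<", "&lt;")
```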

Read More

Pop Quiz! Can a Large Language Model Help With Reverse Engineering?

The paper focuses on applying the increasingly popular Large Language Models to reverse engineering. It provides a framework for evaluating how well an LLM can identify the purpose of an artifact and extract information from it. The authors employ this framework in experiments on two real-world domains: malware and industrial control systems.

In the context of malware, reverse engineering involves recovering information about the specific mechanisms of attack and evasion. This information helps a defender craft strategies to prevent such attacks and can yield further details, such as the likely origin of a malware sample. In the domain of industrial control systems, reverse engineering is employed in more benign contexts, such as maintaining and refactoring legacy programs by extracting the purpose of the code, the mathematical equations, the parameter values, etc.

Read More

Adversarial Machine Learning in Image Classification- A Survey Toward The Defender's Perspective

The paper formalizes adversarial attacks in the field of image classification from a defender’s perspective, exploring existing categorizations and introducing new taxonomies. Even though the paper is not directly about malware analysis, the standards and principles of adversarial attacks translate easily from the domain of image classification to malware analysis.
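
As one concrete example of the attacks such surveys cover, the classic Fast Gradient Sign Method perturbs an image one step along the sign of the loss gradient; a minimal PyTorch sketch (my example, not code from the paper):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.03):
    """One-step adversarial perturbation of images x with true labels y."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()       # step that increases the loss
    return x_adv.clamp(0, 1).detach()     # keep pixels in the valid range
```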

Read More

EvadeDroid- A Practical Evasion Attack on Machine Learning for Black-Box Android Malware Detection

The paper presents an efficient method of making Android malware evasive against a black-box malware detection system. The contributions the authors claim are the following (a query loop in the spirit of the attack is sketched after this list):

  • The strategy works in the problem space and not the feature space, which means there is no issue of converting a feature-space solution back into the problem space to obtain an adversarial malware sample
  • The technique doesn’t rely on information about the model internals and is hence a true black-box attack strategy that works with both hard and soft labels
  • It can select the optimal perturbations in a query-efficient manner
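
EvadeDroid's actual optimization is more elaborate, but the query-efficient idea can be sketched as a greedy black-box loop (all names and the 0.5 threshold are assumptions):

```python
import random

def evade(malware, perturbations, detector, max_queries=50):
    """Keep a functionality-preserving, problem-space edit only if it lowers
    the black-box detector's malware score."""
    best, best_score = malware, detector(malware)
    for _ in range(max_queries):
        candidate = random.choice(perturbations)(best)   # apply one edit
        score = detector(candidate)                      # one query
        if score < best_score:
            best, best_score = candidate, score
        if best_score < 0.5:                             # crossed the decision boundary
            break
    return best
```
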
Read More

Mal-LSGAN- An Effective Adversarial Malware Example Generation Model

Presentation

The paper presents a method to generate adversarial malware samples using Generative Adversarial Networks (GANs). A GAN consists of two components, a generator and a discriminator, and it is the two working against each other, each making the other better, that allows the system to reach a state of stability and, possibly, high performance. In the context of malware and the specific technique introduced by this paper, the generator learns to produce realistic adversarial examples from a random seed, and the samples it produces are fed into the discriminator, which tries to distinguish fake data from real data. Conventionally, GANs in this domain have weaknesses such as unstable training and low-quality adversarial examples. The authors try to solve these issues and produce a better-performing model, Mal-LSGAN. Mal-LSGAN achieved a higher Attack Success Rate and a lower True Positive Rate in comparison with similar existing solutions, MalGAN and Imp-MalGAN.
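
A compressed sketch of the LSGAN ingredient, a least-squares loss in place of the usual cross-entropy, applied to malware feature vectors; layer sizes and the training loop are simplified assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

FEAT, NOISE = 128, 32    # feature/noise sizes are assumptions

G = nn.Sequential(nn.Linear(FEAT + NOISE, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, FEAT), nn.Sigmoid())      # generator
D = nn.Sequential(nn.Linear(FEAT, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))                       # discriminator

mse = nn.MSELoss()   # the least-squares GAN loss

def generator_loss(malware_feats):
    noise = torch.rand(len(malware_feats), NOISE)
    adv = G(torch.cat([malware_feats, noise], dim=1))      # adversarial features
    # the generator wants D to score its output as benign (target 1)
    return mse(D(adv), torch.ones(len(adv), 1))
```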

Read More

Functionality-Preserving Black-Box Optimization of Adversarial Windows Malware

Adversarial attacks on machine learning-based malware detection systems are a common move by attackers. However, in the case of a black-box model, where the attacker does not have access to information about the internal workings of the model, their capabilities are limited. Conventional adversarial sample generation becomes ineffective, as it requires a large number of queries to the black box as well as a malware sandbox to ensure that functionality is preserved after each stage of modification. Therefore, the primary concern of the attacker is to have a strategy that is query-efficient, functionality-preserving, and stealthy (the injected perturbations remain small).
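
One family of functionality-preserving manipulations considered in this line of work is injecting benign content that the Windows loader ignores; appending overlay bytes to a PE file is the simplest instance (a minimal sketch, file names are placeholders):

```python
# Bytes appended past the end of a PE (the "overlay") are never mapped by the
# loader, so execution is unchanged while the bytes the detector sees shift.
def append_overlay(pe_path: str, payload: bytes, out_path: str) -> None:
    with open(pe_path, "rb") as f:
        data = f.read()
    with open(out_path, "wb") as f:
        f.write(data + payload)   # e.g., benign bytes harvested from goodware

# append_overlay("sample.exe", b"\x00" * 1024, "sample_padded.exe")
```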

Read More

No Need to Teach New Tricks to Old Malware- Winning an Evasion Challenge with XOR-based Adversarial Samples

Presentation

The paper is, in a way, an extension of the paper “Shallow Security: on the Creation of Adversarial Variants to Evade Machine Learning-Based Malware Detectors,” presenting the authors’ efforts in the next edition of the same competition. The new edition had a couple more models than the previous one, and this time the contestants had to build their own defensive models as well as adversarial samples. Through this paper, the authors focus on building adversarial samples that bypass ML models, building their own defensive model with an emphasis on the strategy behind it, evaluating the performance of their samples against real AV engines, and publishing the source code of their work.
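
The XOR idea from the title reduces to storing the original binary XOR-encoded inside a dropper that decodes it at run time; a minimal sketch of the encoding step (the dropper side is omitted):

```python
def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each byte with a repeating key; applying it twice restores the input."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

payload = b"original malware bytes"
encoded = xor_bytes(payload, key=b"\x42")          # what the scanner sees
assert xor_bytes(encoded, key=b"\x42") == payload  # self-inverse
```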

Read More

Shallow Security- on the Creation of Adversarial Variants to Evade Machine Learning-Based Malware Detectors

Presentation

The presentation started off by giving context to the paper and its findings. A company called Endgame hosted a competition in which contestants were tasked with evading three ML-based static malware detection models, trained on the EMBER 2018 dataset, that the company had set up. Being a white-box attack competition, the source code and all details were made available to the contestants through a GitHub repo. Moreover, 50 malware samples from different malware families were provided, from which the contestants had to generate adversarial samples.

Read More

Transcending TRANSCEND- Revisiting Malware Classification in the Presence of Concept Drift

Presentation

The paper revisits and formalizes the paper ‘Transcend: Detecting Concept Drift in Malware Classification Models’. The presentation starts off by explaining the different ways in which a data shift can occur: a change in the frequency of features (covariate shift), a change in the base rate of a class (prior probability shift), or a change in the definition of the ground truth (concept drift). Commonly, however, all three of these shifts are lumped together as concept drift, and for the purpose of malware detection models this suffices.

As mentioned for the previous paper, the dissimilarity between a new sample and the existing samples forms the core of this technique and framework. A Non-Conformity Measure (NCM) is used to calculate the distance of a new point from the existing points; based on these values a p-value is computed, which is then compared to a threshold to decide whether the classification is reliable or not. The Transcend framework wraps this measure in a Conformal Evaluator to reject samples that appear to be affected by concept drift, so that they can be analyzed further at a different part of the pipeline and/or used for retraining. The Conformal Evaluator produces two metrics for a sample: confidence and credibility. While confidence deals with the likelihood that a sample belongs to a certain class, credibility gives the reliability of this classification based on the training data.

The Evaluator has a calibration phase in which, based on the p-values of a class, a per-class threshold is set. Once the Evaluator is calibrated, it can take in new samples and decide whether or not to reject them. Different kinds of Non-Conformity Measures are available for the user to choose from, and this choice can be made during calibration. Moreover, each rejection has a cost, whether in the resources spent to quarantine a sample until further analysis can be done on it, or in the time taken to retrain a new model on the rejected samples. Based on the different calibration techniques available, there are different Conformal Evaluators: Transductive Conformal Evaluation (TCE), Approximate Transductive Conformal Evaluation (Approx-TCE), Inductive Conformal Evaluation (ICE), and Cross Conformal Evaluation (CCE).
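
A minimal sketch of the p-value computation at the heart of conformal evaluation; the NCM here (a per-sample non-conformity score against one class's calibration set) is just one possible choice:

```python
import numpy as np

def p_value(ncm_new, ncm_calibration):
    """Fraction of calibration samples at least as non-conforming as the new one.
    A low p-value means the new sample fits this class poorly."""
    scores = np.asarray(ncm_calibration)
    return float((scores >= ncm_new).sum()) / len(scores)

# Calibration NCM scores for one class, e.g. distances to the class centroid
calibration = [0.2, 0.4, 0.1, 0.3, 0.5]
print(p_value(0.45, calibration))   # 0.2 -> weak fit, candidate for rejection
```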

Read More

Transcend- Detecting Concept Drift in Malware Classification Models

Presentation

The paper mainly deals with a method to detect concept drift in machine learning models. The presentation started off with an introduction to concept drift, with some visual aids to help understand the topic. Concept drift is a change in the statistical properties of an object in unforeseen ways. In the context of malware, the two main sources of concept drift are malware evolution and new malware families: the former covers the way malware evolves in response to detection systems, while the latter covers completely new malware families being developed. A major engineering choice is when to retrain a model in response to concept drift; since retraining takes resources, it does not make sense to retrain for small magnitudes of drift.

Read More

DroidEvolver- Self-Evolving Android Malware Detection System

Presentation

The paper introduces a new malware detection system, DroidEvolver, that handles concept drift without the need for an explicit retraining stage or true labels. The system keeps a pool of five models, and the premise is that drift affects each model at a different rate. Hence, when one model is observed to misclassify a sample due to concept drift, the results from the other models can be used to create a pseudo label with which the aging model is updated (a sketch of this pseudo-labeling step follows the list below). This means the update process takes place continuously throughout the life of the system, without human intervention and without the need for true labels. The models in DroidEvolver’s pool are built using the following online learning algorithms:

  1. Passive Aggressive - first-order learning that uses hinge loss and a varying learning rate
  2. Online Gradient Descent - first-order learning that uses an incremental approach with hinge loss but a fixed learning rate
  3. Adaptive Regularization of Weight Vectors - second-order learning that updates the mean and covariance of a Gaussian distribution based on the classification results of new samples
  4. Regularized Dual Averaging - online learning with regularization that adjusts the parameters of an optimization problem
  5. Adaptive Forward-Backward Splitting - online learning with regularization that, unlike RDA, uses geometric information of the data distribution
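
The pseudo-labeling step can be sketched as below; the predict/update interface and the simple weighted vote are assumptions, whereas the paper detects aging models with a juvenilization indicator:

```python
import numpy as np

def pool_update(x, models, weights):
    """Update aging models in the pool with a pseudo label from the others."""
    votes = np.array([m.predict(x) for m in models])   # +1 malware / -1 benign
    label = 1 if float((votes * weights).sum()) >= 0 else -1
    for model, vote in zip(models, votes):
        if vote != label:            # disagreeing model is treated as aging
            model.update(x, label)   # evolve it using the pseudo label
    return label
```
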
Read More

Fast & Furious- Modelling Malware Detection as Evolving Data Streams

Paper

The paper mainly revolves around concept drift, its effects, and ways to deal with it, particularly in the Android malware landscape. In contrast to the literature available at the time in this area, the paper introduces a novel data-stream machine learning approach in which a drift detector triggers retraining of both the static feature extractor and the classifier in the pipeline.
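
Dedicated stream-learning drift detectors exist for this purpose; as a rough stand-in for the idea, here is a tiny detector that flags a drop of recent accuracy below the long-run accuracy (window size and tolerance are arbitrary assumptions):

```python
from collections import deque

class AccuracyDriftDetector:
    """Signals drift when recent accuracy falls well below long-run accuracy."""

    def __init__(self, window=100, tolerance=0.10):
        self.recent = deque(maxlen=window)
        self.correct, self.total = 0, 0
        self.tolerance = tolerance

    def update(self, was_correct: bool) -> bool:
        self.recent.append(was_correct)
        self.correct += was_correct
        self.total += 1
        long_run = self.correct / self.total
        recent = sum(self.recent) / len(self.recent)
        return recent < long_run - self.tolerance   # True -> retrain the pipeline
```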

Read More

Dos and Don’ts of Machine Learning in Computer Security - Part 2

Presentation

The presentation covered the second part of the paper, focusing on the prevalence of the common pitfalls in 30 papers published over the last 10 years at the top 4 security-related conferences. As a recap of the first part of the paper, the common pitfalls are listed below:

  • P1 - Sampling Bias
  • P2 - Label Inaccuracy
  • P3 - Data Snooping
  • P4 - Spurious Correlation
  • P5 - Biased Parameter Selection
  • P6 - Inappropriate Baseline
  • P7 - Inappropriate Performance Measures
  • P8 - Base Rate Fallacy
  • P9 - Lab-Only Evaluation
  • P10 - Inappropriate Threat Model

Read More

Dos and Don’ts of Machine Learning in Computer Security - Part 1

Presentation

The paper deals with the major pitfalls that are prevalent in machine learning applied to cybersecurity, and how to mitigate them. The authors also analyze popular literature in the field to observe which pitfalls those papers have fallen into. This first discussion explains the pitfalls and how they affect machine learning in cybersecurity; the analysis of the other literature is left for the next discussion. A conscious effort must be made to avoid these common pitfalls, as they negatively affect the workflow and give falsely optimistic metrics, leading to models that eventually fail in the real world. The pitfalls are outlined and elaborated in the following paragraphs, grouped by the stage of the machine learning workflow where they occur.

Data Collection and Labelling

P1 - Sampling Bias

Description: This bias occurs when the collected data does not effectively represent the true data distribution or the real-world scenario.

In the Context of Security: Collecting malware data is extremely challenging, leading researchers to synthetically form new samples from existing ones or to mix different datasets.

Recommendation:

  • The dataset should be considered only as an estimate of the true data distribution, and the assumptions made to justify this estimation should be outlined. This helps researchers analyze the dataset, and the models built on it, in the right context.
  • Mixing different data sources should be avoided wherever possible.
  • Transfer learning and synthetic data generation could help reduce the bias.
Read More

Machine Learning (In) Security - A Stream of Problems (Part 2)

Presentation

This is a continuation of the topics presented in the previous session. The last session covered the different challenges and pitfalls faced during model selection, data collection, attribute extraction, and feature extraction. This summary continues from there, going into the actual model, its evaluation, and an overview of the bigger picture.

Read More

Machine Learning (In) Security - A Stream of Problems (Part 1)

Presentation

The paper mainly deals with the problems that need to be overcome when engineering a machine learning model in the context of cybersecurity. This includes problems in each phase of the process, from selecting a model and acquiring the data to training and evaluating the model.

Read More

Malware Detection on Highly Imbalanced Data through Sequence Modeling

Presentation

The core theme of the paper is how sequence modeling performs on highly imbalanced data. This is made relevant by the fact that in many real-world contexts there is a high imbalance between malware and goodware, so a model that functions well under this condition has more merit. The presentation started off with a few introductory principles of malware detection systems, including how rule-based systems have become outdated due to high false positive rates and the need for skilled individuals to come up with the heuristics. Furthermore, some machine learning concepts used in the paper were briefly discussed, including NLP, RNNs, LSTMs, and BERT.
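
A common first step for making any such model take the imbalance into account is to weight the rare class more heavily in the loss; a minimal sketch with scikit-learn (the 1:99 ratio is an illustrative assumption, not the paper's data):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Pretend stream: 1% malware (label 1), 99% goodware (label 0)
y = np.array([1] * 10 + [0] * 990)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))   # {0: ~0.51, 1: 50.0} -> malware errors weigh ~99x more
# These weights can be passed as class_weight to scikit-learn estimators or
# used as per-sample weights in a sequence model's loss.
```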

Read More