DeepSign: Deep Learning for Automatic Malware Signature Generation and Classification
The paper presents a novel approach for generating malware signatures that are representative of a program's behavior and robust to small changes. Unlike the other papers discussed in this course, it works with data collected through dynamic analysis of the malware, and it uses unsupervised learning to form the signature. To this end, the authors build their architecture using Deep Belief Networks (deep neural networks composed of unsupervised building blocks such as Restricted Boltzmann Machines) and a deep stack of denoising autoencoders.

Each malware file is first run in a sandbox to generate logs that capture the program's behavior. The sandbox used is Cuckoo Sandbox, a standard tool for this purpose; it records native function and Windows API call traces, including details of files created and deleted, IP addresses, URLs, ports accessed, registry activity, and so on. These logs are then converted into a binary bit-string using unigram (1-gram) extraction, which serves as the input to the neural network. Unigrams that appear in all files are removed, since they cannot distinguish between files, and only the 20,000 unigrams with the highest frequency are kept, yielding a 20,000-bit string. This bit-string is fed into the Deep Belief Network, realized as a deep denoising autoencoder of eight layers that is trained greedily, with only one layer trained at each step. At the end of the process, each file is reduced to a signature of 30 floating-point values. The specific hyperparameters of the models are provided in the paper; for the sake of brevity I skip them here.

The quality of these signatures is evaluated by training models in a supervised manner on top of them, after which each model's efficiency at detecting malware is tested using conventional methods. An SVM and a KNN model trained on the signatures achieved accuracies of 96.4% and 95.3% respectively. A deep neural network, formed by adding six output neurons on top of the DBN, was also trained on the signatures and achieved 98.6% accuracy on the test data.
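As a rough illustration of the pipeline described above, here is a minimal sketch in Python. The 20,000-unigram vocabulary and the eight-layer, 30-float-output structure follow the paper's description, but the intermediate layer widths, function names, and training details are my own assumptions (the paper's actual hyperparameters are skipped above):

```python
from collections import Counter

import torch
import torch.nn as nn

VOCAB_SIZE = 20_000  # top unigrams kept, per the paper

def build_vocabulary(log_files):
    """Pick the highest-frequency unigrams, dropping any that occur in every file."""
    doc_freq, total_freq = Counter(), Counter()
    for tokens in log_files:                 # each log is a list of unigram tokens
        doc_freq.update(set(tokens))
        total_freq.update(tokens)
    # Unigrams present in all files cannot distinguish between files.
    candidates = [t for t in total_freq if doc_freq[t] < len(log_files)]
    top = sorted(candidates, key=lambda t: -total_freq[t])[:VOCAB_SIZE]
    return {tok: i for i, tok in enumerate(top)}

def to_bitstring(tokens, vocab):
    """Convert one sandbox log into a binary vector of len(vocab) bits."""
    vec = torch.zeros(len(vocab))
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] = 1.0
    return vec

# Illustrative layer widths: a 20,000-bit input squeezed down to a
# 30-float signature (the intermediate widths are assumptions).
LAYER_SIZES = [20_000, 5_000, 2_500, 1_000, 500, 250, 100, 30]

def train_layer(encoder, decoder, batches, noise=0.2, epochs=10, lr=1e-3):
    """Train a single denoising-autoencoder layer to reconstruct its input."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x in batches:
            corrupted = x * (torch.rand_like(x) > noise).float()  # mask-out noise
            loss = loss_fn(decoder(encoder(corrupted)), x)
            opt.zero_grad()
            loss.backward()
            opt.step()

def build_signature_model(batches):
    """Greedy layer-wise training: only one layer is trained at each step."""
    encoders = []
    for d_in, d_out in zip(LAYER_SIZES, LAYER_SIZES[1:]):
        enc = nn.Sequential(nn.Linear(d_in, d_out), nn.Sigmoid())
        dec = nn.Sequential(nn.Linear(d_out, d_in), nn.Sigmoid())
        train_layer(enc, dec, batches)
        with torch.no_grad():                # freeze and push data through
            batches = [enc(x) for x in batches]
        encoders.append(enc)
    return nn.Sequential(*encoders)          # bit-string -> 30-float signature
```

A supervised model (SVM, KNN, or a DNN with an added output layer, as in the paper's evaluation) would then be trained on the resulting 30-float signatures.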
Discussion
- The concept of new malware variants has been discussed in class before, mainly under the broader topic of concept drift. Concept drift, however, also covers entirely new malware types, not just slight variations of existing malware files.
- The choice of dynamic analysis over static analysis is also of interest. Dynamic analysis carries more representational power than static analysis: it captures information inherently linked to the behavior of the malware rather than proxies such as static headers, and can therefore yield better signatures. However, it is also harder to use in the detection stage of a pipeline, since regular end-user systems lack the resources to run it. Dynamic analysis would therefore be placed later in the pipeline, after the techniques discussed in class, and invoked only for files whose classification confidence falls below a specified threshold (see the sketch after this list). It might even serve as an oracle for concept drift.
- The paper claims that the presented solution is a general one, yet the authors undercut this claim by evaluating on datasets that focus on specific classes of malware.
- It is also worth noting that machine learning techniques can never fully identify or make inferences about new types of malware. Machine learning inherently works from data and statistics, so it cannot reasonably be expected to perform well on cases for which no data is present. Techniques like outlier detection can flag files that deviate from the data the model has seen, but it still takes a human analyst to examine a genuinely new file and draw useful inferences about it.
- The feature spaces of dynamic and static analysis are different, which is why droppers work so effectively: a dropper causes a huge deviation in the static feature space while keeping the dynamic behavior constant. This raises the question of what adversarial samples designed against dynamic analysis would look like. They would probably be malicious programs whose malicious activities are interwoven with, and spread out among, a large amount of benign activity. The actions might even be spread so thin that none is harmful on its own, while their combined effect is malicious.
- Both this paper and the last paper discussed deal with signatures; where they differ is their data collection method. While the last paper worked through static means, this paper relies on dynamic means. This single choice by the authors changes where the solution would be deployed, its position in the pipeline, its targeted users, and so on.
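To make the pipeline-placement idea from the dynamic-analysis point above concrete, here is a hypothetical triage sketch. The classifier interface, the threshold value, and all names are my own illustration, not anything specified in the paper:

```python
CONFIDENCE_THRESHOLD = 0.9  # hypothetical cutoff, tuned per deployment

def triage(sample, static_model, dynamic_pipeline):
    """Send a sample to costly dynamic analysis only when the cheap
    static classifier is not confident enough in its verdict."""
    label, confidence = static_model.predict(sample)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label                           # trust the static verdict
    # Low confidence: run the sandbox, derive a signature, classify it.
    signature = dynamic_pipeline.sign(dynamic_pipeline.run_sandbox(sample))
    return dynamic_pipeline.classify(signature)
```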