DeepSign: Deep Learning for Automatic Malware Signature Generation and Classification
The paper presents a novel approach for generating malware signatures that are representative of a program's behavior and robust to small changes. Unlike the other papers discussed in this course, this one works with data collected through dynamic analysis of the malware, and it uses unsupervised learning to form the signature. To this end, the authors build their architecture on Deep Belief Networks (deep neural networks composed of unsupervised building blocks such as Restricted Boltzmann Machines), implemented here as a deep stack of denoising autoencoders.

The malware files are first run in a sandbox to generate logs capturing the program's behavior. The sandbox used is Cuckoo Sandbox, a standard tool for this purpose; it records native function and Windows API call traces, including details of files created and deleted, IP addresses, URLs, ports accessed, registry activity, and so on. These logs are converted into a binary bit-string of unigram (1-gram) features, which serves as the input to the neural network. Unigrams that appear in every file are removed, since they cannot distinguish between files, and only the 20,000 unigrams with the highest frequency are kept, yielding a 20,000-bit input vector (a sketch of this step appears below).

This bit-string is fed into the Deep Belief Network: a deep denoising autoencoder consisting of eight layers, trained greedily one layer at a time (also sketched below). At the end of this process, the network produces a signature of 30 floating-point values. The specific hyperparameters of the models are provided in the paper; for the sake of brevity I skip them here.

The quality of these signatures is evaluated by training models on them in a supervised manner and then testing how well those models detect malware using conventional methods. An SVM and a KNN model trained on the signatures achieved accuracies of 96.4% and 95.3%, respectively. A deep neural network, formed by adding an output layer of six neurons to the DBN, was also trained on the signatures and achieved 98.6% accuracy on the test data.
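To make the unigram feature-extraction step concrete, here is a minimal Python sketch. It assumes the sandbox reports are available as plain-text strings in a list called `logs`; the whitespace tokenization and the helper names are my own simplifications for illustration, not the authors' code.

```python
# Hypothetical sketch of the unigram feature-extraction step described in the paper.
# Assumes `logs` is a list of sandbox report texts, one per sample; tokenization by
# whitespace is a simplification, not the authors' exact preprocessing.
from collections import Counter

import numpy as np


def build_vocabulary(logs, top_k=20_000):
    """Select the top-k most frequent unigrams, dropping those present in every log."""
    doc_freq = Counter()
    total_freq = Counter()
    for text in logs:
        tokens = text.split()
        total_freq.update(tokens)
        doc_freq.update(set(tokens))

    n_docs = len(logs)
    candidates = [t for t in total_freq if doc_freq[t] < n_docs]  # drop ubiquitous unigrams
    candidates.sort(key=lambda t: total_freq[t], reverse=True)
    return {t: i for i, t in enumerate(candidates[:top_k])}


def to_bit_string(text, vocab):
    """Encode a single log as a 20,000-dimensional binary presence vector."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    for token in set(text.split()):
        idx = vocab.get(token)
        if idx is not None:
            vec[idx] = 1.0
    return vec
```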
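The greedy layer-wise training of the denoising-autoencoder stack could look roughly like the following PyTorch sketch. The layer sizes, corruption level, and optimizer settings here are placeholders (the paper specifies its own eight-layer configuration ending in a 30-value code), so this only conveys the training scheme, not the paper's exact setup.

```python
# Illustrative sketch of greedy layer-wise training of a stack of denoising
# autoencoders, in the spirit of the paper's DBN. Layer sizes, noise level, and
# optimizer settings are placeholders, not the paper's hyperparameters.
import torch
import torch.nn as nn


def train_dae_layer(data, in_dim, out_dim, noise=0.2, epochs=10, lr=1e-3):
    """Train one denoising autoencoder layer and return its encoder half."""
    encoder = nn.Sequential(nn.Linear(in_dim, out_dim), nn.Sigmoid())
    decoder = nn.Sequential(nn.Linear(out_dim, in_dim), nn.Sigmoid())
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    loss_fn = nn.MSELoss()

    for _ in range(epochs):
        for batch in data:                       # data: list of (batch_size, in_dim) tensors
            noisy = batch * (torch.rand_like(batch) > noise).float()  # corrupt the input
            recon = decoder(encoder(noisy))
            loss = loss_fn(recon, batch)         # reconstruct the clean input
            opt.zero_grad()
            loss.backward()
            opt.step()
    return encoder


def build_signature_network(data, layer_sizes=(20_000, 5_000, 1_000, 250, 30)):
    """Greedily train each layer on the codes produced by the layers below it."""
    encoders = []
    for in_dim, out_dim in zip(layer_sizes, layer_sizes[1:]):
        enc = train_dae_layer(data, in_dim, out_dim)
        encoders.append(enc)
        with torch.no_grad():                    # re-encode the data for the next layer
            data = [enc(batch) for batch in data]
    return nn.Sequential(*encoders)              # maps a bit-string to a 30-value signature
```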
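Evaluating signature quality with conventional classifiers is straightforward with scikit-learn; the sketch below assumes arrays `signatures` and `labels` produced by the pretrained network, and the split ratio is arbitrary. The accuracies in the comments are the ones reported in the paper, not values this snippet would reproduce.

```python
# Sketch of the signature-quality evaluation: train conventional classifiers on the
# 30-value signatures. `signatures` and `labels` are assumed to exist already.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    signatures, labels, test_size=0.2, random_state=0
)

svm = SVC().fit(X_train, y_train)
knn = KNeighborsClassifier().fit(X_train, y_train)

print("SVM accuracy:", svm.score(X_test, y_test))   # the paper reports 96.4%
print("KNN accuracy:", knn.score(X_test, y_test))   # the paper reports 95.3%
```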
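Finally, the supervised DNN can be formed by appending an output layer to the pretrained encoder stack and fine-tuning the whole network on labeled data. The six-class output follows the paper's description; the function names and training settings in this sketch are assumptions.

```python
# Sketch of the supervised fine-tuning step: append an output layer (six neurons,
# as in the paper) to the pretrained encoder stack and train end to end with a
# standard cross-entropy objective. Training settings are placeholders.
import torch
import torch.nn as nn


def build_classifier(pretrained_encoders, signature_dim=30, num_classes=6):
    return nn.Sequential(
        pretrained_encoders,                     # stacked encoders from unsupervised pretraining
        nn.Linear(signature_dim, num_classes),   # new output layer for classification
    )


def fine_tune(model, labeled_batches, epochs=10, lr=1e-4):
    """labeled_batches: iterable of (features, class_index) tensor pairs."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in labeled_batches:
            loss = loss_fn(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```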