AutoYara

This post is about creating YARA rules that match malware files using a machine learning model. It was a course project, where we were supposed to come up with some improvement on an existing paper, so we decided to see if we could get AutoYara up and running and then maybe do some experiments on it. The link to the original work is here. Oh, and by ‘we’, which I use throughout this post, I mean me and my friend Soumya. We also had the guidance of Dr. Marcus Botacin, who was gracious enough to give his two cents on our issues even though it was not his course.

First of all, the entire original project by the authors is in Java, and we decided to reimplement it in Python. The idea was that once it was in Python, we could make use of all the existing machine learning libraries and mess around with it. Dr. Botacin did suggest that a possible improvement would be to reimplement the project in C, as that could make it more efficient in general. Due to lack of time, we decided not to go down that path, and we put the C idea on the board for future work. At this point, a short high-level rundown of how AutoYARA works might be helpful.

AutoYARA creates YARA rules from a small number of input files, maybe even 3 or 4. Ideally these files should be from the same family, so that the system can identify common, unique patterns and use them to build the YARA rule.

The first phase is to extract the important ngrams from these files. How do you identify what’s important and what’s not? The authors have created a list of ngrams frequently seen in the malware and goodware domains, using datasets like Ember. This information is stored in Bloom filters, and these Bloom filters are used during YARA generation to pick out the important ngrams from the input malware files. Other filtering steps are also done, but by the end of it you have a set of ngrams.

The second phase is to run a co-clustering algorithm on the ngrams and the input malware files (samples). A co-clustering algorithm creates clusters by exploring the relationship between two kinds of entities, in our case ngrams and samples. So, for example, a cluster would hold a set of ngrams and a set of samples, indicating that these samples contain these ngrams. AutoYARA uses Spectral Co-Clustering with a Variational Bayesian Gaussian Mixture Model (VBGMM) for this purpose. The advantage of VBGMM is that you do not have to tell the algorithm how many clusters to form, which is good since we don’t really know that number in advance.

Once you have these clusters, you can create the YARA rule from them. The intuition is that if a future file falls into any of these clusters, the YARA rule should match. So the rule would have logic similar to “if the file belongs to cluster 1 or belongs to cluster 2 or belongs to cluster 3”, then the YARA rule matches. Belonging to a cluster is determined by the ngrams, so the rule would be like “if the file has ngrams (1, 2 and 3) or ngrams (5, 6 and 7) or ngrams (8 and 9)”, then the YARA rule matches.

Notice how the rule is a set of AND conditions connected by OR conditions, so basically a disjunction of conjunctions. Another important point is that a file does not have to contain all of ngrams (1, 2 and 3). It is more like this: if the file contains 2 of the 3, we can consider it part of the cluster and hence decide that the YARA rule matches.
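
To make the rule shape concrete, here is a minimal sketch of turning clusters of ngrams into YARA rule text. It is not the authors' generator: the function names, the `$c<cluster>_<index>` naming scheme, and the `min_fraction` threshold (how many ngrams of a cluster must be present) are all assumptions made for illustration, and real ngrams would be far longer than the toy bytes used here.

```python
import math


def to_hex_string(ngram: bytes) -> str:
    """Render raw bytes as a YARA hex string, e.g. b"MZ" -> "{ 4D 5A }"."""
    return "{ " + " ".join(f"{b:02X}" for b in ngram) + " }"


def build_rule(name: str, clusters: list[list[bytes]], min_fraction: float = 0.6) -> str:
    """Build rule text that matches when any single cluster is mostly present.

    `min_fraction` is a hypothetical knob: the share of a cluster's ngrams
    that must appear in a file for the file to count as part of that cluster.
    """
    strings, clauses = [], []
    for ci, ngrams in enumerate(clusters):
        for ni, ngram in enumerate(ngrams):
            strings.append(f"        $c{ci}_{ni} = {to_hex_string(ngram)}")
        # "N of ($c0_*)" is YARA's way of saying "at least N of these strings".
        needed = max(1, math.ceil(min_fraction * len(ngrams)))
        clauses.append(f"({needed} of ($c{ci}_*))")
    return "\n".join([
        f"rule {name}",
        "{",
        "    strings:",
        *strings,
        "    condition:",
        # Clusters are OR-ed together: belonging to any one cluster is enough.
        "        " + " or ".join(clauses),
        "}",
    ])


# Toy clusters (made-up bytes): one with three ngrams, one with two.
toy_clusters = [
    [b"\x4d\x5a\x90", b"\x50\x45\x00", b"\xde\xad\xbe"],
    [b"\xca\xfe", b"\xba\xbe"],
]
print(build_rule("toy_family", toy_clusters))
```

Each cluster becomes one "N of" group, and the groups are joined with `or`, which is exactly the OR-of-ANDs structure with the "2 out of 3 is enough" relaxation described above.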

On to the implementation. The first major trouble was the ngrams themselves. The authors use really large ngrams, on the order of 1024 bytes. Extracting such large ngrams from a dataset and creating Bloom filters ourselves is a difficult task in itself. KiloGrams is the technique the authors used to extract the top-k large ngrams from a dataset efficiently with limited memory. I believe KiloGrams was created by the same team as well. We did not have the time to reimplement KiloGrams from scratch too, and hence wanted to reuse the Bloom filters that the authors have made publicly available in Java. Yes, the dataset from which the Bloom filters were created is a little outdated (2017), but it might still serve our purpose for experimenting. So we created a bridge between our Python application and a Java application that uses the authors’ Java implementation of ngram extraction and filtering. Once we have the filtered set of ngrams that has passed through the Bloom filters, we take over and do the rest in Python.
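
For readers who have not met Bloom filters before, the following is a small self-contained sketch of the filtering idea. It is not the authors' Java code and cannot read their published filters. The class, the function names, the bit and hash counts, and the 4-byte ngrams are all assumptions for illustration, and the sketch assumes that ngrams found in the "frequently seen" filter are the ones to discard.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive several independent bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def sliding_ngrams(data: bytes, n: int) -> set[bytes]:
    """All distinct byte ngrams of length n in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def filter_ngrams(ngrams: set[bytes], frequently_seen: BloomFilter) -> set[bytes]:
    """Keep only ngrams that the 'frequently seen' filter does not recognise."""
    return {g for g in ngrams if g not in frequently_seen}


# Toy usage: pretend a run of zero bytes is common across many files.
common = BloomFilter()
common.add(b"\x00\x00\x00\x00")
sample = b"\x00\x00\x00\x00\xde\xad\xbe\xef"
print(sorted(filter_ngrams(sliding_ngrams(sample, 4), common)))
```

The appeal of the structure is its size: the filter stores only bits, not the ngrams themselves, which matters when each ngram is around a kilobyte long.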

The next major hurdle was that the Spectral Co-Clustering implementation available in libraries like scikit-learn uses KMeans and does not support methods like VBGMM. This means that we have to specify the number of clusters for it to work. Alright, let’s specify the number of clusters, and maybe we can mess around with it and see its influence on the results.
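
As an illustration of what the KMeans-based version looks like, here is a minimal sketch using scikit-learn's `SpectralCoclustering` on a toy presence matrix. The matrix is made up for the example (rows are samples, columns are ngrams), and `n_clusters=2` is precisely the value that has to be picked by hand.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# presence[i, j] == 1 means "sample i contains ngram j".
# Toy data (an assumption): two groups of samples, linked by one shared ngram.
presence = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
])

n_clusters = 2  # must be chosen up front with the KMeans-based implementation
model = SpectralCoclustering(n_clusters=n_clusters, random_state=0)
model.fit(presence)

# Each co-cluster is a set of samples together with a set of ngrams.
for cluster_id in range(n_clusters):
    sample_idx, ngram_idx = model.get_indices(cluster_id)
    print(f"cluster {cluster_id}: samples {sample_idx.tolist()}, ngrams {ngram_idx.tolist()}")
```

One practical caveat: the algorithm normalises by row and column sums, so samples with no surviving ngrams (all-zero rows) and ngrams that appear in no sample (all-zero columns) should be dropped before fitting.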

With all these compromises that we had to make due to the limited time frame, we implemented AutoYARA and managed to get YARA rules out of it. We tested it on malware from the same family downloaded from VX-underground, and the results were, to be honest, not that great. The main conclusion was that the number of clusters and the ngram size mattered a lot. For each malware family there was at least one combination of ngram size and number of clusters that performed extremely well, but the hard part was identifying it. The cluster count issue would probably be taken care of by employing VBGMM instead of KMeans, which means we would not have to specify the number of clusters. The authors of AutoYARA also evaluate the effectiveness of the generated rules across multiple ngram sizes, in a way that we were not able to cover yet. So the lackluster results were more an issue of the compromises we had to make and not of the technique itself.
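
To show why a variational Bayesian mixture removes the need to pick the cluster count, here is a minimal sketch with scikit-learn's `BayesianGaussianMixture`. It runs on made-up 2-D points rather than on a real spectral embedding of samples and ngrams, and it is not how the original AutoYARA wires VBGMM into co-clustering; the data, the prior value, and the upper bound of 6 components are all assumptions.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Toy data (an assumption): two well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])

max_clusters = 6  # only an upper bound, not the number of clusters to form
vbgmm = BayesianGaussianMixture(
    n_components=max_clusters,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.01,  # a small prior favours few active components
    max_iter=500,
    random_state=0,
)
labels = vbgmm.fit_predict(points)

# Components that end up with no points assigned are effectively switched off.
used_clusters = np.unique(labels)
print(f"allowed up to {max_clusters} clusters, actually used {len(used_clusters)}")
```

The model is given room for more components than it needs and is free to leave the extra ones unused, so the cluster count comes out of the fit instead of going in as a parameter.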

So yes, there’s a lot of future work to be done on this before calling it a closed project, so watch out for that. The source for our Pythonic, limited (at least for now) implementation is here.

Written on December 1, 2023