Automating Analysis of Trends in Malware

Alexander Özkan · October 3, 2017

This post is from 2017; my coding ability has improved significantly since then :)


The aim of this project was to reduce a large data set of individual, unique (to a degree) pieces of malware to a smaller subset, to make reverse engineering viable.

Nowadays there is an overwhelming amount of malware being created and propagated around the world. There are several websites dedicated to cataloging new malware binaries as they become known, and I wanted to find a way to analyse the latest trends without having to reverse engineer several thousand random samples.

This poses a challenge: how do you automatically compare the files in a large data set for similarities?

The most widespread method of computing similarity is what is known as “fuzzy hashing”, which is closely related to locality-sensitive hashing (LSH). Normal hashing just wouldn’t work, as any difference at all between two files results in a completely different computed hash (MD5, SHA, etc.).
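To make the problem with normal hashing concrete, here is a minimal sketch that hashes two buffers differing by a single bit. It is not part of Lavender; the HashDemo class and its method names are hypothetical, and MD5 is used only because it is mentioned above.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;

static class HashDemo
{
    // Hex-encoded MD5 digest of a byte array (32 hex characters).
    public static string Md5Hex(byte[] data)
    {
        using (MD5 md5 = MD5.Create())
        {
            return BitConverter.ToString(md5.ComputeHash(data)).Replace("-", "");
        }
    }

    // Number of character positions at which two hex digests differ.
    public static int DifferingPositions(string a, string b)
    {
        return a.Zip(b, (x, y) => x != y).Count(differs => differs);
    }
}
```

Calling Md5Hex on a 4096-byte buffer and on a copy with one bit flipped gives two digests that look unrelated, so an exact hash can only say "identical" or "not identical". It cannot say "almost identical", which is the property needed here.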

Fuzzy hashing works as a form of “nearest neighbor” finder. It compares two (or more) fundamentally different items and returns a similarity score, represented as a percentage.

Exploring a solution: 

There are a few open source implementations of this, most notably SSDeep, which computes context triggered piecewise hashes (CTPH). To quote the SSDeep project: “CTPH can match inputs that have homologies. Such inputs have sequences of identical bytes in the same order, although bytes in between these sequences may be different in both content and length.”

While SSDeep is a great way to compare malware samples, I decided to implement my own method of resolving such a large data set. It uses random bit sampling, size comparisons and segmented bit sampling to determine similarity between thousands of samples.

Practical Demonstration: 

I started by gathering a set of malware. I sourced it from VirusShare, an invite-only repository of malware that is harvested in real time.

In the interest of simplicity, I started by comparing 64-bit binaries that were stored on VirusShare. This sample of malware dates back to around 2014 but will serve us fine in this demo.

Lavender, as I named it, is a functional prototype implementing the above methods. For the demo, I sampled 100 pieces of malware concurrently. Lavender used an average of 50MB of RAM and averaged 13% utilization of my 8-core AMD Vishera CPU. It was programmed in C# and put together rather quickly as a PoC. My plan is to re-implement the same techniques in C++ as another side project, with an emphasis on speed and accuracy. Overall I am very happy with the performance of the prototype shown below.




The random bit sampling was accomplished very simply. It uses the File.ReadAllBytes method that the .NET Framework provides, which returns the contents of the file as a byte array. I should note that a more efficient way to sample bytes from a file is to create a BinaryReader, seek its BaseStream to the desired offset and then call ReadBytes. This reads only the requested number of bytes, rather than loading the entire file into a buffer as File.ReadAllBytes does.
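A minimal sketch of that more efficient approach, assuming a hypothetical SampleReader helper that is not part of the prototype:

```csharp
using System;
using System.IO;

static class SampleReader
{
    // Reads up to `count` bytes starting at `offset`, without loading the whole file.
    // Returns fewer bytes (possibly none) if the file ends before `count` bytes are read.
    public static byte[] ReadSample(string path, long offset, int count)
    {
        if (offset < 0) throw new ArgumentOutOfRangeException("offset");

        using (FileStream stream = File.OpenRead(path))
        using (BinaryReader reader = new BinaryReader(stream))
        {
            if (offset >= stream.Length) return new byte[0];

            // Jump straight to the sample position instead of skipping through a full buffer.
            reader.BaseStream.Seek(offset, SeekOrigin.Begin);
            return reader.ReadBytes(count);
        }
    }
}
```

A call such as SampleReader.ReadSample(path, skipByte, bytesToRead) could stand in for the ReadAllBytes(...).Skip(...).Take(...).ToArray() chain, which matters most when the same files are re-read many times in a nested loop.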

The random bit sampling function in its primitive prototype form:

// Requires: using System; using System.IO; using System.Linq;
void RandomBitSampling() {
  identicalRandomBytesCount = 0; // Reset count to 0.
  Random rand = new Random();
  int bytesToRead = rand.Next(100, 500); // Select a random sample size in bytes.
  // Take a sample from each file and compare it to the same region of every other file in the folder.
  for (int i = 0; i < allFilePaths.Length; i++) {
    try {
      FileInfo info = new FileInfo(allFilePaths[i]);
      string fileName = info.Name;
      int fileKB = (int)(info.Length / 1024);
      // Pick a random offset. The cap is the file size in KB, not bytes, so the sample comes from near the start of the file.
      int skipByte = rand.Next(fileKB);
      byte[] randomNBytesOfFile = File.ReadAllBytes(allFilePaths[i]).Skip(skipByte).Take(bytesToRead).ToArray(); // Skip to the offset and take n bytes.
      for (int j = 0; j < allFilePaths.Length; j++) {
        if (i == j) continue; // Make sure you're not comparing a file to itself.
        string nextFileName = new FileInfo(allFilePaths[j]).Name;
        byte[] randomNBytesOfNextFile = File.ReadAllBytes(allFilePaths[j]).Skip(skipByte).Take(bytesToRead).ToArray();
        bool isEqual = randomNBytesOfFile.SequenceEqual(randomNBytesOfNextFile);
        if (isEqual) {
          textBoxRawOutput.AppendText("Sample of size " + bytesToRead + " bytes of file:" + fileName + " at depth: " + skipByte + " IS equal to corresponding " + bytesToRead + " bytes of: " + nextFileName + "\n");
          bestFileMatchRandomBytes = fileName;
        } else {
          textBoxRawOutput.AppendText("Sample of size " + bytesToRead + " bytes of file:" + fileName + " at depth: " + skipByte + " is NOT equal to corresponding " + bytesToRead + " bytes of: " + nextFileName + "\n");
        }
      }
    } catch (Exception e) {
      // Assumption: the original handler body is not shown; report the error and move on to the next file.
      textBoxRawOutput.AppendText("Error sampling " + allFilePaths[i] + ": " + e.Message + "\n");
    }
  }
}

Gathering the first and last n bytes of each file was accomplished using the same code, slightly modified: the random offset was removed, and the byte array was reversed to get at the last n bytes.
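As a rough sketch of the same idea, the first and last n bytes can also be read directly by seeking rather than by reversing a full byte array. This is an alternative to what the prototype does, and the HeadTailSampler class and its method names are hypothetical:

```csharp
using System;
using System.IO;

static class HeadTailSampler
{
    // First n bytes of the file (fewer if the file is shorter than n).
    public static byte[] ReadHead(string path, int n)
    {
        using (BinaryReader reader = new BinaryReader(File.OpenRead(path)))
        {
            return reader.ReadBytes(n);
        }
    }

    // Last n bytes of the file, in file order (fewer if the file is shorter than n).
    public static byte[] ReadTail(string path, int n)
    {
        using (FileStream stream = File.OpenRead(path))
        using (BinaryReader reader = new BinaryReader(stream))
        {
            long start = Math.Max(0L, stream.Length - n);
            stream.Seek(start, SeekOrigin.Begin);
            return reader.ReadBytes(n);
        }
    }
}
```

Comparing ReadHead results between two files tends to match more often than a random sample would, since many executables share header bytes.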

Computing file sizes was done using the FileInfo class, specifically its Length property.

long fileSize = new System.IO.FileInfo(allFilePaths[i]).Length;
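One way the size comparison could be done over the whole folder is to group paths by length and keep only the groups with more than one member. This is a sketch under that assumption; the SizeMatcher class is hypothetical and not taken from Lavender:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class SizeMatcher
{
    // Returns groups of paths whose files have exactly the same size in bytes.
    // Files with a unique size are left out.
    public static List<List<string>> GroupBySize(IEnumerable<string> paths)
    {
        return paths
            .GroupBy(path => new FileInfo(path).Length)
            .Where(group => group.Count() > 1)
            .Select(group => group.ToList())
            .ToList();
    }
}
```

An identical size is only a weak hint on its own, so it makes most sense as one signal alongside the byte samples.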


After sampling 100 pieces of 64-bit malware, Lavender had detected several matching random bit samples, as well as matches between entry and exit bit samples (these are more common because of PE headers and the like). It also detected 2 files of identical size.

As shown in the screenshot below, the recommended file for analysis was VirusShare_2ebe59105a5a955361ab3dd16158746d, a file 1.65MB in size.

[Screenshot: Lavender output]

So, I started with some preliminary static analysis of the binary file. My initial analysis tool of choice is PEStudio. Here is the file information:

description,Win32 Cabinet Self-Extractor
version,8.00.7600.16385 (win7_rtm.090713-1255)
date,03:10:2017 - 21:09:39

The file modifies the registry,1
The file is scored (43/62) by virustotal,1
The file embeds a file (Type: CAB, MD5: D19434CA1AA0412200B1199CA0F9209E),1
The file references the Windows Native API,2
The file references the Desktop window,2
The file references child process(es),2
The file references the Windows Setup interface,2
The file queries for files and streams,2
The file references the Event Log,2
The file is self-extractable with IEXPRESS,2
The file imports 1 decorated symbol(s),5
The manifest identity name (wextract) is different than the file name (virusshare_2ebe59105a5a955361ab3dd16158746d),7
The file does not contain a digital certificate,7
The debug file name (wextract.pdb) is different than the file name (virusshare_2ebe59105a5a955361ab3dd16158746d),9

The executable seems to be utilizing the Windows .cab (cabinet) file extractor, indicated by the pdb debug file still linked to the exe as well as the version information:

language,English United States
code-page,Unicode UTF-16, little endian
CompanyName,Microsoft Corporation
FileDescription,Win32 Cabinet Self-Extractor
FileVersion,8.00.7600.16385 (win7_rtm.090713-1255)
LegalCopyright,© Microsoft Corporation. All rights reserved.
ProductName,Windows® Internet Explorer

After doing some preliminary checks using PEStudio, I noticed references to two executable names embedded within the file, SIMURG1.EXE and AMIGO91.EXE, the latter being some form of PUP (potentially unwanted program) posing as a web browser.

Loading the file into IDA revealed the following:

Extraction/decompression and initialization commands: [screenshot]

Calls to rundll and wextract. This appears to be the malware’s unpacking stage via the Windows Cabinet extractor. [screenshot]


Dynamic analysis of the binary would have given a greater understanding of the malware, but that is outside the scope of this post. Static analysis was sufficient to gather the relevant information.

From this it can be gathered that the purpose of this malware was to use the Windows Cabinet self-extractor to drop a payload to disk and execute it.

It also appears that most of the other files of the sample data set were also using some form of extraction, with several others using the Windows Cabinet extractor.

As you can see, resolving the behavior of a set of unique pieces of malware to a common link is incredibly useful for research and security. It enables fast traversal of emerging threats allowing researchers to engineer a defense much faster.

I hope that you found this post interesting; it was fun to make. A good next step would be to re-create this in C++ and improve the efficiency of the code.
