Malicious Content Detection Platform Project Homepage

The Malicious Content Detection Platform is a research effort of the SCRUB center located at UC Berkeley and funded by Intel. The project explores the integration of expert reviewers into a periodically retrained malware detection system, making contributions to both detection performance and performance measurement. This page provides documentation for the project, as well as associated code and data.

Documentation

Reviewer Integration and Performance Measurement for Malware Detection
Brad Miller, Alex Kantchelian, Sadia Afroz, Rekha Bachwani, Riyaz Faizullabhoy, Ling Huang, Vaishaal Shankar,
Michael Carl Tschantz, Tony Wu, George Yiu, Anthony D. Joseph and J.D. Tygar.
16th Conference on Detection of Intrusions, Malware & Vulnerability Assessment (DIMVA), July 2016.

Scalable Platform for Malicious Content Detection Integrating Machine Learning and Manual Review
Brad Miller, PhD Thesis, UC Berkeley.

Code & Data Release

To facilitate future work, we are releasing both code and data (3.6GB) from the project. The project code is released under the Apache 2.0 License. Although the terms of our agreement with VirusTotal do not allow full release of all data, we are able to release 3% of the dataset as well as the full list of scan-ids used in our analysis. Since our approach is data-driven and requires a critical mass of data to produce high-performing models, the 3% of data which we release will not be sufficient to reproduce the published performance results. The performance degradation occurs because machine learning algorithms generally perform better when they have more data available to learn from. To facilitate full reproduction of our results, we release the VirusTotal scan-ids of all data used in our analysis. Given appropriate access to VirusTotal, you may use these scan-ids to obtain the full dataset directly from VirusTotal and then reproduce the full results of our work.
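As a starting point for working with the released scan-ids, the sketch below splits each id into a file hash and a scan time, then groups scans by file. It assumes the ids follow the form used by the VirusTotal v2 API (a 64-character SHA-256 hex digest, a hyphen, and a Unix timestamp). The layout of the released scan-id list is not described here, so that form is an assumption, as are the helper names parse_scan_id and group_by_sample. The code does not contact VirusTotal.

```python
import string
from collections import defaultdict
from datetime import datetime, timezone

HEX_DIGITS = set(string.hexdigits)


def parse_scan_id(scan_id):
    """Split '<sha256>-<unix timestamp>' into (sha256, scan time in UTC).

    Assumption: the id uses the VirusTotal v2 scan_id form.
    """
    sha256, sep, ts = scan_id.strip().rpartition("-")
    if not sep or len(sha256) != 64 or not ts.isdigit():
        raise ValueError("unexpected scan-id form: %r" % scan_id)
    if not all(c in HEX_DIGITS for c in sha256):
        raise ValueError("hash part is not hexadecimal: %r" % scan_id)
    when = datetime.fromtimestamp(int(ts), tz=timezone.utc)
    return sha256.lower(), when


def group_by_sample(scan_ids):
    """Map each file hash to the sorted list of times it was scanned."""
    groups = defaultdict(list)
    for scan_id in scan_ids:
        sha256, when = parse_scan_id(scan_id)
        groups[sha256].append(when)
    return {h: sorted(times) for h, times in groups.items()}
```

Grouping by hash makes it easy to see which files were scanned more than once before requesting the corresponding reports.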

As our original dataset included approximately 778GB of uncompressed raw data, our analysis required a distributed computational framework to provide parallel processing. Our implementation uses Apache Spark, and our original analysis ran on a cluster with 44 cores and 600GB of RAM. Since we only release 3% of the original dataset, we are also able to release a demo which runs the same code as our original analysis, but in miniature on a single machine. To invoke the demo, execute ./local_driver.sh from the root directory of the tarball. The demo script drives a series of executions beginning with VirusTotal records and ending with many of the experiments which we present. To run the demo, you will need a computer with at least 8GB of RAM (preferably 12GB) and 20GB of available disk space. We ran the demo on an Intel Xeon W3550 (4 cores) with 12GB RAM in approximately 6 hours. As the demo executes, the Spark user interface will be accessible on localhost:4040 and our custom interface will be accessible on localhost:9321. As a reference, we provide a cached version of the final results. To view the cached final results, execute ./view_final_results.sh and navigate to localhost:9321.
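Because the demo runs for several hours, it may be worth checking the stated requirements before launching it. The following is a minimal sketch of such a check, not part of the released code. The function names are hypothetical, the thresholds treat 1GB as 2^30 bytes, and the RAM detection relies on sysconf names that are available on Linux.

```python
import os
import shutil

GIB = 1024 ** 3
MIN_RAM_BYTES = 8 * GIB     # minimum RAM stated above (12GB preferred)
MIN_DISK_BYTES = 20 * GIB   # available disk space stated above


def check_resources(ram_bytes, free_disk_bytes):
    """Return a list of problems; an empty list means both minimums are met."""
    problems = []
    if ram_bytes < MIN_RAM_BYTES:
        problems.append("RAM: %.1f GiB found, 8 GiB needed" % (ram_bytes / GIB))
    if free_disk_bytes < MIN_DISK_BYTES:
        problems.append("disk: %.1f GiB free, 20 GiB needed" % (free_disk_bytes / GIB))
    return problems


def detect_ram_bytes():
    """Total physical memory, or None where these sysconf names are missing."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None


if __name__ == "__main__":
    ram = detect_ram_bytes()
    free = shutil.disk_usage(".").free  # free space on the current volume
    if ram is None:
        print("Could not detect RAM; checking disk space only.")
        ram = MIN_RAM_BYTES
    for problem in check_resources(ram, free):
        print("WARNING:", problem)
```

Run it from the root directory of the tarball so that the disk check applies to the volume the demo will write to.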