Unsupervised Clustering for Identification of Malicious Domain Campaigns

Michael Weber, Jun Wang, and Yuchen Zhou

Palo Alto Networks

RESEC 2018 (ASIACCS Workshop)

Abstract
New malicious domain campaigns often include large sets of domains registered in bulk and deployed simultaneously. Early identification of these campaigns can often be accomplished with distance functions or regular expressions of registered domains, but these methods may also miss some campaign domains. Other studies have used time-of-registration features to help identify malicious domains. This paper explores the use of unsupervised clustering based on passive DNS records and other inherent network information to identify domains that may be part of campaigns but resistant to detection by domain name or time-of-registration analysis alone. We have found that using this method, we can achieve up to 2.1x expansion from a seed of known campaign domains with less than 4% false positives. This could be a useful tool to augment other methods of identifying malicious domains.
What is Passive DNS Data?
Passive DNS data can be very useful for detecting malicious campaigns when purely textual approaches fail. A typical passive DNS record consists of the rdata (IP/CNAME/NXDomain), rname (the domain of interest), and rrtype (record type). Two types of passive DNS data are popular in research: aggregated and raw. Aggregated data contains the number of resolutions and the first and last times resolved, whereas raw data contains the time of each individual resolution (not aggregated; users can aggregate it themselves). Some passive DNS data may also include the source of the DNS request, which may be considered sensitive.
We extracted multiple features from passive DNS data, including many that were used in prior literature. All features are listed in the image above.
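To make the difference between raw and aggregated records concrete, here is a minimal sketch of how raw records could be rolled up into aggregated ones. The field names rname, rrtype, and rdata follow the description above; the class names, the timestamp field (assumed to be epoch seconds), and the aggregate function are hypothetical and do not reflect any particular passive DNS provider's schema.

```python
from collections import defaultdict
from typing import NamedTuple


class RawRecord(NamedTuple):
    rname: str      # domain of interest
    rrtype: str     # record type, e.g. "A" or "CNAME"
    rdata: str      # IP / CNAME / NXDomain
    timestamp: int  # time resolved (assumed: epoch seconds)


class AggregatedRecord(NamedTuple):
    rname: str
    rrtype: str
    rdata: str
    count: int       # number of resolutions
    first_seen: int  # first time resolved
    last_seen: int   # last time resolved


def aggregate(records):
    """Group raw records by (rname, rrtype, rdata) and summarize each group."""
    groups = defaultdict(list)
    for r in records:
        groups[(r.rname, r.rrtype, r.rdata)].append(r.timestamp)
    # Sort by key so the output order is deterministic.
    return [
        AggregatedRecord(*key, len(ts), min(ts), max(ts))
        for key, ts in sorted(groups.items())
    ]


# Hypothetical raw observations using reserved example names and addresses.
raw = [
    RawRecord("a.example", "A", "192.0.2.1", 100),
    RawRecord("a.example", "A", "192.0.2.1", 300),
    RawRecord("a.example", "A", "192.0.2.1", 200),
    RawRecord("b.example", "CNAME", "a.example", 150),
]
for row in aggregate(raw):
    print(row)
```

The aggregated form discards the individual timestamps, which is why raw data is the more flexible starting point when time-based features are needed.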
Insight
Equifax committed a huge gaffe in mid-to-late 2017, potentially leaking the sensitive information of millions of users. High-impact events like this, as well as natural disasters such as Hurricane Harvey, occur every now and then and grab the attention of the public, hackers included. As security researchers, we have seen an uptick in malicious campaigns that leverage these trending topics by creating phishing or exploit domains associated with the events. When textual features or passive DNS data alone are not enough for early detection of such domains, a combined approach shines. Our insight starts here: expand passive DNS clusters from seed campaign domains. The results contribute directly to Palo Alto Networks products.
A 10% seed-domain threshold in DBSCAN clusters can more than double the coverage of the malicious seed domains!
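The seed-threshold idea can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes cluster labels have already been produced by DBSCAN over passive DNS features (with -1 marking noise, as scikit-learn's DBSCAN does), and it assumes the threshold is applied as "at least 10% of a cluster's members are known seed domains". The function name expand_seeds and all domain names are hypothetical.

```python
from collections import defaultdict


def expand_seeds(labels, seeds, threshold=0.10):
    """Return non-seed domains that share a cluster with enough seed domains.

    labels:    dict mapping domain -> cluster label (-1 means noise)
    seeds:     set of known campaign domains
    threshold: minimum fraction of seed domains in a cluster (assumed >=)
    """
    clusters = defaultdict(set)
    for domain, label in labels.items():
        if label != -1:  # noise points belong to no cluster
            clusters[label].add(domain)

    expanded = set()
    for members in clusters.values():
        seed_fraction = len(members & seeds) / len(members)
        if seed_fraction >= threshold:
            expanded |= members - seeds
    return expanded


# Hypothetical labels: cluster 0 has 2 seeds among 10 domains (20%),
# cluster 1 has 1 seed among 20 domains (5%), and one seed is noise.
labels = {f"c0-{i}.example": 0 for i in range(10)}
labels.update({f"c1-{i}.example": 1 for i in range(20)})
labels["lonely.example"] = -1
seeds = {"c0-0.example", "c0-1.example", "c1-0.example", "lonely.example"}

candidates = expand_seeds(labels, seeds)
print(sorted(candidates))
```

In this toy input, only cluster 0 clears the threshold, so its eight non-seed members become candidate campaign domains, while cluster 1 and the noise point contribute nothing. A lower threshold expands coverage further at the likely cost of more false positives.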
Paper
Our RESEC 2018 paper can be found here.