The Rapid7 Project Sonar datasets are amazing resources. They represent scans across the internet, compressed and easy to download. This blog post will focus on two of these datasets:
https://opendata.rapid7.com/sonar.rdns_v2/ (rdns)
https://opendata.rapid7.com/sonar.fdns_v2/ (fdns_a)
Unfortunately, working with these datasets can be a bit slow as the rdns and fdns_a datasets each contain over 10GB of compressed text. My old workflow for using these datasets was not efficient:
ubuntu@client:~$ time gunzip -c fdns_a.json.gz | grep "erbbysam.com"
{"timestamp":"1535127239","name":"blog.erbbysam.com","type":"a","value":"54.190.33.125"}
{"timestamp":"1535133613","name":"erbbysam.com","type":"a","value":"104.154.120.133"}
{"timestamp":"1535155246","name":"www.erbbysam.com","type":"cname","value":"erbbysam.com"}
real 11m31.393s
user 12m29.212s
sys 1m37.672s
I suspected there had to be a faster way of searching these two datasets.
(TL;DR: reverse and sort the domains, then binary search.)
DNS Structure
A defining feature of the DNS system is its tree-like structure. Visiting this page, you are three levels below the root domain:
com
com.erbbysam
com.erbbysam.blog
The grep query above was really looking for a domain name within this tree, not an arbitrary string in the file. If we could shape our dataset into a DNS tree, an equivalent lookup would require only a quick traversal of the tree.
Binary Search
The task of transforming a large dataset into a tree on disk and traversing this tree can be simplified further using a binary search algorithm.
The first step in using a binary search algorithm is to sort the data. One option, matching the format above, would be keys of the form “com.erbbysam.blog”. That would require a slightly more complex, label-aware DNS reversal than necessary. To keep things simple, reverse each line character by character instead:
moc.masybbre.golb,521.33.091.45
moc.masybbre,331.021.451.401
moc.masybbre.www,moc.masybbre
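As a side note, the same per-line transformation can be sketched in a few lines of Go (a hypothetical helper for illustration; the actual preprocessing below just uses jq, tr and rev):

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// record holds the two fields we use from each fdns_a JSON line.
type record struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// reverse flips a line byte by byte, like rev(1). Byte reversal is
// fine here because the dataset lines are ASCII.
func reverse(s string) string {
	b := []byte(s)
	for i, j := 0, len(b)-1; i < j; i, j = i+1, j-1 {
		b[i], b[j] = b[j], b[i]
	}
	return string(b)
}

func main() {
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		var r record
		if err := json.Unmarshal(sc.Bytes(), &r); err != nil {
			continue // skip malformed lines
		}
		// value + "," + name, lowercased, then reversed.
		fmt.Println(reverse(strings.ToLower(r.Value + "," + r.Name)))
	}
}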
There is no one-command solution (that I am aware of) for sorting a dataset that does not fit into memory. To sort these large files, split the data into chunks, sort each chunk, and then merge the results:
# fetch the fdns_a file
wget -O fdns_a.gz https://opendata.rapid7.com/sonar.fdns_v2/2019-01-25-1548417890-fdns_a.json.gz
# extract and format our data
gunzip -c fdns_a.gz | jq -r '.value + "," + .name' | tr '[:upper:]' '[:lower:]' | rev > fdns_a.rev.lowercase.txt
# split the data into chunks to sort
# via https://unix.stackexchange.com/a/350068 -- split and merge code
split -b100M fdns_a.rev.lowercase.txt fileChunk
# remove the old files
rm fdns_a.gz
rm fdns_a.rev.lowercase.txt
# Sort each of the pieces and delete the unsorted one
# via https://unix.stackexchange.com/a/35472 -- use LC_COLLATE=C to sort ., chars
for f in fileChunk*; do LC_COLLATE=C sort "$f" > "$f".sorted && rm "$f"; done
# merge the sorted files with local tmp directory
mkdir -p sorttmp
LC_COLLATE=C sort -T sorttmp/ -muo fdns_a.sort.txt fileChunk*.sorted
# clean up
rm fileChunk*
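If you want to double-check the result, a short Go program (illustrative, not part of DNSGrep) can confirm the merged file is ordered under plain byte comparison, which is what LC_COLLATE=C produces and what a binary search over the file assumes:

package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	f, err := os.Open("fdns_a.sort.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long lines
	prev := ""
	for n := 1; sc.Scan(); n++ {
		line := sc.Text()
		if line < prev { // Go string comparison is byte-wise, matching LC_COLLATE=C
			fmt.Printf("out of order at line %d\n", n)
			os.Exit(1)
		}
		prev = line
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
	fmt.Println("sorted")
}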
More detailed instructions for running this script and the rdns equivalent can be found here:
https://github.com/erbbysam/DNSGrep#run
DNSGrep
Now we can search the data! To accomplish this, I built a simple Go utility that can be found here:
https://github.com/erbbysam/DNSGrep
ubuntu@client:~$ ls -lath fdns_a.sort.txt
-rw-rw-r-- 1 ubuntu ubuntu 68G Feb 3 09:11 fdns_a.sort.txt
ubuntu@client:~$ time ./dnsgrep -f fdns_a.sort.txt -i "erbbysam.com"
104.154.120.133,erbbysam.com
54.190.33.125,blog.erbbysam.com
erbbysam.com,www.erbbysam.com
real 0m0.002s
user 0m0.000s
sys 0m0.000s
That is significantly faster!
The algorithm is pretty simple (a short sketch in Go follows the list):
- Use a binary search algorithm to seek through the file, looking for a substring match against the query.
- Once a match is found, scan backwards in 10KB increments until a non-matching substring is found, marking the start of the matching region.
- From there, scan forwards, returning every exact match.
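To make the idea concrete, here is a simplified sketch in Go. It is not the DNSGrep source: it assumes the sorted file from above, takes a query that has already been reversed, and uses the lower-bound variant of binary search (landing directly on the first match) in place of the backwards 10KB scan:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// lineAt seeks to off and returns the first complete line at or after it,
// along with that line's starting offset. For off > 0 the (possibly
// partial) line we land in is discarded.
func lineAt(f *os.File, off int64) (string, int64, error) {
	if _, err := f.Seek(off, 0); err != nil {
		return "", 0, err
	}
	r := bufio.NewReader(f)
	if off > 0 {
		skipped, err := r.ReadString('\n')
		if err != nil {
			return "", 0, err // no full line left after off
		}
		off += int64(len(skipped))
	}
	line, err := r.ReadString('\n')
	if err != nil {
		return "", 0, err
	}
	return strings.TrimSuffix(line, "\n"), off, nil
}

func main() {
	query := "moc.masybbre" // "erbbysam.com", reversed

	f, err := os.Open("fdns_a.sort.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	info, err := f.Stat()
	if err != nil {
		panic(err)
	}

	// Binary search over byte offsets for the first line >= query.
	lo, hi := int64(0), info.Size()
	for lo < hi {
		mid := lo + (hi-lo)/2
		line, _, err := lineAt(f, mid)
		if err != nil || line >= query {
			hi = mid
		} else {
			lo = mid + 1
		}
	}

	// Every line sharing the query prefix is contiguous from here.
	_, start, err := lineAt(f, lo)
	if err != nil {
		return // query sorts after every line; no match
	}
	if _, err := f.Seek(start, 0); err != nil {
		panic(err)
	}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, query) {
			break // sorted, so the matches have ended
		}
		// Require a boundary (',' or '.') so unrelated names that merely
		// share the prefix, e.g. "xerbbysam.com", are skipped.
		rest := line[len(query):]
		if strings.HasPrefix(rest, ",") || strings.HasPrefix(rest, ".") {
			fmt.Println(line)
		}
	}
}

Each probe touches only a couple of disk pages, and the number of probes grows with the log of the file size, which is why the lookup stays fast even against a 68G file on a spinning disk.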
PoC
PoC disclaimer: There is no uptime/performance guarantee for this service, and I will likely take it offline at some point in the future. Keep in mind that the datasets here come from scans on 1/25/19 (fdns_a) and 1/30/19 (rdns); DNS records may have changed by the time you read this.
As these queries are so quick, I set up an AWS EC2 t2.micro instance with a spinning disk (Cold HDD sc1) and hosted a server that allows queries into these datasets:
https://github.com/erbbysam/DNSGrep/tree/master/experimentalServer
https://dns.bufferover.run/dns?q=erbbysam.com
ubuntu@client:~$ curl 'https://dns.bufferover.run/dns?q=erbbysam.com'
{
"Meta": {
"Runtime": "0.000361 seconds",
"Errors": [
"rdns error: failed to find exact match via binary search"
],
"FileNames": [
"2019-01-25-1548417890-fdns_a.json.gz",
"2019-01-30-1548868121-rdns.json.gz"
],
"TOS": "The source of this data is Rapid7 Labs. Please review the Terms of Service: https://opendata.rapid7.com/about/"
},
"FDNS_A": [
"104.154.120.133,erbbysam.com",
"54.190.33.125,blog.erbbysam.com",
"erbbysam.com,www.erbbysam.com"
],
"RDNS": null
}
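The response maps naturally onto a small struct if you want to consume it programmatically. A minimal Go client sketch, with field names taken from the JSON above (the struct definition is mine, not something the server publishes):

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// dnsResponse mirrors the JSON shown above.
type dnsResponse struct {
	Meta struct {
		Runtime   string
		Errors    []string
		FileNames []string
		TOS       string
	}
	FDNS_A []string
	RDNS   []string // null in the response decodes to a nil slice
}

func main() {
	resp, err := http.Get("https://dns.bufferover.run/dns?q=erbbysam.com")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var r dnsResponse
	if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
		panic(err)
	}
	fmt.Println("runtime:", r.Meta.Runtime)
	for _, line := range r.FDNS_A {
		fmt.Println("fdns:", line)
	}
	for _, line := range r.RDNS {
		fmt.Println("rdns:", line)
	}
}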
Having a bit of fun with this, I queried every North Korean domain name, grepping for the IPs not in North Korean IP space:
https://dns.bufferover.run/dns?q=.kp
ubuntu@client:~$ curl 'https://dns.bufferover.run/dns?q=.kp' 2> /dev/null | grep -v "\"175\.45\.17"
{
"Meta": {
"Runtime": "0.000534 seconds",
"Errors": null,
"FileNames": [
"2019-01-25-1548417890-fdns_a.json.gz",
"2019-01-30-1548868121-rdns.json.gz"
],
"TOS": "The source of this data is Rapid7 Labs. Please review the Terms of Service: https://opendata.rapid7.com/about/"
},
"FDNS_A": [
"175.45.0.178,ns1.portal.net.kp",
],
"RDNS": [
"185.33.146.18,north-korea.kp",
"66.23.232.124,sverjd.ouie.kp",
"64.86.226.78,ns2.friend.com.kp",
"103.120.178.114,dedi.kani28test.kp",
"198.98.49.51,gfw.kp",
"185.86.149.212,hey.kp"
]
}
That’s it! Hopefully this was useful! Give it a try: https://dns.bufferover.run/dns?q=<hostname>