{"id":75,"date":"2019-02-09T14:00:20","date_gmt":"2019-02-09T14:00:20","guid":{"rendered":"https:\/\/blog.erbbysam.com\/?p=75"},"modified":"2019-02-09T21:38:11","modified_gmt":"2019-02-09T21:38:11","slug":"dnsgrep","status":"publish","type":"post","link":"https:\/\/blog.erbbysam.com\/index.php\/2019\/02\/09\/dnsgrep\/","title":{"rendered":"DNSGrep &#8212; Quickly Searching Large DNS Datasets"},"content":{"rendered":"\n<p>The Rapid7 <a rel=\"noreferrer noopener\" aria-label=\" (opens in a new tab)\" href=\"https:\/\/opendata.rapid7.com\/\" target=\"_blank\">Project Sonar<\/a> datasets are amazing resources. They represent scans across the internet, compressed and easy to download. This blog post will focus on two of these datasets:<\/p>\n\n\n\n<p><a rel=\"noreferrer noopener\" aria-label=\" (opens in a new tab)\" href=\"https:\/\/opendata.rapid7.com\/sonar.rdns_v2\/\" target=\"_blank\">https:\/\/opendata.rapid7.com\/sonar.rdns_v2\/<\/a> (rdns)<br><a href=\"https:\/\/opendata.rapid7.com\/sonar.fdns_v2\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\" (opens in a new tab)\">https:\/\/opendata.rapid7.com\/sonar.fdns_v2\/<\/a> (fdns_a)<\/p>\n\n\n\n<p>Unfortunately, working with these datasets can be a bit slow as the rdns and fdns_a datasets each contain over 10GB of compressed text. 
My old workflow for using these datasets was not efficient:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code class=\"bash\">ubuntu@client:~$ time gunzip -c fdns_a.json.gz | grep \"erbbysam.com\"<br>{\"timestamp\":\"1535127239\",\"name\":\"blog.erbbysam.com\",\"type\":\"a\",\"value\":\"54.190.33.125\"}<br>{\"timestamp\":\"1535133613\",\"name\":\"erbbysam.com\",\"type\":\"a\",\"value\":\"104.154.120.133\"}<br>{\"timestamp\":\"1535155246\",\"name\":\"www.erbbysam.com\",\"type\":\"cname\",\"value\":\"erbbysam.com\"}<br>real    11m31.393s<br>user    12m29.212s<br>sys     1m37.672s<\/code><\/pre>\n\n\n\n<p>I suspected there had to be a faster way of searching these two datasets.<\/p>\n\n\n\n<p>(TL;DR: reverse and sort the domains, then binary search)<\/p>\n\n\n\n<h2>DNS Structure<\/h2>\n\n\n\n<p>A defining feature of the DNS system is its tree-like structure. Visiting this page, you are three levels below the <a rel=\"noreferrer noopener\" href=\"https:\/\/en.wikipedia.org\/wiki\/Root_name_server#Root_domain\" target=\"_blank\">root domain<\/a>:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">com<br>com.erbbysam<br>com.erbbysam.blog<\/pre>\n\n\n\n<p>The grep query above looks for a domain name anchored at the root of this tree, not an arbitrary string in the file. <strong>If we could shape our dataset into a DNS tree, an equivalent lookup would only require a quick traversal of this tree.<\/strong><\/p>\n\n\n\n<h2>Binary Search<\/h2>\n\n\n\n<p>The task of transforming a large dataset into a tree on disk and traversing this tree can be simplified further using a <a rel=\"noreferrer noopener\" aria-label=\" (opens in a new tab)\" href=\"https:\/\/en.wikipedia.org\/wiki\/Binary_search_algorithm\" target=\"_blank\">binary search algorithm<\/a>.<\/p>\n\n\n\n<p>The first step in using a binary search algorithm is to sort the data. One option, matching the format above, is the form &#8220;com.erbbysam.blog&#8221;. 
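<\/p>\n\n\n\n<p>The reason for flipping names around before sorting is worth seeing concretely: once names are stored suffix-first, a plain lexicographic sort clusters every name under a zone into one contiguous run, which is exactly what a binary search needs. A minimal Go sketch using full-string reversal (the sample names are illustrative, not rows from the real dataset):<\/p>

```go
package main

import (
	"fmt"
	"sort"
)

// reverse returns s with its bytes in reverse order, mirroring the
// `rev` step in the shell pipeline (fine for ASCII domain names).
func reverse(s string) string {
	b := []byte(s)
	for i, j := 0, len(b)-1; i < j; i, j = i+1, j-1 {
		b[i], b[j] = b[j], b[i]
	}
	return string(b)
}

func main() {
	// Illustrative sample names, not taken from the real dataset.
	names := []string{
		"blog.erbbysam.com",
		"example.org",
		"erbbysam.com",
		"www.erbbysam.com",
		"mail.example.org",
	}
	reversed := make([]string, len(names))
	for i, n := range names {
		reversed[i] = reverse(n)
	}
	sort.Strings(reversed)
	// Every name under erbbysam.com now shares the prefix
	// reverse("erbbysam.com") == "moc.masybbre" and sorts adjacently.
	for _, r := range reversed {
		fmt.Println(r)
	}
}
```

<p>Sorted this way, the two example.org names ("gro.elpmaxe&#8230;") come first, followed by the three erbbysam.com names as one adjacent group.<\/p>\n\n\n\n<p>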
That format would require a slightly more complex DNS reversal algorithm than necessary. To simplify, reverse each line instead:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">moc.masybbre.golb,521.33.091.45<br>moc.masybbre,331.021.451.401<br>moc.masybbre.www,moc.masybbre<\/pre>\n\n\n\n<p>There is no one-command solution (that I am aware of) for sorting a dataset that does not fit into memory. To sort these large files, split the data into sorted chunks and then merge the results together:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code class=\"bash\"># fetch the fdns_a file\nwget -O fdns_a.gz https:\/\/opendata.rapid7.com\/sonar.fdns_v2\/2019-01-25-1548417890-fdns_a.json.gz\n\n# extract and format our data\ngunzip -c fdns_a.gz | jq -r '.value + \",\"+ .name' | tr '[:upper:]' '[:lower:]' | rev &gt; fdns_a.rev.lowercase.txt\n\n# split the data into chunks to sort\n# via https:\/\/unix.stackexchange.com\/a\/350068 -- split and merge code\nsplit -b100M fdns_a.rev.lowercase.txt fileChunk\n\n# remove the old files\nrm fdns_a.gz\nrm fdns_a.rev.lowercase.txt\n\n# sort each of the pieces and delete the unsorted chunk\n# via https:\/\/unix.stackexchange.com\/a\/35472 -- use LC_COLLATE=C to sort ., chars\nfor f in fileChunk*; do LC_COLLATE=C sort \"$f\" &gt; \"$f\".sorted &amp;&amp; rm \"$f\"; done\n\n# merge the sorted files with local tmp directory\nmkdir -p sorttmp\nLC_COLLATE=C sort -T sorttmp\/ -muo fdns_a.sort.txt fileChunk*.sorted\n\n# clean up\nrm fileChunk*<\/code><\/pre>\n\n\n\n<p>More detailed instructions for running this script and the rdns equivalent can be found here:<br><a href=\"https:\/\/github.com\/erbbysam\/DNSGrep#run\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"https:\/\/github.com\/erbbysam\/DNSGrep#run (opens in a new tab)\">https:\/\/github.com\/erbbysam\/DNSGrep#run<\/a><\/p>\n\n\n\n<h2>DNSGrep<\/h2>\n\n\n\n<p>Now we can search the data! 
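<\/p>\n\n\n\n<p>The lookup itself reduces to a prefix binary search over the sorted lines, followed by a forward scan that collects the whole contiguous run of matches. An in-memory Go sketch of the idea (the lines below are illustrative, and the real utility seeks through the file on disk rather than holding it in memory):<\/p>

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// prefixMatches returns every line in the sorted slice that starts with
// prefix. sort.Search locates the first candidate in O(log n) comparisons;
// a forward scan then collects the contiguous run of matching lines.
func prefixMatches(lines []string, prefix string) []string {
	i := sort.Search(len(lines), func(i int) bool { return lines[i] >= prefix })
	var out []string
	for ; i < len(lines) && strings.HasPrefix(lines[i], prefix); i++ {
		out = append(out, lines[i])
	}
	return out
}

func main() {
	// Illustrative sorted, reversed "name,value" lines (not real dataset rows).
	lines := []string{
		"gro.elpmaxe",
		"moc.masybbre",
		"moc.masybbre.golb",
		"moc.masybbre.www",
		"ten.rehto",
	}
	// Searching for erbbysam.com means searching for its reversal.
	for _, m := range prefixMatches(lines, "moc.masybbre") {
		fmt.Println(m)
	}
}
```

<p>A file-backed version works the same way, except each probe seeks to a byte offset and reads forward to the next newline to recover a whole line before comparing.<\/p>\n\n\n\n<p>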
To accomplish this, I built a simple Go utility that can be found here:<br><a href=\"https:\/\/github.com\/erbbysam\/DNSGrep\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\" (opens in a new tab)\">https:\/\/github.com\/erbbysam\/DNSGrep<\/a><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code class=\"bash\">ubuntu@client:~$ ls -lath fdns_a.sort.txt\n-rw-rw-r-- 1 ubuntu ubuntu 68G Feb  3 09:11 fdns_a.sort.txt\nubuntu@client:~$ time .\/dnsgrep -f fdns_a.sort.txt -i \"erbbysam.com\"\n104.154.120.133,erbbysam.com\n54.190.33.125,blog.erbbysam.com\nerbbysam.com,www.erbbysam.com\n\nreal    0m0.002s\nuser    0m0.000s\nsys    0m0.000s\n<\/code>\n<\/pre>\n\n\n\n<p>That is significantly faster!<\/p>\n\n\n\n<p>The algorithm is pretty simple:<\/p>\n\n\n\n<ol><li>Use a binary search algorithm to seek through the file, looking for a substring match against the query.<\/li><li>Once a match is found, the file is scanned backwards in 10KB increments looking for a non-matching substring.<\/li><li>Once a non-matching substring is found, the file is scanned forwards until all exact matches are returned.<\/li><\/ol>\n\n\n\n<h2>PoC<\/h2>\n\n\n\n<p><strong>PoC<\/strong> <strong>disclaimer<\/strong>: There is no uptime or performance guarantee for this service, and I will likely take it offline at some point in the future. 
Keep in mind that the datasets here are from a scan on 1\/25\/19 &#8212; DNS records may have changed by the time you read this.<\/p>\n\n\n\n<p>As these queries are so quick, I set up an AWS EC2 t2.micro instance with a spinning disk (Cold HDD sc1) and hosted a server that allows queries into these datasets:<br><a href=\"https:\/\/github.com\/erbbysam\/DNSGrep\/tree\/master\/experimentalServer\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"https:\/\/github.com\/erbbysam\/DNSGrep\/tree\/master\/experimentalServer (opens in a new tab)\">https:\/\/github.com\/erbbysam\/DNSGrep\/tree\/master\/experimentalServer<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/dns.bufferover.run\/dns?q=erbbysam.com\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\" (opens in a new tab)\">https:\/\/dns.bufferover.run\/dns?q=erbbysam.com<\/a><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code class=\"bash\">ubuntu@client:~$ curl 'https:\/\/dns.bufferover.run\/dns?q=erbbysam.com' \n{\n\t\"Meta\": {\n\t\t\"Runtime\": \"0.000361 seconds\",\n\t\t\"Errors\": [\n\t\t\t\"rdns error: failed to find exact match via binary search\"\n\t\t],\n\t\t\"FileNames\": [\n\t\t\t\"2019-01-25-1548417890-fdns_a.json.gz\",\n\t\t\t\"2019-01-30-1548868121-rdns.json.gz\"\n\t\t],\n\t\t\"TOS\": \"The source of this data is Rapid7 Labs. 
Please review the Terms of Service: https:\/\/opendata.rapid7.com\/about\/\"\n\t},\n\t\"FDNS_A\": [\n\t\t\"104.154.120.133,erbbysam.com\",\n\t\t\"54.190.33.125,blog.erbbysam.com\",\n\t\t\"erbbysam.com,www.erbbysam.com\"\n\t],\n\t\"RDNS\": null\n}\n<\/code><\/pre>\n\n\n\n<p>Having a bit of fun with this, I queried every North Korean domain name, grepping for the IPs not in <a rel=\"noreferrer noopener\" aria-label=\" (opens in a new tab)\" href=\"https:\/\/en.wikipedia.org\/wiki\/Internet_in_North_Korea#IP_address_ranges\" target=\"_blank\">North Korean IP space<\/a>:<br><a rel=\"noreferrer noopener\" aria-label=\" (opens in a new tab)\" href=\"https:\/\/dns.bufferover.run\/dns?q=.kp\" target=\"_blank\">https:\/\/dns.bufferover.run\/dns?q=.kp<\/a><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code class=\"bash\"> ubuntu@client:~$ curl 'https:\/\/dns.bufferover.run\/dns?q=.kp' 2&gt; \/dev\/null | grep -v \"\\\"175\\.45\\.17\"\n{\n\t\"Meta\": {\n\t\t\"Runtime\": \"0.000534 seconds\",\n\t\t\"Errors\": null,\n\t\t\"FileNames\": [\n\t\t\t\"2019-01-25-1548417890-fdns_a.json.gz\",\n\t\t\t\"2019-01-30-1548868121-rdns.json.gz\"\n\t\t],\n\t\t\"TOS\": \"The source of this data is Rapid7 Labs. Please review the Terms of Service: https:\/\/opendata.rapid7.com\/about\/\"\n\t},\n\t\"FDNS_A\": [\n\t\t\"175.45.0.178,ns1.portal.net.kp\",\n\t],\n\t\"RDNS\": [\n\t\t\"185.33.146.18,north-korea.kp\",\n\t\t\"66.23.232.124,sverjd.ouie.kp\",\n\t\t\"64.86.226.78,ns2.friend.com.kp\",\n\t\t\"103.120.178.114,dedi.kani28test.kp\",\n\t\t\"198.98.49.51,gfw.kp\",\n\t\t\"185.86.149.212,hey.kp\"\n\t]\n}<\/code><\/pre>\n\n\n\n<p>That&#8217;s it! Hopefully this was useful! Give it a try: https:\/\/dns.bufferover.run\/dns?q=&lt;hostname&gt;<br><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Rapid7 Project Sonar datasets are amazing resources. They represent scans across the internet, compressed and easy to download. 
This blog post will focus on two of these datasets: https:\/\/opendata.rapid7.com\/sonar.rdns_v2\/ (rdns)https:\/\/opendata.rapid7.com\/sonar.fdns_v2\/ (fdns_a) Unfortunately, working with these datasets can be a bit slow as the rdns and fdns_a datasets each contain over 10GB of compressed text. &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/blog.erbbysam.com\/index.php\/2019\/02\/09\/dnsgrep\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;DNSGrep &#8212; Quickly Searching Large DNS Datasets&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.erbbysam.com\/index.php\/wp-json\/wp\/v2\/posts\/75"}],"collection":[{"href":"https:\/\/blog.erbbysam.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.erbbysam.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.erbbysam.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.erbbysam.com\/index.php\/wp-json\/wp\/v2\/comments?post=75"}],"version-history":[{"count":46,"href":"https:\/\/blog.erbbysam.com\/index.php\/wp-json\/wp\/v2\/posts\/75\/revisions"}],"predecessor-version":[{"id":125,"href":"https:\/\/blog.erbbysam.com\/index.php\/wp-json\/wp\/v2\/posts\/75\/revisions\/125"}],"wp:attachment":[{"href":"https:\/\/blog.erbbysam.com\/index.php\/wp-json\/wp\/v2\/media?parent=75"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.erbbysam.com\/index.php\/wp-json\/wp\/v2\/categories?post=75"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.erbbysam.com\/index.php\/wp-json\/wp\/v2\/tags?post=75"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}