Part 2: Parse the response

We got a response to our DNS query for www.example.com. But what does it say? Let’s find out! Here’s the response we got:

response = b'`V\x81\x80\x00\x01\x00\x01\x00\x00\x00\x00\x03www\x07example\x03com\x00\x00\x01\x00\x01\xc0\x0c\x00\x01\x00\x01\x00\x00R\x9b\x00\x04]\xb8\xd8"'

Our goal is to write a parse_response function that parses this response into a friendly Python object we can explore.

We’ll need the code we wrote in Part 1, so let’s import it:

from part_1 import build_query, DNSQuestion, DNSHeader

2.1: define our DNSRecord class

The answer to our query is going to be in a DNS Record, so we need to define one more class.

from dataclasses import dataclass 

@dataclass
class DNSRecord:
    name: bytes
    type_: int
    class_: int
    ttl: int
    data: bytes 

The fields here are:

  • name: the domain name

  • type_: A, AAAA, MX, NS, TXT, etc (encoded as an integer)

  • class_: always the same (1). We’ll ignore this.

  • ttl: how long the record can be cached for, in seconds. We’ll ignore this.

  • data: the record’s content, like the IP address.
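For reference, here are a few of the integers you’ll see in type_. The numbers are the standard DNS record type values; the RECORD_TYPE_NAMES dict and the type_name helper are hypothetical names made up for this sketch, not something from Part 1.

```python
# A few common record type numbers.
# (RECORD_TYPE_NAMES and type_name are illustrative names, not part of the tutorial code.)
RECORD_TYPE_NAMES = {
    1: "A",
    2: "NS",
    5: "CNAME",
    15: "MX",
    16: "TXT",
    28: "AAAA",
}

def type_name(type_):
    # fall back to a generic TYPE<n> label for anything not in the table
    return RECORD_TYPE_NAMES.get(type_, f"TYPE{type_}")

print(type_name(1))   # A
print(type_name(28))  # AAAA
```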

2.2: parse the DNS header

First, we need to parse the DNS header. Here’s the code to do that:

import struct

def parse_header(reader):
    items = struct.unpack("!HHHHHH", reader.read(12))
    # see "a note on BytesIO" for an explanation of `reader` here
    return DNSHeader(*items)

This mirrors our code from header_to_bytes in Part 1.2: the format string (!HHHHHH) is exactly the same. Each of the 6 fields is a 2-byte integer, so there are 12 bytes in all to read.
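If the format string feels opaque, here’s a small standalone check of what it does, using the first 12 bytes of our response. This is only a sketch of how struct behaves, not part of the parser.

```python
import struct

# the first 12 bytes of our response
header_bytes = b'`V\x81\x80\x00\x01\x00\x01\x00\x00\x00\x00'

# "!" means network (big-endian) byte order, and each "H" is an unsigned 2-byte integer
assert struct.calcsize("!HHHHHH") == 12

fields = struct.unpack("!HHHHHH", header_bytes)
print(fields)  # (24662, 33152, 1, 1, 0, 0)

# packing the same 6 numbers gives back the original bytes
assert struct.pack("!HHHHHH", *fields) == header_bytes
```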

Let’s try it out!

from io import BytesIO
reader = BytesIO(response)
parse_header(reader)
DNSHeader(id=24662, flags=33152, num_questions=1, num_answers=1, num_authorities=0, num_additionals=0)

We’re already getting somewhere! Our response has:

  • an ID of 24662

  • some flags (which we’re going to ignore)

  • 1 question

  • 1 answer
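We’re ignoring the flags, but if you’re curious what 33152 means, here’s a rough sketch that splits it into the fields laid out in RFC 1035, section 4.1.1. The decode_flags helper and its dictionary keys are names made up for this example.

```python
def decode_flags(flags):
    # bit layout, most significant bit first:
    # QR(1) Opcode(4) AA(1) TC(1) RD(1) RA(1) Z(3) RCODE(4)
    return {
        "is_response": bool(flags & 0x8000),
        "opcode": (flags >> 11) & 0xF,
        "authoritative": bool(flags & 0x0400),
        "truncated": bool(flags & 0x0200),
        "recursion_desired": bool(flags & 0x0100),
        "recursion_available": bool(flags & 0x0080),
        "rcode": flags & 0xF,
    }

flags = decode_flags(33152)
print(flags["is_response"], flags["recursion_desired"], flags["rcode"])  # True True 0
```

So this is a response (not a query), recursion was requested and is available, and the response code is 0, which means there was no error.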

a note on BytesIO

This reader argument to parse_header is a BytesIO object. BytesIO lets you keep a pointer to the current position in a byte stream and lets you read from it and advance the pointer.

This is super convenient and it’s going to let us write code like

reader = BytesIO(response)
header = parse_header(reader)
question = parse_question(reader)
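Here’s a tiny standalone demo of that pointer behaviour on some made-up bytes. read advances the position, tell reports it, and seek moves it.

```python
from io import BytesIO

reader = BytesIO(b"abcdef")
print(reader.read(2))  # b'ab'
print(reader.tell())   # 2 (the position advanced past what we read)
print(reader.read(3))  # b'cde'
reader.seek(0)         # jump back to the start
print(reader.read(1))  # b'a'
```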

2.3: parse the domain name (wrong)

Next, we have to parse the question. Here’s the question section of the response, and you can see that it starts with a domain name (www.example.com):

question = reader.read(21)
question
b'\x03www\x07example\x03com\x00\x00\x01\x00\x01'

So really our next task is to parse a domain name. First, here’s a simple version that doesn’t quite work:

def decode_name_simple(reader):
    parts = []
    while (length := reader.read(1)[0]) != 0:
        parts.append(reader.read(length))
    return b".".join(parts)

This:

  • reads a 1-byte length

  • reads that many bytes

  • repeats until the length is 0

  • concatenates all the parts together with a . between each one (['example', 'com'] => 'example.com')
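To watch those steps happen, here’s a variant of the same loop that prints each length and label as it goes. decode_name_verbose is a name made up for this sketch, and the input is just the encoded name on its own.

```python
from io import BytesIO

def decode_name_verbose(reader):
    # same loop as decode_name_simple, plus a print for every label
    parts = []
    while (length := reader.read(1)[0]) != 0:
        part = reader.read(length)
        print(length, part)
        parts.append(part)
    return b".".join(parts)

name = decode_name_verbose(BytesIO(b"\x03www\x07example\x03com\x00"))
# prints:
# 3 b'www'
# 7 b'example'
# 3 b'com'
print(name)  # b'www.example.com'
```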

Let’s use this function to parse the question section.

2.4: parse the question

def parse_question(reader):
    name = decode_name_simple(reader)
    data = reader.read(4)
    type_, class_ = struct.unpack("!HH", data)
    return DNSQuestion(name, type_, class_)

from io import BytesIO
reader = BytesIO(response)
parse_header(reader)
parse_question(reader)
DNSQuestion(name=b'www.example.com', type_=1, class_=1)

Here the type is 1 (which stands for “A”, IP Address), and the class is 1.

2.5: parse the record

Now we’re ready to try to parse the record. Here’s where our decode_name_simple function is going to break down, but we’ll try it anyway:

def parse_record(reader):
    name = decode_name_simple(reader)
    # the type, class, TTL, and data length together are 10 bytes (2 + 2 + 4 + 2 = 10)
    # so we read 10 bytes
    data = reader.read(10)
    # HHIH means 2-byte int, 2-byte-int, 4-byte int, 2-byte int
    type_, class_, ttl, data_len = struct.unpack("!HHIH", data) 
    data = reader.read(data_len)
    return DNSRecord(name, type_, class_, ttl, data)

The record format is defined in section 4.1.3 of RFC 1035.
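As a quick sanity check on the 10-byte arithmetic, here’s the same unpack applied by hand to the 10 bytes that follow the record’s name in our response. This is a standalone sketch, not part of the parser.

```python
import struct

# the 10 bytes after the record's name in our response
fixed_fields = b'\x00\x01\x00\x01\x00\x00R\x9b\x00\x04'

# H = 2 bytes, I = 4 bytes, so HHIH is 2 + 2 + 4 + 2 = 10 bytes
assert struct.calcsize("!HHIH") == 10

type_, class_, ttl, data_len = struct.unpack("!HHIH", fixed_fields)
print(type_, class_, ttl, data_len)  # 1 1 21147 4
```

The data length of 4 tells us that 4 more bytes of record data follow.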

We can run our parse_record code like this, and see it fail:

reader = BytesIO(response)
parse_header(reader)
parse_question(reader)
parse_record(reader)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In [18], line 4
      2 parse_header(reader)
      3 parse_question(reader)
----> 4 parse_record(reader)

Cell In [17], line 2, in parse_record(reader)
      1 def parse_record(reader):
----> 2     name = decode_name_simple(reader)
      3     # the type, class, TTL, and data length together are 10 bytes (2 + 2 + 4 + 2 = 10)
      4     # so we read 10 bytes
      5     data = reader.read(10)

Cell In [14], line 3, in decode_name_simple(reader)
      1 def decode_name_simple(reader):
      2     parts = []
----> 3     while (length := reader.read(1)[0]) != 0:
      4         parts.append(reader.read(length))
      5     return b".".join(parts)

IndexError: index out of range

thwarted by DNS compression

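Oops! It failed. If you modify decode_name_simple to print out each length it reads, you’ll see that at some point it prints a length of 192. The function then tries to read 192 bytes, runs off the end of the response, and the next reader.read(1) returns an empty byte string, which is why indexing it raises an IndexError.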

But there’s no domain name segment here with a length of 192: the maximum length of each part is 63! The first 2 bits of the byte 192 (11000000 in binary) are 11, and any length that starts with the bits 11 is code for “this is compressed”.
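Here’s a small check of that bit pattern, comparing 192 with two ordinary label lengths. It’s only an illustration of the arithmetic.

```python
for length in (3, 63, 192):
    # a byte marks a compressed name when both of its top 2 bits are 1
    is_compressed = (length & 0b1100_0000) == 0b1100_0000
    print(length, format(length, "08b"), is_compressed)
# 3 00000011 False
# 63 00111111 False
# 192 11000000 True
```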

This is happening because our DNS response contains the same domain name more than once (in the question and again in the answer), so DNS uses a simple form of compression to save space: instead of repeating the name, the answer points back to the earlier copy. This didn’t show up when parsing the question because the question holds the first copy of www.example.com, which is written out in full.

So let’s look at the real version of this function, which handles compressed responses. You can find DNS compression in the specification here: RFC 1035, section 4.1.4.

2.6: implement DNS compression

Here’s what the real decode_name function looks like. It’s the most complicated thing in DNS parsing.

def decode_name(reader):
    parts = []
    while (length := reader.read(1)[0]) != 0:
        if length & 0b1100_0000:
            parts.append(decode_compressed_name(length, reader))
            break
        else:
            parts.append(reader.read(length))
    return b".".join(parts)


def decode_compressed_name(length, reader):
    pointer_bytes = bytes([length & 0b0011_1111]) + reader.read(1)
    pointer = struct.unpack("!H", pointer_bytes)[0]
    current_pos = reader.tell()
    reader.seek(pointer)
    result = decode_name(reader)
    reader.seek(current_pos)
    return result

What’s going on here is:

  1. Every time we read a length byte, we check whether its top 2 bits are set. (Like we said before, the maximum length of a component of a DNS name is 63 bytes, so in a normal DNS name part neither of the top 2 bits will ever be set.)

  2. If so, call decode_compressed_name, which:

  • takes the bottom 6 bits of the length byte, plus the next byte, and converts that to an integer called pointer

  • saves our current position in reader

  • goes to the pointer position in the DNS packet and decodes a name

  • restores the current position in reader

  • returns the name

  3. A compressed name is never followed by another label, so after decompressing it we stop reading labels and return.
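To make the pointer arithmetic concrete, here’s the same calculation done by hand on our response. The answer’s name starts at byte 33 (12 bytes of header plus 21 bytes of question). This is a standalone sketch; it masks the 2-byte value with 0x3FFF, which has the same effect as taking the bottom 6 bits of the first byte and appending the second.

```python
import struct

response = b'`V\x81\x80\x00\x01\x00\x01\x00\x00\x00\x00\x03www\x07example\x03com\x00\x00\x01\x00\x01\xc0\x0c\x00\x01\x00\x01\x00\x00R\x9b\x00\x04]\xb8\xd8"'

# the answer's name is just these 2 bytes: b'\xc0\x0c'
(raw,) = struct.unpack("!H", response[33:35])
assert raw >> 14 == 0b11  # the top 2 bits are set, so this is a pointer

# drop the top 2 bits to get the offset into the packet
pointer = raw & 0x3FFF
print(pointer)  # 12

# offset 12 is right after the header, where the question's name begins
print(response[pointer:pointer + 17])  # b'\x03www\x07example\x03com\x00'
```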

This code as implemented actually has a security vulnerability – see Exercise 3 for more about that.

2.7: finish our DNSRecord parsing

Here’s the final parse_record function. We’ve just replaced decode_name_simple in the version from part 2.5 with the new decode_name.

def parse_record(reader):
    name = decode_name(reader)
    data = reader.read(10)
    type_, class_, ttl, data_len = struct.unpack("!HHIH", data)
    data = reader.read(data_len)
    return DNSRecord(name, type_, class_, ttl, data)

Let’s test that it works:

reader = BytesIO(response)
parse_header(reader)
parse_question(reader)
parse_record(reader)
DNSRecord(name=b'www.example.com', type_=1, class_=1, ttl=21147, data=b']\xb8\xd8"')

Hooray!

2.8: parse our DNS packet

Now that we know how to parse each of the pieces, we can put it all together and parse our entire DNS packet.

Previously we were parsing 1 header, 1 question, and 1 record, but that’s actually not how DNS packets work in general: the header has a bunch of numbers (num_questions, num_answers, num_additionals, and num_authorities) that tell us how many records to expect in each section of the packet.

So we should respect that.

Let’s make a class to hold all of the contents of our DNS packet (the header, the questions, and all the records):

from typing import List

@dataclass
class DNSPacket:
    header: DNSHeader
    questions: List[DNSQuestion]
    # don't worry about the exact meaning of these 3 record
    # sections for now: we'll use them in Part 3
    answers: List[DNSRecord]
    authorities: List[DNSRecord]
    additionals: List[DNSRecord]

And here’s the final parsing code:

def parse_dns_packet(data):
    reader = BytesIO(data)
    header = parse_header(reader)
    questions = [parse_question(reader) for _ in range(header.num_questions)]
    answers = [parse_record(reader) for _ in range(header.num_answers)]
    authorities = [parse_record(reader) for _ in range(header.num_authorities)]
    additionals = [parse_record(reader) for _ in range(header.num_additionals)]

    return DNSPacket(header, questions, answers, authorities, additionals)

packet = parse_dns_packet(response)
packet
DNSPacket(header=DNSHeader(id=24662, flags=33152, num_questions=1, num_answers=1, num_authorities=0, num_additionals=0), questions=[DNSQuestion(name=b'www.example.com', type_=1, class_=1)], answers=[DNSRecord(name=b'www.example.com', type_=1, class_=1, ttl=21147, data=b']\xb8\xd8"')], authorities=[], additionals=[])

Now, let’s try to look at the IP address in this response. What’s the IP for www.example.com?

ip = packet.answers[0].data
ip
b']\xb8\xd8"'

Hmm. Looks like we still have a little bit of work to do.

a note on printing binary data

The IP address in the previous record is being printed as b']\xb8\xd8"'. What are the ] and " doing there?

When Python prints out binary strings, by default it shows every byte that happens to be a printable ASCII character as that character, and everything else as an escape sequence like \xb8. Sometimes this is useful, like this:

response[12:30]
b'\x03www\x07example\x03com\x00\x00'

There, you can read www, example, and com, which makes the binary data a little easier to read because those parts of the data actually are text.

But in the case of b']\xb8\xd8"', it’s not very helpful to know that the first byte is a ] in ASCII, because that byte doesn’t actually represent text. Here are a few other ways to approach printing it:

ip_address = b']\xb8\xd8"'
print(ip_address) # the default way
print(ip_address.hex()) # as hexadecimal
print([x for x in ip_address]) # as an array of 4 numbers in base 10
b']\xb8\xd8"'
5db8d822
[93, 184, 216, 34]

In this case the IP address is 93.184.216.34, so the last representation is actually the most readable. Let’s write some code to pretty print the IP address.

2.9: pretty print the IP address

When we get an IPv4 address in a DNS response, it’s not formatted as “1.2.3.4” – instead it’s 4 bytes (1, 2, 3, and 4). So to make it a string we need to pretty print it.

This is pretty simple to do, since ip is a byte string of length 4:

ip[0], ip[1], ip[2], ip[3]
(93, 184, 216, 34)

and the IP address this translates to is 93.184.216.34. Here’s a function to translate the IP to a string:

def ip_to_string(ip):
    return ".".join([str(x) for x in ip])

ip_to_string(packet.answers[0].data)
'93.184.216.34'
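Writing ip_to_string by hand is a nice way to see what’s going on, but Python’s standard library can do the same conversion. Here’s a sketch of two alternatives; both take the raw 4 bytes.

```python
import ipaddress
import socket

ip = b']\xb8\xd8"'

# ipaddress.IPv4Address accepts a packed 4-byte address
print(str(ipaddress.IPv4Address(ip)))  # 93.184.216.34

# socket.inet_ntoa converts a packed IPv4 address to dotted-quad form
print(socket.inet_ntoa(ip))  # 93.184.216.34
```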

2.10: test out all our code

Let’s write a little function to look up any domain name using 8.8.8.8 and print out the IP address.

import socket

TYPE_A = 1

def lookup_domain(domain_name):
    query = build_query(domain_name, TYPE_A)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(query, ("8.8.8.8", 53))

    # get the response
    data, _ = sock.recvfrom(1024)
    response = parse_dns_packet(data)
    return ip_to_string(response.answers[0].data)

This builds the query, sends it to 8.8.8.8, parses the response, and pretty prints the IP address.

Let’s try it out on a few domain names!

lookup_domain("example.com")
'93.184.216.34'
lookup_domain("recurse.com")
'13.225.195.117'
lookup_domain("metafilter.com")
'54.203.56.158'

This parsing code is enough to get us to the next part: writing our DNS resolver!

This code is far from perfect – there are some pretty serious bugs, like this one:

lookup_domain("www.facebook.com")
'9.115.116.97.114.45.109.105.110.105.4.99.49.48.114.192.16'

or this one:

lookup_domain("www.metafilter.com")
'192.16'

But I’ll leave those as a puzzle for you to solve if you want (hint: look at the record type!)