When DNS goes bad

This year someone in China misconfigured something, which effectively exported China’s main method of implementing blocks (man-in-the-middle DNS spoofing) semi-globally over the Global Crossing backbone for a few weeks.

Effectively, China’s blocking went global (for certain providers).

This is a little technical, so bear with me while I try to put it into layman’s terms!

When you ask for www.somesite.com, a query is sent to your ISP’s DNS servers asking for the IP address. If those DNS servers don’t know, they in turn ask their upstream DNS servers (if they exist) and so on, until one of them asks the root servers who is responsible for that domain.
These root level servers are spread out geographically, and are the arbiters of whether a domain is resolvable or not.
If they don’t know about a domain, then essentially that domain doesn’t exist, as they are the servers that other servers rely on.
If, for instance, a root level server suddenly decided it didn’t know where to send .cn names, then that entire section of the net would be unresolvable for anyone who used that root server.

This has actually happened at least once already; Swedish .se domains dropped off the internet completely for a few hours to a day (depending on caching) due to a misplaced full stop in October 2009.

This is not what happened in this instance, but hey, it’s the _same_ company (different division) again with another DNS issue.

I’ll start with the infrastructure –

Swedish company NetNod (aka Autonomica) has a DNS root server* here in China – I.ROOT-SERVERS.NET / 192.36.148.17

*Server in this case actually refers to many servers providing a DNS service as i.root-servers.net
i.root-servers.net servers are geographically located all over Asia (and other places).

(See below for a map)

A root server, as stated, is almost the final arbiter of any DNS lookup. It knows which servers service top level domains (TLDs). So it’s the one that gets to tell your DNS server where .com, .net, .cn and .hk queries should be sent.
The DNS servers that ask the root servers cache those answers, so if (as is probable) someone else asks for that domain again, they already know how to answer.

Netnod, like other companies that provide root level servers, uses a mechanism called anycast to deliver users to the best destination server for the DNS query.

[ From Wikipedia – In anycast, there is a one-to-many association between network addresses and network endpoints: each destination address identifies a set of receiver endpoints, but only one of them is chosen at any given time to receive information from any given sender. ]

Anycast operates over BGP, which picks the best route to a destination based on AS (autonomous system) rules.
Anycast, by design, is inherently insecure, as anyone at the right stage of the chain can intercept packets for the anycast address.  In practice this is done by routers at the BGP level of routing, so AS owners rely on each other not to mess around.
Essentially, if you are trusted enough to have an AS, you are trusted enough not to screw up.

[ From Wikipedia – On the Internet, anycast is usually implemented by using BGP to simultaneously announce the same destination IP address range from many different places on the Internet. This results in packets addressed to destination addresses in this range being routed to the “nearest” point on the net announcing the given destination IP address.

AS – An autonomous system (AS) is a collection of connected IP routing prefixes under the control of one or more network operators that presents a common, clearly defined routing policy to the Internet (cf. RFC 1930, Section 3).]

Ok, so now that you have a glossed-over idea about BGP, AS, and anycast, I can continue 🙂

Computers in other countries (mostly on Global Crossing networks, as noted above) were starting to get spurious DNS results.

If you remember from above, the NetNod root server based in China uses anycast via BGP to talk to things asking about DNS.  If we look at the BGP routing for I.ROOT-SERVERS.NET, we can get an idea of how things are laid out from a network perspective.

I.ROOT-SERVERS.NET sits in AS29216
Robtex (which is unfortunately blocked in China), shows the connectivity for that block here –
http://www.robtex.com/as/as29216.html#graph

AS29216 apparently only links to AS8674 (NETNOD-IX).
That AS block talks to quite a few others, including one named AS24151.
AS24151 is controlled by CNNIC.

CNNIC is a China government run .cn domain management organization*
(*In practice. They may or may not be government owned in what passes for “reality” here).

What happened (allegedly, as I haven’t read up completely about this on the dns-operations list) is that another DNS server upstream of AS8674 (most probably on AS24151) came along and said: hey! I’m a root level server.

This “rogue” root server sat in the anycast block in use by I.ROOT-SERVERS.NET and advertised itself as a root node, randomly intercepting traffic (by design, this is how anycast is supposed to behave).  This shouldn’t happen, but as AS24151 is trusted by the other ASes, they accepted its announcement about having a root node server, and other servers started caching its answers.

This started causing all sorts of sporadic mischief, as other servers started caching those “bad” (China firewalled) results from the I.ROOT-SERVERS.NET rogue server.

Normally this would be a regional issue, but as BGP route selection is based on AS paths rather than geographic distance, other ASes that are close (from a network perspective) and also use anycast via BGP would take answers from that node for their queries too.
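
A rough sketch of why network distance beats geography here. This is a heavy simplification that assumes routers compare announcements on AS path length alone (real BGP applies local policy first), and the AS numbers are placeholders from the range set aside for documentation, not the real paths involved in this incident.

```python
def best_route(announcements):
    """Pick the announcement with the shortest AS path (first one wins a tie)."""
    return min(announcements, key=lambda a: len(a["as_path"]))


# Two nodes announce the same anycast prefix. Both paths are hypothetical.
announcements = [
    {"origin": "genuine root node", "as_path": [64496, 64497, 64498, 64499]},
    {"origin": "rogue node", "as_path": [64496, 64500]},
]

chosen = best_route(announcements)

# Once the rogue announcement is filtered out, traffic goes back to the genuine node.
filtered = [a for a in announcements if a["origin"] != "rogue node"]
chosen_after_filter = best_route(filtered)
```

In this toy setup `chosen` is the rogue node simply because it is fewer AS hops away, however far apart the machines are physically, and `chosen_after_filter` is the genuine node again.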

This “rogue” root node was configured to do its DNS in standard China Firewall style, and null route / block servers, e.g. YouTube, Facebook, Twitter…

As it was responding to DNS queries via Anycast in the Root level server AS, other secondary DNS servers and upward were querying it, caching the bad responses, and then null routing those major US based internet services within their own regions.

This started happening intermittently over Global Crossing nodes until the problem was spotted and resolved.

Users in locations as far away as California, Chile, and China (although admittedly here it’s broken by design) were getting DNS results “China style”.

Lots of finger pointing went on until the people running NetNod/Autonomica eventually twigged that this was happening, and stopped accepting BGP / Anycast routes from AS24151 at AS29216, which meant that things would get back to normal after cached entries expired.
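
The "after caching expired" part matters: fixing the routing does not immediately fix the answers people get. A minimal sketch of a DNS-style cache with a time to live (TTL), where the class, the 300 second TTL and the cached values are all illustrative assumptions, and the clock is passed in by hand so the behaviour is easy to follow:

```python
class TtlCache:
    """Tiny cache that keeps an answer until its TTL runs out."""

    def __init__(self):
        self._entries = {}  # name -> (answer, expiry time in seconds)

    def put(self, name, answer, ttl, now):
        self._entries[name] = (answer, now + ttl)

    def get(self, name, now):
        entry = self._entries.get(name)
        if entry is None:
            return None
        answer, expires_at = entry
        if now >= expires_at:
            # Entry is stale: drop it so the next lookup asks upstream again.
            del self._entries[name]
            return None
        return answer


cache = TtlCache()
# A bad answer is cached at t=0 with a (hypothetical) five minute TTL.
cache.put("www.example.com", "bad answer", ttl=300, now=0)
still_bad = cache.get("www.example.com", now=299)  # upstream may already be fixed
expired = cache.get("www.example.com", now=300)    # None, so a fresh lookup happens
```

Until the TTL runs out, the resolver keeps handing out the bad answer it cached, which is why the effects lingered for a while after the route was dropped.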

It’s quite possible this was just a screwup on someone’s part here within CNNIC, but the tin foil hat wearing brigade may think otherwise.  Personally I put this down to either testing purposes, or user error.  Understanding the intricacies of implementation and its implications is harder than it first appears, and it’s easy to screw up.  That said, it did take a lot of coincidence for this to happen like that, and acting like a root server would put a noticeable amount of additional load on the server(s) doing the replies, so it would be noticed at least on that level.

This isn’t the first time this *exact* issue has happened on a global scale either. Network operators in Pakistan did a similar thing in 2008 which affected YouTube globally, with users getting similar bad routing as far away as the UK.

What does this mean for the future?

Trust is a delicate issue, and it looks like people will eventually no longer implicitly trust upstream or downstream providers on BGP to do the right thing.

Ironically Autonomica / NetNod are some of the people involved with making sure this kind of thing *doesn’t* happen again!

Autonomica are involved quite heavily in something called DNSSEC.

DNS queries don’t mandate security, so a query can be answered by a server in the right place, at the wrong time (as seen above). With DNSSEC, the answers are digitally signed with a key belonging to the zone, so any rogue server in place should not be able to pass off its own answers, as it doesn’t have the correct credentials. It also means that a rogue server would be more easily spotted, as the keys can be readily identified for a given zone.

DNSSEC is in the process of being rolled out, and it looks like things like this will only push the rollout to go faster.

Unsurprisingly, China is not quite convinced this is a solution, mostly, I suspect, as this will break their DNS firewalling methods.
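
The idea of a signed answer can be sketched with a toy public-key signature. This is not how DNSSEC is actually implemented (real DNSSEC uses DNSKEY and RRSIG records with proper algorithms and key sizes); it is textbook RSA with a deliberately tiny, insecure key built from two known Mersenne primes, and the record strings and addresses are made up for the example.

```python
import hashlib

# Toy key pair, far too small for real use.
P = 2**89 - 1
Q = 2**107 - 1
N = P * Q                          # public modulus
E = 65537                          # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent, known only to the zone owner


def digest(record):
    """Hash a record and reduce it so it fits under the toy modulus."""
    return int.from_bytes(hashlib.sha256(record.encode()).digest(), "big") % N


def sign(record):
    """Zone owner signs the record with the private exponent."""
    return pow(digest(record), D, N)


def verify(record, signature):
    """Anyone can check the signature with only the public values N and E."""
    return pow(signature, E, N) == digest(record)


genuine = "www.example.com A 192.0.2.10"  # hypothetical record
forged = "www.example.com A 192.0.2.66"   # what a rogue server might answer
signature = sign(genuine)
```

Here `verify(genuine, signature)` passes, while `verify(forged, signature)` fails: the rogue server can change the answer, but without the private key it cannot produce a signature that matches it.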

A DNSSEC rollout map, and a rather excellent talk about this and other DNS issues by Paul Wouters, are here:

http://www.xelerance.com/dnssec/

http://www.xelerance.com/talks/sector/Sector2007DNSSEC.pdf

China is also not without its own DNS issues (other than the deliberately implemented ones) as anyone who lives here has experienced.

May of last year saw most of China’s DNS completely collapse for a day as provider DNSPOD was subject to an inadvertent DoS attack via queries against Baofeng.com. There is a good PDF on that by Ziqian Liu of China Telecom Guangzhou here – https://www.dns-oarc.net/files/workshop-200911/Ziqian_Liu.pdf


Further reading and research materials below:

http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6VNT-4S807WG-G&_user=10&_coverDate=04%2F30%2F2008&_rdoc=1&_fmt=high&_orig=browse&_sort=d&view=c&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5=ccc0471388f3fb33fcecdd3409f4f9cc Pakistan DNS security weakness

http://en.wikipedia.org/wiki/DNSSEC DNS Security

http://royal.pingdom.com/2009/10/13/sweden%25E2%2580%2599s-internet-broken-by-dns-mistake/ Sweden disappears from the net

http://www.netnod.se/dns_root_nameserver.shtml – NetNod’s website

http://www.isoc.org/briefings/020/ – DNS Root server FAQ’s

http://blogs.csoonline.com/1179/chile_nic_explains_great_firewall_incident

https://lists.dns-oarc.net/pipermail/dns-operations/2010-March/005267.html – DNS issue list where this was noted.