summaryrefslogtreecommitdiff
path: root/_posts/2015-03-11-rabbit-hole.md
diff options
context:
space:
mode:
Diffstat (limited to '_posts/2015-03-11-rabbit-hole.md')
-rw-r--r--_posts/2015-03-11-rabbit-hole.md165
1 files changed, 165 insertions, 0 deletions
diff --git a/_posts/2015-03-11-rabbit-hole.md b/_posts/2015-03-11-rabbit-hole.md
new file mode 100644
index 0000000..97c2b80
--- /dev/null
+++ b/_posts/2015-03-11-rabbit-hole.md
@@ -0,0 +1,165 @@
+---
+title: Rabbit Hole
+description: >-
+ Complex systems sometimes require complex debugging.
+---
+
+We've begun rolling out [SkyDNS][skydns] at my job, which has been pretty neat.
+We're basing a couple future projects around being able to use it, and it's made
+dynamic configuration and service discovery nice and easy.
+
+This post chronicles catching a bug because of our switch to SkyDNS, and how we
+discover its root cause. I like to call these kinds of bugs "rabbit holes"; they
+look shallow at first, but anytime you make a little progress forward a little
+more is always required, until you discover the ending somewhere totally
+unrelated to the start.
+
+## The Bug
+
+We are seeing *tons* of these in the SkyDNS log:
+
+```
+[skydns] Feb 20 17:21:15.168 INFO | no nameservers defined or name too short, can not forward
+```
+
+I fire up tcpdump to see if I can see anything interesting, and sure enough run
+across a bunch of these:
+
+```
+# tcpdump -vvv -s 0 -l -n port 53
+tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
+ ...
+ $fen_ip.50257 > $skydns_ip.domain: [udp sum ok] 16218+ A? unknown. (25)
+ $fen_ip.27372 > $skydns_ip.domain: [udp sum ok] 16218+ A? unknown. (25)
+ $fen_ip.35634 > $skydns_ip.domain: [udp sum ok] 59227+ A? unknown. (25)
+ $fen_ip.64363 > $skydns_ip.domain: [udp sum ok] 59227+ A? unknown. (25)
+```
+
+It appears that some of our front end nodes (FENs) are making tons of DNS
+fequests trying to find the A record of `unknown`. Something on our FENs is
+doing something insane and is breaking.
+
+## The FENs
+
+Hopping over to my favorite FEN we're able to see the packets in question
+leaving on a tcpdump as well, but that's not helpful for finding the root cause.
+We have lots of processes running on the FENs and any number of them could be
+doing something crazy.
+
+We fire up sysdig, which is similar to systemtap and strace in that it allows
+you to hook into the kernel and view various kernel activites in real time, but
+it's easier to use than both. The following command dumps all UDP packets being
+sent and what process is sending them:
+
+```
+# sysdig fd.l4proto=udp
+...
+2528950 22:17:35.260606188 0 php-fpm (21477) < connect res=0 tuple=$fen_ip:61173->$skydns_ip:53
+2528961 22:17:35.260611327 0 php-fpm (21477) > sendto fd=102(<4u>$fen_ip:61173->$skydns_ip:53) size=25 tuple=NULL
+2528991 22:17:35.260631917 0 php-fpm (21477) < sendto res=25 data=.r...........unknown.....
+2530470 22:17:35.261879032 0 php-fpm (21477) > ioctl fd=102(<4u>$fen_ip:61173->$skydns_ip:53) request=541B argument=7FFF82DC8728
+2530472 22:17:35.261880574 0 php-fpm (21477) < ioctl res=0
+2530474 22:17:35.261881226 0 php-fpm (21477) > recvfrom fd=102(<4u>$fen_ip:61173->$skydns_ip:53) size=1024
+2530476 22:17:35.261883424 0 php-fpm (21477) < recvfrom res=25 data=.r...........unknown..... tuple=$skydns_ip:53->$fen_ip:61173
+2530485 22:17:35.261888997 0 php-fpm (21477) > close fd=102(<4u>$fen_ip:61173->$skydns_ip:53)
+2530488 22:17:35.261892626 0 php-fpm (21477) < close res=0
+```
+
+Aha! We can see php-fpm is requesting something over udp with the string
+`unknown` in it. We've now narrowed down the guilty process, the rest should be
+easy right?
+
+## Which PHP?
+
+Unfortunately we're a PHP shop; knowing that php-fpm is doing something on a FEN
+narrows down the guilty codebase little. Taking the FEN out of our load-balancer
+stops the requests for `unknown`, so we *can* say that it's some user-facing
+code that is the culprit. Our setup on the FENs involves users hitting nginx
+for static content and nginx proxying PHP requests back to php-fpm. Since all
+our virtual domains are defined in nginx, we are able to do something horrible.
+
+On the particular FEN we're on we make a guess about which virtual domain the
+problem is likely coming from (our main app), and proxy all traffic from all
+other domains to a different FEN. We still see requests for `unknown` leaving
+the box, so we've narrowed the problem down a little more.
+
+## The Despair
+
+Nothing in our code is doing any direct DNS calls as far as we can find, and we
+don't see any places PHP might be doing it for us. We have lots of PHP
+extensions in place, all written in C and all black boxes; any of them could be
+the culprit. Grepping through the likely candidates' source code for the string
+`unknown` proves fruitless.
+
+We try xdebug at this point. xdebug is a profiler for php which will create
+cachegrind files for the running code. With cachegrind you can see every
+function which was ever called, how long spent within each function, a full
+call-graph, and lots more. Unfortunately xdebug outputs cachegrind files on a
+per-php-fpm-process basis, and overwrites the previous file on each new request.
+So xdebug is pretty much useless, since what is in the cachegrind file isn't
+necessarily what spawned the DNS request.
+
+## Gotcha (sorta)
+
+We turn back to the tried and true method of dumping all the traffic using
+tcpdump and perusing through that manually.
+
+What we find is that nearly everytime there is a DNS request for `unknown`, if
+we scroll up a bit there is (usually) a particular request to memcache. The
+requested key is always in the style of `function-name:someid:otherstuff`. When
+looking in the code around that function name we find this ominous looking call:
+
+```php
+$ipAddress = getIPAddress();
+$geoipInfo = getCountryInfoFromIP($ipAddress);
+```
+
+This points us in the right direction. On a hunch we add some debug
+logging to print out the `$ipAddress` variable, and sure enough it comes back as
+`unknown`. AHA!
+
+So what we surmise is happening is that for some reason our geoip extension,
+which we use to get the location data of an IP address and which
+`getCountryInfoFromIP` calls, is seeing something which is *not* an IP address
+and trying to resolve it.
+
+## Gotcha (for real)
+
+So the question becomes: why are we getting the string `unknown` as an IP
+address?
+
+Adding some debug logging around the area we find before showed that
+`$_SERVER['REMOTE_ADDR']`, which is the variable populated with the IP address
+of the client, is sometimes `unknown`. We guess that this has something to do
+with some magic we are doing on nginx's side to populate `REMOTE_ADDR` with the
+real IP address of the client in the case of them going through a proxy.
+
+Many proxies send along the header `X-Forwarded-For` to indicate the real IP of
+the client they're proxying for, otherwise the server would only see the proxy's
+IP. In our setup I decided that in those cases we should set the `REMOTE_ADDR`
+to the real client IP so our application logic doesn't even have to worry about
+it. There are a couple problems with this which render it a bad decision, one
+being that if some misbahaving proxy was to, say, start sending
+`X-Forwarded-For: unknown` then some written applications might mistake that to
+mean the client's IP is `unknown`.
+
+## The Fix
+
+The fix here was two-fold:
+
+1) We now always set `$_SERVER['REMOTE_ADDR']` to be the remote address of the
+requests, regardless of if it's a proxy, and also send the application the
+`X-Forwarded-For` header to do with as it pleases.
+
+2) Inside our app we look at all the headers sent and do some processing to
+decide what the actual client IP is. PHP can handle a lot more complex logic
+than nginx can, so we can do things like check to make sure the IP is an IP, and
+also that it's not some NAT'd internal ip, and so forth.
+
+And that's it. From some weird log messages on our DNS servers to an nginx
+mis-configuration on an almost unrelated set of servers, this is one of those
+strange bugs that never has a nice solution and goes unsolved for a long time.
+Spending the time to dive down the rabbit hole and find the answer is often
+tedious, but also often very rewarding.
+
+[skydns]: https://github.com/skynetservices/skydns