diff options
Diffstat (limited to 'static/src/_posts/2015-03-11-rabbit-hole.md')
-rw-r--r-- | static/src/_posts/2015-03-11-rabbit-hole.md | 166 |
1 files changed, 166 insertions, 0 deletions
diff --git a/static/src/_posts/2015-03-11-rabbit-hole.md b/static/src/_posts/2015-03-11-rabbit-hole.md new file mode 100644 index 0000000..0bea0b5 --- /dev/null +++ b/static/src/_posts/2015-03-11-rabbit-hole.md @@ -0,0 +1,166 @@ +--- +title: Rabbit Hole +description: >- + Complex systems sometimes require complex debugging. +tags: tech +--- + +We've begun rolling out [SkyDNS][skydns] at my job, which has been pretty neat. +We're basing a couple future projects around being able to use it, and it's made +dynamic configuration and service discovery nice and easy. + +This post chronicles catching a bug because of our switch to SkyDNS, and how we +discover its root cause. I like to call these kinds of bugs "rabbit holes"; they +look shallow at first, but anytime you make a little progress forward a little +more is always required, until you discover the ending somewhere totally +unrelated to the start. + +## The Bug + +We are seeing *tons* of these in the SkyDNS log: + +``` +[skydns] Feb 20 17:21:15.168 INFO | no nameservers defined or name too short, can not forward +``` + +I fire up tcpdump to see if I can see anything interesting, and sure enough run +across a bunch of these: + +``` +# tcpdump -vvv -s 0 -l -n port 53 +tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes + ... + $fen_ip.50257 > $skydns_ip.domain: [udp sum ok] 16218+ A? unknown. (25) + $fen_ip.27372 > $skydns_ip.domain: [udp sum ok] 16218+ A? unknown. (25) + $fen_ip.35634 > $skydns_ip.domain: [udp sum ok] 59227+ A? unknown. (25) + $fen_ip.64363 > $skydns_ip.domain: [udp sum ok] 59227+ A? unknown. (25) +``` + +It appears that some of our front end nodes (FENs) are making tons of DNS +fequests trying to find the A record of `unknown`. Something on our FENs is +doing something insane and is breaking. + +## The FENs + +Hopping over to my favorite FEN we're able to see the packets in question +leaving on a tcpdump as well, but that's not helpful for finding the root cause. +We have lots of processes running on the FENs and any number of them could be +doing something crazy. + +We fire up sysdig, which is similar to systemtap and strace in that it allows +you to hook into the kernel and view various kernel activites in real time, but +it's easier to use than both. The following command dumps all UDP packets being +sent and what process is sending them: + +``` +# sysdig fd.l4proto=udp +... +2528950 22:17:35.260606188 0 php-fpm (21477) < connect res=0 tuple=$fen_ip:61173->$skydns_ip:53 +2528961 22:17:35.260611327 0 php-fpm (21477) > sendto fd=102(<4u>$fen_ip:61173->$skydns_ip:53) size=25 tuple=NULL +2528991 22:17:35.260631917 0 php-fpm (21477) < sendto res=25 data=.r...........unknown..... +2530470 22:17:35.261879032 0 php-fpm (21477) > ioctl fd=102(<4u>$fen_ip:61173->$skydns_ip:53) request=541B argument=7FFF82DC8728 +2530472 22:17:35.261880574 0 php-fpm (21477) < ioctl res=0 +2530474 22:17:35.261881226 0 php-fpm (21477) > recvfrom fd=102(<4u>$fen_ip:61173->$skydns_ip:53) size=1024 +2530476 22:17:35.261883424 0 php-fpm (21477) < recvfrom res=25 data=.r...........unknown..... tuple=$skydns_ip:53->$fen_ip:61173 +2530485 22:17:35.261888997 0 php-fpm (21477) > close fd=102(<4u>$fen_ip:61173->$skydns_ip:53) +2530488 22:17:35.261892626 0 php-fpm (21477) < close res=0 +``` + +Aha! We can see php-fpm is requesting something over udp with the string +`unknown` in it. We've now narrowed down the guilty process, the rest should be +easy right? + +## Which PHP? + +Unfortunately we're a PHP shop; knowing that php-fpm is doing something on a FEN +narrows down the guilty codebase little. Taking the FEN out of our load-balancer +stops the requests for `unknown`, so we *can* say that it's some user-facing +code that is the culprit. Our setup on the FENs involves users hitting nginx +for static content and nginx proxying PHP requests back to php-fpm. Since all +our virtual domains are defined in nginx, we are able to do something horrible. + +On the particular FEN we're on we make a guess about which virtual domain the +problem is likely coming from (our main app), and proxy all traffic from all +other domains to a different FEN. We still see requests for `unknown` leaving +the box, so we've narrowed the problem down a little more. + +## The Despair + +Nothing in our code is doing any direct DNS calls as far as we can find, and we +don't see any places PHP might be doing it for us. We have lots of PHP +extensions in place, all written in C and all black boxes; any of them could be +the culprit. Grepping through the likely candidates' source code for the string +`unknown` proves fruitless. + +We try xdebug at this point. xdebug is a profiler for php which will create +cachegrind files for the running code. With cachegrind you can see every +function which was ever called, how long spent within each function, a full +call-graph, and lots more. Unfortunately xdebug outputs cachegrind files on a +per-php-fpm-process basis, and overwrites the previous file on each new request. +So xdebug is pretty much useless, since what is in the cachegrind file isn't +necessarily what spawned the DNS request. + +## Gotcha (sorta) + +We turn back to the tried and true method of dumping all the traffic using +tcpdump and perusing through that manually. + +What we find is that nearly everytime there is a DNS request for `unknown`, if +we scroll up a bit there is (usually) a particular request to memcache. The +requested key is always in the style of `function-name:someid:otherstuff`. When +looking in the code around that function name we find this ominous looking call: + +```php +$ipAddress = getIPAddress(); +$geoipInfo = getCountryInfoFromIP($ipAddress); +``` + +This points us in the right direction. On a hunch we add some debug +logging to print out the `$ipAddress` variable, and sure enough it comes back as +`unknown`. AHA! + +So what we surmise is happening is that for some reason our geoip extension, +which we use to get the location data of an IP address and which +`getCountryInfoFromIP` calls, is seeing something which is *not* an IP address +and trying to resolve it. + +## Gotcha (for real) + +So the question becomes: why are we getting the string `unknown` as an IP +address? + +Adding some debug logging around the area we find before showed that +`$_SERVER['REMOTE_ADDR']`, which is the variable populated with the IP address +of the client, is sometimes `unknown`. We guess that this has something to do +with some magic we are doing on nginx's side to populate `REMOTE_ADDR` with the +real IP address of the client in the case of them going through a proxy. + +Many proxies send along the header `X-Forwarded-For` to indicate the real IP of +the client they're proxying for, otherwise the server would only see the proxy's +IP. In our setup I decided that in those cases we should set the `REMOTE_ADDR` +to the real client IP so our application logic doesn't even have to worry about +it. There are a couple problems with this which render it a bad decision, one +being that if some misbahaving proxy was to, say, start sending +`X-Forwarded-For: unknown` then some written applications might mistake that to +mean the client's IP is `unknown`. + +## The Fix + +The fix here was two-fold: + +1) We now always set `$_SERVER['REMOTE_ADDR']` to be the remote address of the +requests, regardless of if it's a proxy, and also send the application the +`X-Forwarded-For` header to do with as it pleases. + +2) Inside our app we look at all the headers sent and do some processing to +decide what the actual client IP is. PHP can handle a lot more complex logic +than nginx can, so we can do things like check to make sure the IP is an IP, and +also that it's not some NAT'd internal ip, and so forth. + +And that's it. From some weird log messages on our DNS servers to an nginx +mis-configuration on an almost unrelated set of servers, this is one of those +strange bugs that never has a nice solution and goes unsolved for a long time. +Spending the time to dive down the rabbit hole and find the answer is often +tedious, but also often very rewarding. + +[skydns]: https://github.com/skynetservices/skydns |