summaryrefslogtreecommitdiff
path: root/_posts/2015-03-11-rabbit-hole.md
diff options
context:
space:
mode:
Diffstat (limited to '_posts/2015-03-11-rabbit-hole.md')
-rw-r--r--_posts/2015-03-11-rabbit-hole.md165
1 files changed, 0 insertions, 165 deletions
diff --git a/_posts/2015-03-11-rabbit-hole.md b/_posts/2015-03-11-rabbit-hole.md
deleted file mode 100644
index 97c2b80..0000000
--- a/_posts/2015-03-11-rabbit-hole.md
+++ /dev/null
@@ -1,165 +0,0 @@
----
-title: Rabbit Hole
-description: >-
- Complex systems sometimes require complex debugging.
----
-
-We've begun rolling out [SkyDNS][skydns] at my job, which has been pretty neat.
-We're basing a couple future projects around being able to use it, and it's made
-dynamic configuration and service discovery nice and easy.
-
-This post chronicles catching a bug because of our switch to SkyDNS, and how we
-discover its root cause. I like to call these kinds of bugs "rabbit holes"; they
-look shallow at first, but anytime you make a little progress forward a little
-more is always required, until you discover the ending somewhere totally
-unrelated to the start.
-
-## The Bug
-
-We are seeing *tons* of these in the SkyDNS log:
-
-```
-[skydns] Feb 20 17:21:15.168 INFO | no nameservers defined or name too short, can not forward
-```
-
-I fire up tcpdump to see if I can see anything interesting, and sure enough run
-across a bunch of these:
-
-```
-# tcpdump -vvv -s 0 -l -n port 53
-tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
- ...
- $fen_ip.50257 > $skydns_ip.domain: [udp sum ok] 16218+ A? unknown. (25)
- $fen_ip.27372 > $skydns_ip.domain: [udp sum ok] 16218+ A? unknown. (25)
- $fen_ip.35634 > $skydns_ip.domain: [udp sum ok] 59227+ A? unknown. (25)
- $fen_ip.64363 > $skydns_ip.domain: [udp sum ok] 59227+ A? unknown. (25)
-```
-
-It appears that some of our front end nodes (FENs) are making tons of DNS
-fequests trying to find the A record of `unknown`. Something on our FENs is
-doing something insane and is breaking.
-
-## The FENs
-
-Hopping over to my favorite FEN we're able to see the packets in question
-leaving on a tcpdump as well, but that's not helpful for finding the root cause.
-We have lots of processes running on the FENs and any number of them could be
-doing something crazy.
-
-We fire up sysdig, which is similar to systemtap and strace in that it allows
-you to hook into the kernel and view various kernel activites in real time, but
-it's easier to use than both. The following command dumps all UDP packets being
-sent and what process is sending them:
-
-```
-# sysdig fd.l4proto=udp
-...
-2528950 22:17:35.260606188 0 php-fpm (21477) < connect res=0 tuple=$fen_ip:61173->$skydns_ip:53
-2528961 22:17:35.260611327 0 php-fpm (21477) > sendto fd=102(<4u>$fen_ip:61173->$skydns_ip:53) size=25 tuple=NULL
-2528991 22:17:35.260631917 0 php-fpm (21477) < sendto res=25 data=.r...........unknown.....
-2530470 22:17:35.261879032 0 php-fpm (21477) > ioctl fd=102(<4u>$fen_ip:61173->$skydns_ip:53) request=541B argument=7FFF82DC8728
-2530472 22:17:35.261880574 0 php-fpm (21477) < ioctl res=0
-2530474 22:17:35.261881226 0 php-fpm (21477) > recvfrom fd=102(<4u>$fen_ip:61173->$skydns_ip:53) size=1024
-2530476 22:17:35.261883424 0 php-fpm (21477) < recvfrom res=25 data=.r...........unknown..... tuple=$skydns_ip:53->$fen_ip:61173
-2530485 22:17:35.261888997 0 php-fpm (21477) > close fd=102(<4u>$fen_ip:61173->$skydns_ip:53)
-2530488 22:17:35.261892626 0 php-fpm (21477) < close res=0
-```
-
-Aha! We can see php-fpm is requesting something over udp with the string
-`unknown` in it. We've now narrowed down the guilty process, the rest should be
-easy right?
-
-## Which PHP?
-
-Unfortunately we're a PHP shop; knowing that php-fpm is doing something on a FEN
-narrows down the guilty codebase little. Taking the FEN out of our load-balancer
-stops the requests for `unknown`, so we *can* say that it's some user-facing
-code that is the culprit. Our setup on the FENs involves users hitting nginx
-for static content and nginx proxying PHP requests back to php-fpm. Since all
-our virtual domains are defined in nginx, we are able to do something horrible.
-
-On the particular FEN we're on we make a guess about which virtual domain the
-problem is likely coming from (our main app), and proxy all traffic from all
-other domains to a different FEN. We still see requests for `unknown` leaving
-the box, so we've narrowed the problem down a little more.
-
-## The Despair
-
-Nothing in our code is doing any direct DNS calls as far as we can find, and we
-don't see any places PHP might be doing it for us. We have lots of PHP
-extensions in place, all written in C and all black boxes; any of them could be
-the culprit. Grepping through the likely candidates' source code for the string
-`unknown` proves fruitless.
-
-We try xdebug at this point. xdebug is a profiler for php which will create
-cachegrind files for the running code. With cachegrind you can see every
-function which was ever called, how long spent within each function, a full
-call-graph, and lots more. Unfortunately xdebug outputs cachegrind files on a
-per-php-fpm-process basis, and overwrites the previous file on each new request.
-So xdebug is pretty much useless, since what is in the cachegrind file isn't
-necessarily what spawned the DNS request.
-
-## Gotcha (sorta)
-
-We turn back to the tried and true method of dumping all the traffic using
-tcpdump and perusing through that manually.
-
-What we find is that nearly everytime there is a DNS request for `unknown`, if
-we scroll up a bit there is (usually) a particular request to memcache. The
-requested key is always in the style of `function-name:someid:otherstuff`. When
-looking in the code around that function name we find this ominous looking call:
-
-```php
-$ipAddress = getIPAddress();
-$geoipInfo = getCountryInfoFromIP($ipAddress);
-```
-
-This points us in the right direction. On a hunch we add some debug
-logging to print out the `$ipAddress` variable, and sure enough it comes back as
-`unknown`. AHA!
-
-So what we surmise is happening is that for some reason our geoip extension,
-which we use to get the location data of an IP address and which
-`getCountryInfoFromIP` calls, is seeing something which is *not* an IP address
-and trying to resolve it.
-
-## Gotcha (for real)
-
-So the question becomes: why are we getting the string `unknown` as an IP
-address?
-
-Adding some debug logging around the area we find before showed that
-`$_SERVER['REMOTE_ADDR']`, which is the variable populated with the IP address
-of the client, is sometimes `unknown`. We guess that this has something to do
-with some magic we are doing on nginx's side to populate `REMOTE_ADDR` with the
-real IP address of the client in the case of them going through a proxy.
-
-Many proxies send along the header `X-Forwarded-For` to indicate the real IP of
-the client they're proxying for, otherwise the server would only see the proxy's
-IP. In our setup I decided that in those cases we should set the `REMOTE_ADDR`
-to the real client IP so our application logic doesn't even have to worry about
-it. There are a couple problems with this which render it a bad decision, one
-being that if some misbahaving proxy was to, say, start sending
-`X-Forwarded-For: unknown` then some written applications might mistake that to
-mean the client's IP is `unknown`.
-
-## The Fix
-
-The fix here was two-fold:
-
-1) We now always set `$_SERVER['REMOTE_ADDR']` to be the remote address of the
-requests, regardless of if it's a proxy, and also send the application the
-`X-Forwarded-For` header to do with as it pleases.
-
-2) Inside our app we look at all the headers sent and do some processing to
-decide what the actual client IP is. PHP can handle a lot more complex logic
-than nginx can, so we can do things like check to make sure the IP is an IP, and
-also that it's not some NAT'd internal ip, and so forth.
-
-And that's it. From some weird log messages on our DNS servers to an nginx
-mis-configuration on an almost unrelated set of servers, this is one of those
-strange bugs that never has a nice solution and goes unsolved for a long time.
-Spending the time to dive down the rabbit hole and find the answer is often
-tedious, but also often very rewarding.
-
-[skydns]: https://github.com/skynetservices/skydns