From 8c1c42913deb6a14f060c05e8a230c6aef23d948 Mon Sep 17 00:00:00 2001 From: Brian Picciano Date: Thu, 6 Sep 2018 11:59:01 -0400 Subject: rename post files to have padded 0 in month and day --- _posts/2015-3-11-rabbit-hole.md | 165 ---------------------------------------- 1 file changed, 165 deletions(-) delete mode 100644 _posts/2015-3-11-rabbit-hole.md (limited to '_posts/2015-3-11-rabbit-hole.md') diff --git a/_posts/2015-3-11-rabbit-hole.md b/_posts/2015-3-11-rabbit-hole.md deleted file mode 100644 index 97c2b80..0000000 --- a/_posts/2015-3-11-rabbit-hole.md +++ /dev/null @@ -1,165 +0,0 @@ ---- -title: Rabbit Hole -description: >- - Complex systems sometimes require complex debugging. ---- - -We've begun rolling out [SkyDNS][skydns] at my job, which has been pretty neat. -We're basing a couple future projects around being able to use it, and it's made -dynamic configuration and service discovery nice and easy. - -This post chronicles catching a bug because of our switch to SkyDNS, and how we -discover its root cause. I like to call these kinds of bugs "rabbit holes"; they -look shallow at first, but anytime you make a little progress forward a little -more is always required, until you discover the ending somewhere totally -unrelated to the start. - -## The Bug - -We are seeing *tons* of these in the SkyDNS log: - -``` -[skydns] Feb 20 17:21:15.168 INFO | no nameservers defined or name too short, can not forward -``` - -I fire up tcpdump to see if I can see anything interesting, and sure enough run -across a bunch of these: - -``` -# tcpdump -vvv -s 0 -l -n port 53 -tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes - ... - $fen_ip.50257 > $skydns_ip.domain: [udp sum ok] 16218+ A? unknown. (25) - $fen_ip.27372 > $skydns_ip.domain: [udp sum ok] 16218+ A? unknown. (25) - $fen_ip.35634 > $skydns_ip.domain: [udp sum ok] 59227+ A? unknown. (25) - $fen_ip.64363 > $skydns_ip.domain: [udp sum ok] 59227+ A? unknown. (25) -``` - -It appears that some of our front end nodes (FENs) are making tons of DNS -fequests trying to find the A record of `unknown`. Something on our FENs is -doing something insane and is breaking. - -## The FENs - -Hopping over to my favorite FEN we're able to see the packets in question -leaving on a tcpdump as well, but that's not helpful for finding the root cause. -We have lots of processes running on the FENs and any number of them could be -doing something crazy. - -We fire up sysdig, which is similar to systemtap and strace in that it allows -you to hook into the kernel and view various kernel activites in real time, but -it's easier to use than both. The following command dumps all UDP packets being -sent and what process is sending them: - -``` -# sysdig fd.l4proto=udp -... -2528950 22:17:35.260606188 0 php-fpm (21477) < connect res=0 tuple=$fen_ip:61173->$skydns_ip:53 -2528961 22:17:35.260611327 0 php-fpm (21477) > sendto fd=102(<4u>$fen_ip:61173->$skydns_ip:53) size=25 tuple=NULL -2528991 22:17:35.260631917 0 php-fpm (21477) < sendto res=25 data=.r...........unknown..... -2530470 22:17:35.261879032 0 php-fpm (21477) > ioctl fd=102(<4u>$fen_ip:61173->$skydns_ip:53) request=541B argument=7FFF82DC8728 -2530472 22:17:35.261880574 0 php-fpm (21477) < ioctl res=0 -2530474 22:17:35.261881226 0 php-fpm (21477) > recvfrom fd=102(<4u>$fen_ip:61173->$skydns_ip:53) size=1024 -2530476 22:17:35.261883424 0 php-fpm (21477) < recvfrom res=25 data=.r...........unknown..... tuple=$skydns_ip:53->$fen_ip:61173 -2530485 22:17:35.261888997 0 php-fpm (21477) > close fd=102(<4u>$fen_ip:61173->$skydns_ip:53) -2530488 22:17:35.261892626 0 php-fpm (21477) < close res=0 -``` - -Aha! We can see php-fpm is requesting something over udp with the string -`unknown` in it. We've now narrowed down the guilty process, the rest should be -easy right? - -## Which PHP? - -Unfortunately we're a PHP shop; knowing that php-fpm is doing something on a FEN -narrows down the guilty codebase little. Taking the FEN out of our load-balancer -stops the requests for `unknown`, so we *can* say that it's some user-facing -code that is the culprit. Our setup on the FENs involves users hitting nginx -for static content and nginx proxying PHP requests back to php-fpm. Since all -our virtual domains are defined in nginx, we are able to do something horrible. - -On the particular FEN we're on we make a guess about which virtual domain the -problem is likely coming from (our main app), and proxy all traffic from all -other domains to a different FEN. We still see requests for `unknown` leaving -the box, so we've narrowed the problem down a little more. - -## The Despair - -Nothing in our code is doing any direct DNS calls as far as we can find, and we -don't see any places PHP might be doing it for us. We have lots of PHP -extensions in place, all written in C and all black boxes; any of them could be -the culprit. Grepping through the likely candidates' source code for the string -`unknown` proves fruitless. - -We try xdebug at this point. xdebug is a profiler for php which will create -cachegrind files for the running code. With cachegrind you can see every -function which was ever called, how long spent within each function, a full -call-graph, and lots more. Unfortunately xdebug outputs cachegrind files on a -per-php-fpm-process basis, and overwrites the previous file on each new request. -So xdebug is pretty much useless, since what is in the cachegrind file isn't -necessarily what spawned the DNS request. - -## Gotcha (sorta) - -We turn back to the tried and true method of dumping all the traffic using -tcpdump and perusing through that manually. - -What we find is that nearly everytime there is a DNS request for `unknown`, if -we scroll up a bit there is (usually) a particular request to memcache. The -requested key is always in the style of `function-name:someid:otherstuff`. When -looking in the code around that function name we find this ominous looking call: - -```php -$ipAddress = getIPAddress(); -$geoipInfo = getCountryInfoFromIP($ipAddress); -``` - -This points us in the right direction. On a hunch we add some debug -logging to print out the `$ipAddress` variable, and sure enough it comes back as -`unknown`. AHA! - -So what we surmise is happening is that for some reason our geoip extension, -which we use to get the location data of an IP address and which -`getCountryInfoFromIP` calls, is seeing something which is *not* an IP address -and trying to resolve it. - -## Gotcha (for real) - -So the question becomes: why are we getting the string `unknown` as an IP -address? - -Adding some debug logging around the area we find before showed that -`$_SERVER['REMOTE_ADDR']`, which is the variable populated with the IP address -of the client, is sometimes `unknown`. We guess that this has something to do -with some magic we are doing on nginx's side to populate `REMOTE_ADDR` with the -real IP address of the client in the case of them going through a proxy. - -Many proxies send along the header `X-Forwarded-For` to indicate the real IP of -the client they're proxying for, otherwise the server would only see the proxy's -IP. In our setup I decided that in those cases we should set the `REMOTE_ADDR` -to the real client IP so our application logic doesn't even have to worry about -it. There are a couple problems with this which render it a bad decision, one -being that if some misbahaving proxy was to, say, start sending -`X-Forwarded-For: unknown` then some written applications might mistake that to -mean the client's IP is `unknown`. - -## The Fix - -The fix here was two-fold: - -1) We now always set `$_SERVER['REMOTE_ADDR']` to be the remote address of the -requests, regardless of if it's a proxy, and also send the application the -`X-Forwarded-For` header to do with as it pleases. - -2) Inside our app we look at all the headers sent and do some processing to -decide what the actual client IP is. PHP can handle a lot more complex logic -than nginx can, so we can do things like check to make sure the IP is an IP, and -also that it's not some NAT'd internal ip, and so forth. - -And that's it. From some weird log messages on our DNS servers to an nginx -mis-configuration on an almost unrelated set of servers, this is one of those -strange bugs that never has a nice solution and goes unsolved for a long time. -Spending the time to dive down the rabbit hole and find the answer is often -tedious, but also often very rewarding. - -[skydns]: https://github.com/skynetservices/skydns -- cgit v1.2.3