From 19eba1a5f25c1c27216fcef20ef16241a235e54f Mon Sep 17 00:00:00 2001 From: Brian Picciano Date: Wed, 16 Oct 2013 20:29:35 -0400 Subject: fixed posts, got rid of sample posts --- _posts/2013-08-12-another-sample.md | 42 ---- _posts/2013-08-20-sample-post.md | 42 ---- _posts/2013-10-8-generations.md | 100 +++++++++ _posts/2013-4-9-erlang-tcp-socket-pull-pattern.md | 257 ++++++++++++++++++++++ _posts/2013-7-11-goplus.md | 78 +++++++ erlang-tcp-socket-pull-pattern.md | 252 --------------------- generations.md | 95 -------- goplus.md | 73 ------ 8 files changed, 435 insertions(+), 504 deletions(-) delete mode 100644 _posts/2013-08-12-another-sample.md delete mode 100644 _posts/2013-08-20-sample-post.md create mode 100644 _posts/2013-10-8-generations.md create mode 100644 _posts/2013-4-9-erlang-tcp-socket-pull-pattern.md create mode 100644 _posts/2013-7-11-goplus.md delete mode 100644 erlang-tcp-socket-pull-pattern.md delete mode 100644 generations.md delete mode 100644 goplus.md diff --git a/_posts/2013-08-12-another-sample.md b/_posts/2013-08-12-another-sample.md deleted file mode 100644 index 2a0de20..0000000 --- a/_posts/2013-08-12-another-sample.md +++ /dev/null @@ -1,42 +0,0 @@ ---- -layout: post -title: Another sample -categories: -- blog ---- - -Tattooed roof party *vinyl* freegan single-origin coffee wayfarers tousled, umami yr -meggings hella selvage. Butcher bespoke seitan, cornhole umami gentrify put a bird -on it occupy trust fund. Umami whatever kitsch, locavore fingerstache Tumblr pork belly -[keffiyeh](#). Chia Echo Park Pitchfork, Blue Bottle [hashtag](#) stumptown skateboard selvage -mixtape. Echo Park retro butcher banjo cardigan, seitan flannel Brooklyn paleo fixie -Truffaut. Forage mustache Thundercats next level disrupt. Bicycle rights forage tattooed -chia, **wayfarers** swag raw denim hashtag biodiesel occupy gastropub! - ---- - -# It's all in the game. -## You come at the king, you best not miss. -### Be subtle with it, man. You know what subtle means? - -VHS post-ironic cred **bespoke** banjo. Yr wayfarers literally gentrify, flexitarian fap -dreamcatcher plaid cornhole Intelligentsia paleo. Beard try-hard direct trade, shabby chic -Helvetica `look ma, I can code`. Lo-fi American Apparel tattooed [Vice](#) tofu, yr vinyl. -Williamsburg butcher hella mumblecore fixie mlkshk, cliche wolf keytar mixtape kitsch banh mi -salvia. High Life Odd Future *chambray* kale chips hoodie, cray pop-up. Helvetica narwhal -iPhone try-hard jean shorts. - -> This is a quote from someone famous about productivity - - -Syntax highlighting with Solarized theme. - -{% highlight ruby %} -class User < ActiveRecord::Base - attr_accessible :email, :name - - ... tons of other crap ... - -end - -{% endhighlight %} diff --git a/_posts/2013-08-20-sample-post.md b/_posts/2013-08-20-sample-post.md deleted file mode 100644 index ea372fa..0000000 --- a/_posts/2013-08-20-sample-post.md +++ /dev/null @@ -1,42 +0,0 @@ ---- -layout: post -title: Sample post -categories: -- blog ---- - -Tattooed roof party *vinyl* freegan single-origin coffee wayfarers tousled, umami yr -meggings hella selvage. Butcher bespoke seitan, cornhole umami gentrify put a bird -on it occupy trust fund. Umami whatever kitsch, locavore fingerstache Tumblr pork belly -[keffiyeh](#). Chia Echo Park Pitchfork, Blue Bottle [hashtag](#) stumptown skateboard selvage -mixtape. Echo Park retro butcher banjo cardigan, seitan flannel Brooklyn paleo fixie -Truffaut. Forage mustache Thundercats next level disrupt. Bicycle rights forage tattooed -chia, **wayfarers** swag raw denim hashtag biodiesel occupy gastropub! - ---- - -# It's all in the game. -## You come at the king, you best not miss. -### Be subtle with it, man. You know what subtle means? - -VHS post-ironic cred **bespoke** banjo. Yr wayfarers literally gentrify, flexitarian fap -dreamcatcher plaid cornhole Intelligentsia paleo. Beard try-hard direct trade, shabby chic -Helvetica `look ma, I can code`. Lo-fi American Apparel tattooed [Vice](#) tofu, yr vinyl. -Williamsburg butcher hella mumblecore fixie mlkshk, cliche wolf keytar mixtape kitsch banh mi -salvia. High Life Odd Future *chambray* kale chips hoodie, cray pop-up. Helvetica narwhal -iPhone try-hard jean shorts. - -> This is a quote from someone famous about productivity - - -Syntax highlighting with Solarized theme. - -{% highlight ruby %} -class User < ActiveRecord::Base - attr_accessible :email, :name - - ... tons of other crap ... - -end - -{% endhighlight %} diff --git a/_posts/2013-10-8-generations.md b/_posts/2013-10-8-generations.md new file mode 100644 index 0000000..904f625 --- /dev/null +++ b/_posts/2013-10-8-generations.md @@ -0,0 +1,100 @@ +--- +layout: post +title: Generations +categories: +- crypticio +--- + +A simple file distribution strategy for very large scale, high-availability +file-services. + +# The problem + +Working at a shop where we have millions of different files, any of which could +be arbitrarily chosen to serve to a file at any given time. These files are +uploaded by users of the app and retrieved by others. + +Scaling such a system is no easy task. The chosen solution involves shuffling +files around on a nearly constant basis, making sure that files which are more +"popular" are on fast drives, while at the same time making sure that no drives +are at capicty and at the same time that all files, even newly uploaded ones, +are stored redundantly. + +The problem with this solution is one of coordination. At any given moment the +app needs to be able to "find" a file so it can give the client a link to +download the file from one of the servers that it's on. Full-filling this simple +requirement means that all datastores/caches where information about where a +file lives need to be up-to-date at all times, and even then there are +race-conditions and network failures to contend with, while at all times the +requirements of the app evolve and change. + +# A simpler solution + +Let's say you want all files which get uploaded to be replicated in triplicate +in some capacity. You buy three identical hard-disks, and put each on a separate +server. As files get uploaded by clients, each file gets put on each drive +immediately. When the drives are filled (which should be at around the same +time), you stop uploading to them. + +That was generation 0. + +You buy three more drives, and start putting all files on them instead. This is +going to be generation 1. Repeat until you run out of money. + +That's it. + +## That's it? + +It seems simple and obvious, and maybe it's the standard thing which is done, +but as far as I can tell no-one has written about it (though I'm probably not +searching for the right thing, let me know if this is the case!). + +## Advantages + +* It's so simple to implement, you could probably do it in a day if you're +starting a project from scratch + +* By definition of the scheme all files are replicated in multiple places. + +* Minimal information about where a file "is" needs to be stored. When a file is +uploaded all that's needed is to know what generation it is in, and then what +nodes/drives are in that generation. + +* Drives don't need to "know" about each other. What I mean by this is that +whatever is running as the receive point for file-uploads on each drive doesn't +have to coordinate with its siblings running on the other drives in the +generation. In fact it doesn't need to coordinate with anyone. You could +literally rsync files onto your drives if you wanted to. I would recommend using +[marlin][0] though :) + +* Scaling is easy. When you run out of space you can simply start a new +generation. If you don't like playing that close to the chest there's nothing to +say you can't have two generations active at the same time. + +* Upgrading is easy. As long as a generation is not marked-for-upload, you can +easily copy all files in the generation into a new set of bigger, badder drives, +add those drives into the generation in your code, remove the old ones, then +mark the generation as uploadable again. + +* Distribution is easy. You just copy a generation's files onto a new drive in +Europe or wherever you're getting an uptick in traffic from and you're good to +go. + +* Management is easy. It's trivial to find out how many times a file has been +replicated, or how many countries it's in, or what hardware it's being served +from (given you have easy access to information about specific drives). + +## Caveats + +The big caveat here is that this is just an idea. It has NOT been tested in +production. But we have enough faith in it that we're going to give it a shot at +cryptic.io. I'll keep this page updated. + +The second caveat is that this scheme does not inherently support caching. If a +file suddenly becomes super popular the world over your hard-disks might not be +able to keep up, and it's probably not feasible to have an FIO drive in *every* +generation. I think that [groupcache][1] may be the answer to this problem, +assuming your files are reasonably small, but again I haven't tested it yet. + +[0]: https://github.com/cryptic-io/marlin +[1]: https://github.com/golang/groupcache diff --git a/_posts/2013-4-9-erlang-tcp-socket-pull-pattern.md b/_posts/2013-4-9-erlang-tcp-socket-pull-pattern.md new file mode 100644 index 0000000..23d1736 --- /dev/null +++ b/_posts/2013-4-9-erlang-tcp-socket-pull-pattern.md @@ -0,0 +1,257 @@ +--- +layout: post +title: "Erlang, tcp sockets, and active true" +categories: +- erlang +--- + +If you don't know erlang then [you're missing out][0]. If you do know erlang, +you've probably at some point done something with tcp sockets. Erlang's highly +concurrent model of execution lends itself well to server programs where a high +number of active connections is desired. Each thread can autonomously handle its +single client, greatly simplifying the logic of the whole application while +still retaining [great performance characteristics][1]. + +# Background + +For an erlang thread which owns a single socket there are three different ways +to receive data off of that socket. These all revolve around the `active` +[setopts][2] flag. A socket can be set to one of: + +* `{active,false}` - All data must be obtained through [recv/2][3] calls. This + amounts to syncronous socket reading. + +* `{active,true}` - All data on the socket gets sent to the controlling thread + as a normal erlang message. It is the thread's + responsibility to keep up with the buffered data in the + message queue. This amounts to asyncronous socket reading. + +* `{active,once}` - When set the socket is placed in `{active,true}` for a + single packet. That is, once set the thread can expect a + single message to be sent to when data comes in. To receive + any more data off of the socket the socket must either be + read from using [recv/2][3] or be put in `{active,once}` or + `{active,true}`. + +# Which to use? + +Many (most?) tutorials advocate using `{active,once}` in your application +\[0]\[1]\[2]. This has to do with usability and security. When in `{active,true}` +it's possible for a client to flood the connection faster than the receiving +process will process those messages, potentially eating up a lot of memory in +the VM. However, if you want to be able to receive both tcp data messages as +well as other messages from other erlang processes at the same time you can't +use `{active,false}`. So `{active,once}` is generally preferred because it +deals with both of these problems quite well. + +# Why not to use `{active,once}` + +Here's what your classic `{active,once}` enabled tcp socket implementation will +probably look like: + +```erlang +-module(tcp_test). +-compile(export_all). + +-define(TCP_OPTS, [ + binary, + {packet, raw}, + {nodelay,true}, + {active, false}, + {reuseaddr, true}, + {keepalive,true}, + {backlog,500} +]). + +%Start listening +listen(Port) -> + {ok, L} = gen_tcp:listen(Port, ?TCP_OPTS), + ?MODULE:accept(L). + +%Accept a connection +accept(L) -> + {ok, Socket} = gen_tcp:accept(L), + ?MODULE:read_loop(Socket), + io:fwrite("Done reading, connection was closed\n"), + ?MODULE:accept(L). + +%Read everything it sends us +read_loop(Socket) -> + inet:setopts(Socket, [{active, once}]), + receive + {tcp, _, _} -> + do_stuff_here, + ?MODULE:read_loop(Socket); + {tcp_closed, _}-> donezo; + {tcp_error, _, _} -> donezo + end. +``` + +This code isn't actually usable for a production system; it doesn't even spawn a +new process for the new socket. But that's not the point I'm making. If I run it +with `tcp_test:listen(8000)`, and in other window do: + +```bash +while [ 1 ]; do echo "aloha"; done | nc localhost 8000 +``` + +We'll be flooding the the server with data pretty well. Using [eprof][4] we can +get an idea of how our code performs, and where the hang-ups are: + +```erlang +1> eprof:start(). +{ok,<0.34.0>} + +2> P = spawn(tcp_test,listen,[8000]). +<0.36.0> + +3> eprof:start_profiling([P]). +profiling + +4> running_the_while_loop. +running_the_while_loop + +5> eprof:stop_profiling(). +profiling_stopped + +6> eprof:analyze(procs,[{sort,time}]). + +****** Process <0.36.0> -- 100.00 % of profiled time *** +FUNCTION CALLS % TIME [uS / CALLS] +-------- ----- --- ---- [----------] +prim_inet:type_value_2/2 6 0.00 0 [ 0.00] + +....snip.... + +prim_inet:enc_opts/2 6 0.00 8 [ 1.33] +prim_inet:setopts/2 12303599 1.85 1466319 [ 0.12] +tcp_test:read_loop/1 12303598 2.22 1761775 [ 0.14] +prim_inet:encode_opt_val/1 12303599 3.50 2769285 [ 0.23] +prim_inet:ctl_cmd/3 12303600 4.29 3399333 [ 0.28] +prim_inet:enc_opt_val/2 24607203 5.28 4184818 [ 0.17] +inet:setopts/2 12303598 5.72 4533863 [ 0.37] +erlang:port_control/3 12303600 77.13 61085040 [ 4.96] +``` + +eprof shows us where our process is spending the majority of its time. The `%` +column indicates percentage of time the process spent during profiling inside +any function. We can pretty clearly see that the vast majority of time was spent +inside `erlang:port_control/3`, the BIF that `inet:setopts/2` uses to switch the +socket to `{active,once}` mode. Amongst the calls which were called on every +loop, it takes up by far the most amount of time. In addition all of those other +calls are also related to `inet:setopts/2`. + +I'm gonna rewrite our little listen server to use `{active,true}`, and we'll do +it all again: + +```erlang +-module(tcp_test). +-compile(export_all). + +-define(TCP_OPTS, [ + binary, + {packet, raw}, + {nodelay,true}, + {active, false}, + {reuseaddr, true}, + {keepalive,true}, + {backlog,500} +]). + +%Start listening +listen(Port) -> + {ok, L} = gen_tcp:listen(Port, ?TCP_OPTS), + ?MODULE:accept(L). + +%Accept a connection +accept(L) -> + {ok, Socket} = gen_tcp:accept(L), + inet:setopts(Socket, [{active, true}]), %Well this is new + ?MODULE:read_loop(Socket), + io:fwrite("Done reading, connection was closed\n"), + ?MODULE:accept(L). + +%Read everything it sends us +read_loop(Socket) -> + %inet:setopts(Socket, [{active, once}]), + receive + {tcp, _, _} -> + do_stuff_here, + ?MODULE:read_loop(Socket); + {tcp_closed, _}-> donezo; + {tcp_error, _, _} -> donezo + end. +``` + +And the profiling results: + +```erlang +1> eprof:start(). +{ok,<0.34.0>} + +2> P = spawn(tcp_test,listen,[8000]). +<0.36.0> + +3> eprof:start_profiling([P]). +profiling + +4> running_the_while_loop. +running_the_while_loop + +5> eprof:stop_profiling(). +profiling_stopped + +6> eprof:analyze(procs,[{sort,time}]). + +****** Process <0.36.0> -- 100.00 % of profiled time *** +FUNCTION CALLS % TIME [uS / CALLS] +-------- ----- --- ---- [----------] +prim_inet:enc_value_1/3 7 0.00 1 [ 0.14] +prim_inet:decode_opt_val/1 1 0.00 1 [ 1.00] +inet:setopts/2 1 0.00 2 [ 2.00] +prim_inet:setopts/2 2 0.00 2 [ 1.00] +prim_inet:enum_name/2 1 0.00 2 [ 2.00] +erlang:port_set_data/2 1 0.00 2 [ 2.00] +inet_db:register_socket/2 1 0.00 3 [ 3.00] +prim_inet:type_value_1/3 7 0.00 3 [ 0.43] + +.... snip .... + +prim_inet:type_opt_1/1 19 0.00 7 [ 0.37] +prim_inet:enc_value/3 7 0.00 7 [ 1.00] +prim_inet:enum_val/2 6 0.00 7 [ 1.17] +prim_inet:dec_opt_val/1 7 0.00 7 [ 1.00] +prim_inet:dec_value/2 6 0.00 10 [ 1.67] +prim_inet:enc_opt/1 13 0.00 12 [ 0.92] +prim_inet:type_opt/2 19 0.00 33 [ 1.74] +erlang:port_control/3 3 0.00 59 [ 19.67] +tcp_test:read_loop/1 20716370 100.00 12187488 [ 0.59] +``` + +This time our process spent almost no time at all (according to eprof, 0%) +fiddling with the socket opts. Instead it spent all of its time in the +read_loop doing the work we actually want to be doing. + +# So what does this mean? + +I'm by no means advocating never using `{active,once}`. The security concern is +still a completely valid concern and one that `{active,once}` mitigates quite +well. I'm simply pointing out that this mitigation has some fairly serious +performance implications which have the potential to bite you if you're not +careful, especially in cases where a socket is going to be receiving a large +amount of traffic. + +# Meta + +These tests were done using R15B03, but I've done similar ones in R14 and found +similar results. I have not tested R16. + +* \[0] http://learnyousomeerlang.com/buckets-of-sockets +* \[1] http://www.erlang.org/doc/man/gen_tcp.html#examples +* \[2] http://erlycoder.com/25/erlang-tcp-server-tcp-client-sockets-with-gen_tcp + +[0]: http://learnyousomeerlang.com/content +[1]: http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-1 +[2]: http://www.erlang.org/doc/man/inet.html#setopts-2 +[3]: http://www.erlang.org/doc/man/gen_tcp.html#recv-2 +[4]: http://www.erlang.org/doc/man/eprof.html diff --git a/_posts/2013-7-11-goplus.md b/_posts/2013-7-11-goplus.md new file mode 100644 index 0000000..0b76add --- /dev/null +++ b/_posts/2013-7-11-goplus.md @@ -0,0 +1,78 @@ +--- +layout: post +title: Go+ +categories: +- go +--- + +Compared to other languages go has some strange behavior regarding its project +root settings. If you import a library called `somelib`, go will look for a +`src/somelib` folder in all of the folders in the `$GOPATH` environment +variable. This works nicely for globally installed packages, but it makes +encapsulating a project with a specific version, or modified version, rather +tedious. Whenever you go to work on this project you'll have to add its path to +your `$GOPATH`, or add the path permanently, which could break other projects +which may use a different version of `somelib`. + +My solution is in the form of a simple script I'm calling go+. go+ will search +in currrent directory and all of its parents for a file called `GOPROJROOT`. If +it finds that file in a directory, it prepends that directory's absolute path to +your `$GOPATH` and stops the search. Regardless of whether or not `GOPROJROOT` +was found go+ will passthrough all arguments to the actual go call. The +modification to `$GOPATH` will only last the duration of the call. + +As an example, consider the following: +``` +/tmp + /hello + GOPROJROOT + /src + /somelib/somelib.go + /hello.go +``` + +If `hello.go` depends on `somelib`, as long as you run go+ from `/tmp/hello` or +one of its children your project will still compile + +Here is the source code for go+: + +```bash +#!/bin/sh + +SEARCHING_FOR=GOPROJROOT +ORIG_DIR=$(pwd) + +STOPSEARCH=0 +SEARCH_DIR=$ORIG_DIR +while [ $STOPSEARCH = 0 ]; do + + RES=$( find $SEARCH_DIR -maxdepth 1 -type f -name $SEARCHING_FOR | \ + grep -P "$SEARCHING_FOR$" | \ + head -n1 ) + + if [ "$RES" = "" ]; then + if [ "$SEARCH_DIR" = "/" ]; then + STOPSEARCH=1 + fi + cd .. + SEARCH_DIR=$(pwd) + else + export GOPATH=$SEARCH_DIR:$GOPATH + STOPSEARCH=1 + fi +done + +cd "$ORIG_DIR" +exec go $@ +``` + +# UPDATE: Goat + +I'm leaving this post for posterity, but go+ has some serious flaws in it. For +one, it doesn't allow for specifying the version of a dependency you want to +use. To this end, I wrote [goat][0] which does all the things go+ does, plus +real dependency management, PLUS it is built in a way that if you've been +following go's best-practices for code organization you shouldn't have to change +any of your existing code AT ALL. It's cool, check it out. + +[0]: http://github.com/mediocregopher/goat diff --git a/erlang-tcp-socket-pull-pattern.md b/erlang-tcp-socket-pull-pattern.md deleted file mode 100644 index 419d005..0000000 --- a/erlang-tcp-socket-pull-pattern.md +++ /dev/null @@ -1,252 +0,0 @@ -# Erlang, tcp sockets, and active true - -If you don't know erlang then [you're missing out][0]. If you do know erlang, -you've probably at some point done something with tcp sockets. Erlang's highly -concurrent model of execution lends itself well to server programs where a high -number of active connections is desired. Each thread can autonomously handle its -single client, greatly simplifying the logic of the whole application while -still retaining [great performance characteristics][1]. - -# Background - -For an erlang thread which owns a single socket there are three different ways -to receive data off of that socket. These all revolve around the `active` -[setopts][2] flag. A socket can be set to one of: - -* `{active,false}` - All data must be obtained through [recv/2][3] calls. This - amounts to syncronous socket reading. - -* `{active,true}` - All data on the socket gets sent to the controlling thread - as a normal erlang message. It is the thread's - responsibility to keep up with the buffered data in the - message queue. This amounts to asyncronous socket reading. - -* `{active,once}` - When set the socket is placed in `{active,true}` for a - single packet. That is, once set the thread can expect a - single message to be sent to when data comes in. To receive - any more data off of the socket the socket must either be - read from using [recv/2][3] or be put in `{active,once}` or - `{active,true}`. - -# Which to use? - -Many (most?) tutorials advocate using `{active,once}` in your application -\[0]\[1]\[2]. This has to do with usability and security. When in `{active,true}` -it's possible for a client to flood the connection faster than the receiving -process will process those messages, potentially eating up a lot of memory in -the VM. However, if you want to be able to receive both tcp data messages as -well as other messages from other erlang processes at the same time you can't -use `{active,false}`. So `{active,once}` is generally preferred because it -deals with both of these problems quite well. - -# Why not to use `{active,once}` - -Here's what your classic `{active,once}` enabled tcp socket implementation will -probably look like: - -```erlang --module(tcp_test). --compile(export_all). - --define(TCP_OPTS, [ - binary, - {packet, raw}, - {nodelay,true}, - {active, false}, - {reuseaddr, true}, - {keepalive,true}, - {backlog,500} -]). - -%Start listening -listen(Port) -> - {ok, L} = gen_tcp:listen(Port, ?TCP_OPTS), - ?MODULE:accept(L). - -%Accept a connection -accept(L) -> - {ok, Socket} = gen_tcp:accept(L), - ?MODULE:read_loop(Socket), - io:fwrite("Done reading, connection was closed\n"), - ?MODULE:accept(L). - -%Read everything it sends us -read_loop(Socket) -> - inet:setopts(Socket, [{active, once}]), - receive - {tcp, _, _} -> - do_stuff_here, - ?MODULE:read_loop(Socket); - {tcp_closed, _}-> donezo; - {tcp_error, _, _} -> donezo - end. -``` - -This code isn't actually usable for a production system; it doesn't even spawn a -new process for the new socket. But that's not the point I'm making. If I run it -with `tcp_test:listen(8000)`, and in other window do: - -```bash -while [ 1 ]; do echo "aloha"; done | nc localhost 8000 -``` - -We'll be flooding the the server with data pretty well. Using [eprof][4] we can -get an idea of how our code performs, and where the hang-ups are: - -```erlang -1> eprof:start(). -{ok,<0.34.0>} - -2> P = spawn(tcp_test,listen,[8000]). -<0.36.0> - -3> eprof:start_profiling([P]). -profiling - -4> running_the_while_loop. -running_the_while_loop - -5> eprof:stop_profiling(). -profiling_stopped - -6> eprof:analyze(procs,[{sort,time}]). - -****** Process <0.36.0> -- 100.00 % of profiled time *** -FUNCTION CALLS % TIME [uS / CALLS] --------- ----- --- ---- [----------] -prim_inet:type_value_2/2 6 0.00 0 [ 0.00] - -....snip.... - -prim_inet:enc_opts/2 6 0.00 8 [ 1.33] -prim_inet:setopts/2 12303599 1.85 1466319 [ 0.12] -tcp_test:read_loop/1 12303598 2.22 1761775 [ 0.14] -prim_inet:encode_opt_val/1 12303599 3.50 2769285 [ 0.23] -prim_inet:ctl_cmd/3 12303600 4.29 3399333 [ 0.28] -prim_inet:enc_opt_val/2 24607203 5.28 4184818 [ 0.17] -inet:setopts/2 12303598 5.72 4533863 [ 0.37] -erlang:port_control/3 12303600 77.13 61085040 [ 4.96] -``` - -eprof shows us where our process is spending the majority of its time. The `%` -column indicates percentage of time the process spent during profiling inside -any function. We can pretty clearly see that the vast majority of time was spent -inside `erlang:port_control/3`, the BIF that `inet:setopts/2` uses to switch the -socket to `{active,once}` mode. Amongst the calls which were called on every -loop, it takes up by far the most amount of time. In addition all of those other -calls are also related to `inet:setopts/2`. - -I'm gonna rewrite our little listen server to use `{active,true}`, and we'll do -it all again: - -```erlang --module(tcp_test). --compile(export_all). - --define(TCP_OPTS, [ - binary, - {packet, raw}, - {nodelay,true}, - {active, false}, - {reuseaddr, true}, - {keepalive,true}, - {backlog,500} -]). - -%Start listening -listen(Port) -> - {ok, L} = gen_tcp:listen(Port, ?TCP_OPTS), - ?MODULE:accept(L). - -%Accept a connection -accept(L) -> - {ok, Socket} = gen_tcp:accept(L), - inet:setopts(Socket, [{active, true}]), %Well this is new - ?MODULE:read_loop(Socket), - io:fwrite("Done reading, connection was closed\n"), - ?MODULE:accept(L). - -%Read everything it sends us -read_loop(Socket) -> - %inet:setopts(Socket, [{active, once}]), - receive - {tcp, _, _} -> - do_stuff_here, - ?MODULE:read_loop(Socket); - {tcp_closed, _}-> donezo; - {tcp_error, _, _} -> donezo - end. -``` - -And the profiling results: - -```erlang -1> eprof:start(). -{ok,<0.34.0>} - -2> P = spawn(tcp_test,listen,[8000]). -<0.36.0> - -3> eprof:start_profiling([P]). -profiling - -4> running_the_while_loop. -running_the_while_loop - -5> eprof:stop_profiling(). -profiling_stopped - -6> eprof:analyze(procs,[{sort,time}]). - -****** Process <0.36.0> -- 100.00 % of profiled time *** -FUNCTION CALLS % TIME [uS / CALLS] --------- ----- --- ---- [----------] -prim_inet:enc_value_1/3 7 0.00 1 [ 0.14] -prim_inet:decode_opt_val/1 1 0.00 1 [ 1.00] -inet:setopts/2 1 0.00 2 [ 2.00] -prim_inet:setopts/2 2 0.00 2 [ 1.00] -prim_inet:enum_name/2 1 0.00 2 [ 2.00] -erlang:port_set_data/2 1 0.00 2 [ 2.00] -inet_db:register_socket/2 1 0.00 3 [ 3.00] -prim_inet:type_value_1/3 7 0.00 3 [ 0.43] - -.... snip .... - -prim_inet:type_opt_1/1 19 0.00 7 [ 0.37] -prim_inet:enc_value/3 7 0.00 7 [ 1.00] -prim_inet:enum_val/2 6 0.00 7 [ 1.17] -prim_inet:dec_opt_val/1 7 0.00 7 [ 1.00] -prim_inet:dec_value/2 6 0.00 10 [ 1.67] -prim_inet:enc_opt/1 13 0.00 12 [ 0.92] -prim_inet:type_opt/2 19 0.00 33 [ 1.74] -erlang:port_control/3 3 0.00 59 [ 19.67] -tcp_test:read_loop/1 20716370 100.00 12187488 [ 0.59] -``` - -This time our process spent almost no time at all (according to eprof, 0%) -fiddling with the socket opts. Instead it spent all of its time in the -read_loop doing the work we actually want to be doing. - -# So what does this mean? - -I'm by no means advocating never using `{active,once}`. The security concern is -still a completely valid concern and one that `{active,once}` mitigates quite -well. I'm simply pointing out that this mitigation has some fairly serious -performance implications which have the potential to bite you if you're not -careful, especially in cases where a socket is going to be receiving a large -amount of traffic. - -# Meta - -These tests were done using R15B03, but I've done similar ones in R14 and found -similar results. I have not tested R16. - -* \[0] http://learnyousomeerlang.com/buckets-of-sockets -* \[1] http://www.erlang.org/doc/man/gen_tcp.html#examples -* \[2] http://erlycoder.com/25/erlang-tcp-server-tcp-client-sockets-with-gen_tcp - -[0]: http://learnyousomeerlang.com/content -[1]: http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-1 -[2]: http://www.erlang.org/doc/man/inet.html#setopts-2 -[3]: http://www.erlang.org/doc/man/gen_tcp.html#recv-2 -[4]: http://www.erlang.org/doc/man/eprof.html diff --git a/generations.md b/generations.md deleted file mode 100644 index d36b175..0000000 --- a/generations.md +++ /dev/null @@ -1,95 +0,0 @@ -# Generations - -A simple file distribution strategy for very large scale, high-availability -file-services. - -# The problem - -Working at a shop where we have millions of different files, any of which could -be arbitrarily chosen to serve to a file at any given time. These files are -uploaded by users of the app and retrieved by others. - -Scaling such a system is no easy task. The chosen solution involves shuffling -files around on a nearly constant basis, making sure that files which are more -"popular" are on fast drives, while at the same time making sure that no drives -are at capicty and at the same time that all files, even newly uploaded ones, -are stored redundantly. - -The problem with this solution is one of coordination. At any given moment the -app needs to be able to "find" a file so it can give the client a link to -download the file from one of the servers that it's on. Full-filling this simple -requirement means that all datastores/caches where information about where a -file lives need to be up-to-date at all times, and even then there are -race-conditions and network failures to contend with, while at all times the -requirements of the app evolve and change. - -# A simpler solution - -Let's say you want all files which get uploaded to be replicated in triplicate -in some capacity. You buy three identical hard-disks, and put each on a separate -server. As files get uploaded by clients, each file gets put on each drive -immediately. When the drives are filled (which should be at around the same -time), you stop uploading to them. - -That was generation 0. - -You buy three more drives, and start putting all files on them instead. This is -going to be generation 1. Repeat until you run out of money. - -That's it. - -## That's it? - -It seems simple and obvious, and maybe it's the standard thing which is done, -but as far as I can tell no-one has written about it (though I'm probably not -searching for the right thing, let me know if this is the case!). - -## Advantages - -* It's so simple to implement, you could probably do it in a day if you're -starting a project from scratch - -* By definition of the scheme all files are replicated in multiple places. - -* Minimal information about where a file "is" needs to be stored. When a file is -uploaded all that's needed is to know what generation it is in, and then what -nodes/drives are in that generation. - -* Drives don't need to "know" about each other. What I mean by this is that -whatever is running as the receive point for file-uploads on each drive doesn't -have to coordinate with its siblings running on the other drives in the -generation. In fact it doesn't need to coordinate with anyone. You could -literally rsync files onto your drives if you wanted to. I would recommend using -[marlin][0] though :) - -* Scaling is easy. When you run out of space you can simply start a new -generation. If you don't like playing that close to the chest there's nothing to -say you can't have two generations active at the same time. - -* Upgrading is easy. As long as a generation is not marked-for-upload, you can -easily copy all files in the generation into a new set of bigger, badder drives, -add those drives into the generation in your code, remove the old ones, then -mark the generation as uploadable again. - -* Distribution is easy. You just copy a generation's files onto a new drive in -Europe or wherever you're getting an uptick in traffic from and you're good to -go. - -* Management is easy. It's trivial to find out how many times a file has been -replicated, or how many countries it's in, or what hardware it's being served -from (given you have easy access to information about specific drives). - -## Caveats - -The big caveat here is that this is just an idea. It has NOT been tested in -production. But we have enough faith in it that we're going to give it a shot at -cryptic.io. I'll keep this page updated. - -The second caveat is that this scheme does not inherently support caching. If a -file suddenly becomes super popular the world over your hard-disks might not be -able to keep up, and it's probably not feasible to have an FIO drive in *every* -generation. I think that [groupcache][1] may be the answer to this problem, -assuming your files are reasonably small, but again I haven't tested it yet. - -[0]: https://github.com/cryptic-io/marlin -[1]: https://github.com/golang/groupcache diff --git a/goplus.md b/goplus.md deleted file mode 100644 index 58ab303..0000000 --- a/goplus.md +++ /dev/null @@ -1,73 +0,0 @@ -# Go and project root - -Compared to other languages go has some strange behavior regarding its project -root settings. If you import a library called `somelib`, go will look for a -`src/somelib` folder in all of the folders in the `$GOPATH` environment -variable. This works nicely for globally installed packages, but it makes -encapsulating a project with a specific version, or modified version, rather -tedious. Whenever you go to work on this project you'll have to add its path to -your `$GOPATH`, or add the path permanently, which could break other projects -which may use a different version of `somelib`. - -My solution is in the form of a simple script I'm calling go+. go+ will search -in currrent directory and all of its parents for a file called `GOPROJROOT`. If -it finds that file in a directory, it prepends that directory's absolute path to -your `$GOPATH` and stops the search. Regardless of whether or not `GOPROJROOT` -was found go+ will passthrough all arguments to the actual go call. The -modification to `$GOPATH` will only last the duration of the call. - -As an example, consider the following: -``` -/tmp - /hello - GOPROJROOT - /src - /somelib/somelib.go - /hello.go -``` - -If `hello.go` depends on `somelib`, as long as you run go+ from `/tmp/hello` or -one of its children your project will still compile - -Here is the source code for go+: - -```bash -#!/bin/sh - -SEARCHING_FOR=GOPROJROOT -ORIG_DIR=$(pwd) - -STOPSEARCH=0 -SEARCH_DIR=$ORIG_DIR -while [ $STOPSEARCH = 0 ]; do - - RES=$( find $SEARCH_DIR -maxdepth 1 -type f -name $SEARCHING_FOR | \ - grep -P "$SEARCHING_FOR$" | \ - head -n1 ) - - if [ "$RES" = "" ]; then - if [ "$SEARCH_DIR" = "/" ]; then - STOPSEARCH=1 - fi - cd .. - SEARCH_DIR=$(pwd) - else - export GOPATH=$SEARCH_DIR:$GOPATH - STOPSEARCH=1 - fi -done - -cd "$ORIG_DIR" -exec go $@ -``` - -# UPDATE: Goat - -I'm leaving this post for posterity, but go+ has some serious flaws in it. For -one, it doesn't allow for specifying the version of a dependency you want to -use. To this end, I wrote [goat][0] which does all the things go+ does, plus -real dependency management, PLUS it is built in a way that if you've been -following go's best-practices for code organization you shouldn't have to change -any of your existing code AT ALL. It's cool, check it out. - -[0]: http://github.com/mediocregopher/goat -- cgit v1.2.3