-rw-r--r--  README.md                           75
-rw-r--r--  erlang-tcp-socket-pull-pattern.md  252
-rw-r--r--  generations.md                      95
-rw-r--r--  goplus.md                           73
-rw-r--r--  lagom-master.zip                   bin 216464 -> 0 bytes
-rw-r--r--  lagom-master/.gitignore              2
-rwxr-xr-x  res/go+                             27
7 files changed, 453 insertions, 71 deletions
diff --git a/README.md b/README.md
index eb1ff2e..4d685c9 100644
--- a/README.md
+++ b/README.md
@@ -1,72 +1,9 @@
-# Lagom
+This is my here blog. It's not much at the moment (one post? booyah!), but maybe it'll grow.
-> #### *Lagom* is a Swedish word with no direct English equivalent, meaning "just the right amount"
+Maybe not
-Lagom, a [Jekyll][j] blog theme with just the right amount of style.
+* [Erlang, tcp sockets, and active true](erlang-tcp-socket-pull-pattern.md) (originally posted March 9, 2013)
+* [go+](goplus.md) (originally posted July 11, 2013)
+* [Generations](generations.md) (originally posted October 8, 2013)
-Extracted lovingly from [http://mdswanson.com][mds] for your enjoyment!
-
-* Responsive, based on [Skeleton][skeleton]
-* [Font Awesome][font-awesome] for icons
-* Open Sans from [Google web fonts][gfonts]
-* Built-in Atom RSS feed
-
-## Action Shots
-![](http://i.imgur.com/Pmzk4j1.png)
-![](http://i.imgur.com/CT2Xvug.png)
-![](http://i.imgur.com/XisjqW1.jpg)
-
-## Installation
-
-- Install Jekyll: `gem install jekyll`
-- [Fork this repository][fork]
-- Clone it: `git clone https://github.com/YOUR-USER/lagom`
-- Run the jekyll server: `jekyll serve`
-
-You should have a server up and running locally at <http://localhost:4000>.
-
-## Customization
-
-Next you'll want to change a few things. Most of them can be changed directly in
-[_config.yml][config]. That's where you can add your social links, change the accent
-color, stuff like that.
-
-There's a few other places that you'll want to change, too:
-
-- [CNAME][cname]: If you're using this on GitHub Pages with a custom domain name,
- you'll want to change this to be the domain you're going to use. All that should
- be in here is a domain name on the first line and nothing else (like: `example.com`).
-- [favicon.png][favicon]: This is the icon in your browser's address bar. You should
- change it to whatever you'd like.
-- [logo.png][logo]: A square-ish image that appears in the upper-left corner
-
-## Deployment
-
-You should deploy with [GitHub Pages][pages] - it's just easier.
-
-All you should have to do is rename your repository on GitHub to be
-`username.github.io`. Since everything is on the `gh-pages` branch, you
-should be able to see your new site at <http://username.github.io>.
-
-## Licensing
-
-[MIT](https://github.com/swanson/lagom/blob/master/LICENSE) with no
-added caveats, so feel free to use this on your site without linking back to
-me or using a disclaimer or anything silly like that.
-
-## Contact
-I'd love to hear from you at [@_swanson][twitter]. Feel free to open issues if you
-run into trouble or have suggestions. Pull Requests always welcome.
-
-[j]: http://jekyllrb.com/
-[mds]: http://mdswanson.com
-[skeleton]: http://www.getskeleton.com/
-[font-awesome]: http://fortawesome.github.io/Font-Awesome/
-[gfonts]: http://www.google.com/fonts/specimen/Open+Sans
-[fork]: https://github.com/swanson/lagom/fork
-[config]: https://github.com/swanson/lagom/blob/master/_config.yml
-[cname]: https://github.com/swanson/lagom/blob/master/CNAME
-[favicon]: https://github.com/swanson/lagom/blob/master/favicon.png
-[logo]: https://github.com/swanson/lagom/blob/master/logo.png
-[pages]: http://pages.github.com
-[twitter]: https://twitter.com/_swanson
+That's all folks!
diff --git a/erlang-tcp-socket-pull-pattern.md b/erlang-tcp-socket-pull-pattern.md
new file mode 100644
index 0000000..419d005
--- /dev/null
+++ b/erlang-tcp-socket-pull-pattern.md
@@ -0,0 +1,252 @@
+# Erlang, tcp sockets, and active true
+
+If you don't know erlang then [you're missing out][0]. If you do know erlang,
+you've probably at some point done something with tcp sockets. Erlang's highly
+concurrent model of execution lends itself well to server programs where a high
+number of active connections is desired. Each thread can autonomously handle its
+single client, greatly simplifying the logic of the whole application while
+still retaining [great performance characteristics][1].
+
+# Background
+
+For an erlang thread which owns a single socket, there are three different ways
+to receive data off of that socket. These all revolve around the `active`
+[setopts][2] flag. A socket can be set to one of:
+
+* `{active,false}` - All data must be obtained through [recv/2][3] calls. This
+ amounts to synchronous socket reading.
+
+* `{active,true}` - All data on the socket gets sent to the controlling thread
+ as a normal erlang message. It is the thread's
+ responsibility to keep up with the buffered data in the
+ message queue. This amounts to asynchronous socket reading.
+
+* `{active,once}` - When set the socket is placed in `{active,true}` for a
+ single packet. That is, once set the thread can expect a
+ single message to be sent to it when data comes in. To receive
+ any more data off of the socket the socket must either be
+ read from using [recv/2][3] or be put in `{active,once}` or
+ `{active,true}`.
+
+# Which to use?
+
+Many (most?) tutorials advocate using `{active,once}` in your application
+\[0]\[1]\[2]. This has to do with usability and security. When in `{active,true}`
+it's possible for a client to flood the connection faster than the receiving
+process will process those messages, potentially eating up a lot of memory in
+the VM. However, if you want to be able to receive both tcp data messages as
+well as other messages from other erlang processes at the same time you can't
+use `{active,false}`. So `{active,once}` is generally preferred because it
+deals with both of these problems quite well.
+
+# Why not to use `{active,once}`
+
+Here's what your classic `{active,once}` enabled tcp socket implementation will
+probably look like:
+
+```erlang
+-module(tcp_test).
+-compile(export_all).
+
+-define(TCP_OPTS, [
+ binary,
+ {packet, raw},
+ {nodelay,true},
+ {active, false},
+ {reuseaddr, true},
+ {keepalive,true},
+ {backlog,500}
+]).
+
+%Start listening
+listen(Port) ->
+ {ok, L} = gen_tcp:listen(Port, ?TCP_OPTS),
+ ?MODULE:accept(L).
+
+%Accept a connection
+accept(L) ->
+ {ok, Socket} = gen_tcp:accept(L),
+ ?MODULE:read_loop(Socket),
+ io:fwrite("Done reading, connection was closed\n"),
+ ?MODULE:accept(L).
+
+%Read everything it sends us
+read_loop(Socket) ->
+ inet:setopts(Socket, [{active, once}]),
+ receive
+ {tcp, _, _} ->
+ do_stuff_here,
+ ?MODULE:read_loop(Socket);
+ {tcp_closed, _}-> donezo;
+ {tcp_error, _, _} -> donezo
+ end.
+```
+
+This code isn't actually usable for a production system; it doesn't even spawn a
+new process for the new socket. But that's not the point I'm making. If I run it
+with `tcp_test:listen(8000)`, and in another window do:
+
+```bash
+while [ 1 ]; do echo "aloha"; done | nc localhost 8000
+```
+
+We'll be flooding the server with data pretty well. Using [eprof][4] we can
+get an idea of how our code performs, and where the hang-ups are:
+
+```erlang
+1> eprof:start().
+{ok,<0.34.0>}
+
+2> P = spawn(tcp_test,listen,[8000]).
+<0.36.0>
+
+3> eprof:start_profiling([P]).
+profiling
+
+4> running_the_while_loop.
+running_the_while_loop
+
+5> eprof:stop_profiling().
+profiling_stopped
+
+6> eprof:analyze(procs,[{sort,time}]).
+
+****** Process <0.36.0> -- 100.00 % of profiled time ***
+FUNCTION CALLS % TIME [uS / CALLS]
+-------- ----- --- ---- [----------]
+prim_inet:type_value_2/2 6 0.00 0 [ 0.00]
+
+....snip....
+
+prim_inet:enc_opts/2 6 0.00 8 [ 1.33]
+prim_inet:setopts/2 12303599 1.85 1466319 [ 0.12]
+tcp_test:read_loop/1 12303598 2.22 1761775 [ 0.14]
+prim_inet:encode_opt_val/1 12303599 3.50 2769285 [ 0.23]
+prim_inet:ctl_cmd/3 12303600 4.29 3399333 [ 0.28]
+prim_inet:enc_opt_val/2 24607203 5.28 4184818 [ 0.17]
+inet:setopts/2 12303598 5.72 4533863 [ 0.37]
+erlang:port_control/3 12303600 77.13 61085040 [ 4.96]
+```
+
+eprof shows us where our process is spending the majority of its time. The `%`
+column indicates the percentage of profiled time the process spent inside
+each function. We can pretty clearly see that the vast majority of time was spent
+inside `erlang:port_control/3`, the BIF that `inet:setopts/2` uses to switch the
+socket to `{active,once}` mode. Amongst the calls made on every loop it takes
+up by far the most time. In addition, all of the other expensive calls are
+also related to `inet:setopts/2`.
+
+I'm gonna rewrite our little listen server to use `{active,true}`, and we'll do
+it all again:
+
+```erlang
+-module(tcp_test).
+-compile(export_all).
+
+-define(TCP_OPTS, [
+ binary,
+ {packet, raw},
+ {nodelay,true},
+ {active, false},
+ {reuseaddr, true},
+ {keepalive,true},
+ {backlog,500}
+]).
+
+%Start listening
+listen(Port) ->
+ {ok, L} = gen_tcp:listen(Port, ?TCP_OPTS),
+ ?MODULE:accept(L).
+
+%Accept a connection
+accept(L) ->
+ {ok, Socket} = gen_tcp:accept(L),
+ inet:setopts(Socket, [{active, true}]), %Well this is new
+ ?MODULE:read_loop(Socket),
+ io:fwrite("Done reading, connection was closed\n"),
+ ?MODULE:accept(L).
+
+%Read everything it sends us
+read_loop(Socket) ->
+ %inet:setopts(Socket, [{active, once}]),
+ receive
+ {tcp, _, _} ->
+ do_stuff_here,
+ ?MODULE:read_loop(Socket);
+ {tcp_closed, _}-> donezo;
+ {tcp_error, _, _} -> donezo
+ end.
+```
+
+And the profiling results:
+
+```erlang
+1> eprof:start().
+{ok,<0.34.0>}
+
+2> P = spawn(tcp_test,listen,[8000]).
+<0.36.0>
+
+3> eprof:start_profiling([P]).
+profiling
+
+4> running_the_while_loop.
+running_the_while_loop
+
+5> eprof:stop_profiling().
+profiling_stopped
+
+6> eprof:analyze(procs,[{sort,time}]).
+
+****** Process <0.36.0> -- 100.00 % of profiled time ***
+FUNCTION CALLS % TIME [uS / CALLS]
+-------- ----- --- ---- [----------]
+prim_inet:enc_value_1/3 7 0.00 1 [ 0.14]
+prim_inet:decode_opt_val/1 1 0.00 1 [ 1.00]
+inet:setopts/2 1 0.00 2 [ 2.00]
+prim_inet:setopts/2 2 0.00 2 [ 1.00]
+prim_inet:enum_name/2 1 0.00 2 [ 2.00]
+erlang:port_set_data/2 1 0.00 2 [ 2.00]
+inet_db:register_socket/2 1 0.00 3 [ 3.00]
+prim_inet:type_value_1/3 7 0.00 3 [ 0.43]
+
+.... snip ....
+
+prim_inet:type_opt_1/1 19 0.00 7 [ 0.37]
+prim_inet:enc_value/3 7 0.00 7 [ 1.00]
+prim_inet:enum_val/2 6 0.00 7 [ 1.17]
+prim_inet:dec_opt_val/1 7 0.00 7 [ 1.00]
+prim_inet:dec_value/2 6 0.00 10 [ 1.67]
+prim_inet:enc_opt/1 13 0.00 12 [ 0.92]
+prim_inet:type_opt/2 19 0.00 33 [ 1.74]
+erlang:port_control/3 3 0.00 59 [ 19.67]
+tcp_test:read_loop/1 20716370 100.00 12187488 [ 0.59]
+```
+
+This time our process spent almost no time at all (according to eprof, 0%)
+fiddling with the socket opts. Instead it spent all of its time in the
+`read_loop` doing the work we actually want to be doing.
+
+# So what does this mean?
+
+I'm by no means advocating never using `{active,once}`. The security concern is
+still a completely valid concern and one that `{active,once}` mitigates quite
+well. I'm simply pointing out that this mitigation has some fairly serious
+performance implications which have the potential to bite you if you're not
+careful, especially in cases where a socket is going to be receiving a large
+amount of traffic.
+
+# Meta
+
+These tests were done using R15B03, but I've done similar ones in R14 and found
+similar results. I have not tested R16.
+
+* \[0] http://learnyousomeerlang.com/buckets-of-sockets
+* \[1] http://www.erlang.org/doc/man/gen_tcp.html#examples
+* \[2] http://erlycoder.com/25/erlang-tcp-server-tcp-client-sockets-with-gen_tcp
+
+[0]: http://learnyousomeerlang.com/content
+[1]: http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-1
+[2]: http://www.erlang.org/doc/man/inet.html#setopts-2
+[3]: http://www.erlang.org/doc/man/gen_tcp.html#recv-2
+[4]: http://www.erlang.org/doc/man/eprof.html
diff --git a/generations.md b/generations.md
new file mode 100644
index 0000000..d36b175
--- /dev/null
+++ b/generations.md
@@ -0,0 +1,95 @@
+# Generations
+
+A simple file distribution strategy for very large scale, high-availability
+file-services.
+
+# The problem
+
+I work at a shop where we have millions of different files, any of which could
+be arbitrarily requested by a client at any given time. These files are
+uploaded by users of the app and retrieved by others.
+
+Scaling such a system is no easy task. The chosen solution involves shuffling
+files around on a nearly constant basis, making sure that files which are more
+"popular" are on fast drives, while at the same time making sure that no drives
+are at capicty and at the same time that all files, even newly uploaded ones,
+are stored redundantly.
+
+The problem with this solution is one of coordination. At any given moment the
+app needs to be able to "find" a file so it can give the client a link to
+download the file from one of the servers that it's on. Fulfilling this simple
+requirement means that all datastores/caches where information about where a
+file lives need to be up-to-date at all times, and even then there are
+race-conditions and network failures to contend with, while at all times the
+requirements of the app evolve and change.
+
+# A simpler solution
+
+Let's say you want all files which get uploaded to be replicated in triplicate
+in some capacity. You buy three identical hard-disks, and put each on a separate
+server. As files get uploaded by clients, each file gets put on each drive
+immediately. When the drives are filled (which should be at around the same
+time), you stop uploading to them.
+
+That was generation 0.
+
+You buy three more drives, and start putting all files on them instead. This is
+going to be generation 1. Repeat until you run out of money.
+
+That's it.
+
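The whole upload path is small enough to sketch in shell. To be clear, this is a hypothetical illustration of the scheme, not anything running at cryptic.io; the directory layout and function names are made up, and a "drive" here is just a mounted directory:

```shell
#!/bin/sh
# Hypothetical sketch of the generations upload path: every incoming
# file is copied onto every drive in the one generation currently
# accepting uploads.

# upload ROOT GEN FILE: copy FILE onto every drive in generation GEN.
upload() {
    root=$1; gen=$2; file=$3
    for drive in "$root/$gen"/*; do
        cp "$file" "$drive/" || return 1
    done
}

# find_replicas ROOT GEN NAME: print every replica of NAME in
# generation GEN. The only per-file metadata the app has to store is
# which generation the file was uploaded to.
find_replicas() {
    root=$1; gen=$2; name=$3
    for drive in "$root/$gen"/*; do
        [ -f "$drive/$name" ] && echo "$drive/$name"
    done
    return 0
}
```

Note that `upload` never talks to anything but the drives themselves, which is the whole point: there's no coordination layer to keep consistent.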
+## That's it?
+
+It seems simple and obvious, and maybe it's the standard thing which is done,
+but as far as I can tell no one has written about it (though I'm probably not
+searching for the right thing, let me know if this is the case!).
+
+## Advantages
+
+* It's so simple to implement, you could probably do it in a day if you're
+starting a project from scratch.
+
+* By definition of the scheme all files are replicated in multiple places.
+
+* Minimal information about where a file "is" needs to be stored. When a file is
+uploaded all that's needed is to know what generation it is in, and then what
+nodes/drives are in that generation.
+
+* Drives don't need to "know" about each other. What I mean by this is that
+whatever is running as the receive point for file-uploads on each drive doesn't
+have to coordinate with its siblings running on the other drives in the
+generation. In fact it doesn't need to coordinate with anyone. You could
+literally rsync files onto your drives if you wanted to. I would recommend using
+[marlin][0] though :)
+
+* Scaling is easy. When you run out of space you can simply start a new
+generation. If you don't like playing that close to the chest there's nothing to
+say you can't have two generations active at the same time.
+
+* Upgrading is easy. As long as a generation is not marked-for-upload, you can
+easily copy all files in the generation into a new set of bigger, badder drives,
+add those drives into the generation in your code, remove the old ones, then
+mark the generation as uploadable again.
+
+* Distribution is easy. You just copy a generation's files onto a new drive in
+Europe or wherever you're getting an uptick in traffic from and you're good to
+go.
+
+* Management is easy. It's trivial to find out how many times a file has been
+replicated, or how many countries it's in, or what hardware it's being served
+from (given you have easy access to information about specific drives).
+
+## Caveats
+
+The big caveat here is that this is just an idea. It has NOT been tested in
+production. But we have enough faith in it that we're going to give it a shot at
+cryptic.io. I'll keep this page updated.
+
+The second caveat is that this scheme does not inherently support caching. If a
+file suddenly becomes super popular the world over your hard-disks might not be
+able to keep up, and it's probably not feasible to have an FIO drive in *every*
+generation. I think that [groupcache][1] may be the answer to this problem,
+assuming your files are reasonably small, but again I haven't tested it yet.
+
+[0]: https://github.com/cryptic-io/marlin
+[1]: https://github.com/golang/groupcache
diff --git a/goplus.md b/goplus.md
new file mode 100644
index 0000000..58ab303
--- /dev/null
+++ b/goplus.md
@@ -0,0 +1,73 @@
+# Go and project root
+
+Compared to other languages, go has some strange behavior regarding its project
+root settings. If you import a library called `somelib`, go will look for a
+`src/somelib` folder in all of the folders in the `$GOPATH` environment
+variable. This works nicely for globally installed packages, but it makes
+encapsulating a project with a specific version, or modified version, rather
+tedious. Whenever you go to work on this project you'll have to add its path to
+your `$GOPATH`, or add the path permanently, which could break other projects
+which may use a different version of `somelib`.
+
+My solution is in the form of a simple script I'm calling go+. go+ will search
+in the current directory and all of its parents for a file called `GOPROJROOT`. If
+it finds that file in a directory, it prepends that directory's absolute path to
+your `$GOPATH` and stops the search. Regardless of whether or not `GOPROJROOT`
+was found, go+ will pass all arguments through to the actual go call. The
+modification to `$GOPATH` will only last the duration of the call.
+
+As an example, consider the following:
+```
+/tmp
+ /hello
+ GOPROJROOT
+ /src
+ /somelib/somelib.go
+ /hello.go
+```
+
+If `hello.go` depends on `somelib`, as long as you run go+ from `/tmp/hello` or
+one of its children, your project will still compile.
+
+Here is the source code for go+:
+
+```bash
+#!/bin/sh
+
+SEARCHING_FOR=GOPROJROOT
+ORIG_DIR=$(pwd)
+
+STOPSEARCH=0
+SEARCH_DIR=$ORIG_DIR
+while [ $STOPSEARCH = 0 ]; do
+
+ RES=$( find "$SEARCH_DIR" -maxdepth 1 -type f -name "$SEARCHING_FOR" | \
+ grep -P "$SEARCHING_FOR$" | \
+ head -n1 )
+
+ if [ "$RES" = "" ]; then
+ if [ "$SEARCH_DIR" = "/" ]; then
+ STOPSEARCH=1
+ fi
+ cd ..
+ SEARCH_DIR=$(pwd)
+ else
+ export GOPATH=$SEARCH_DIR:$GOPATH
+ STOPSEARCH=1
+ fi
+done
+
+cd "$ORIG_DIR"
+exec go "$@"
+```
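
The upward search is the only subtle part of the script. A stripped-down, standalone version of just that step might look like this (the function name is mine, not part of go+):

```shell
#!/bin/sh
# Walk from a starting directory up toward /, printing the first
# directory (the start itself, or an ancestor) containing a file
# named GOPROJROOT. Returns non-zero if none is found.
find_projroot() {
    dir=$1
    while [ "$dir" != "/" ]; do
        if [ -f "$dir/GOPROJROOT" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
        dir=$(dirname "$dir")
    done
    return 1
}
```

With that in hand, go+ effectively runs `GOPATH=$PROJROOT:$GOPATH go "$@"`, which is why the `$GOPATH` change only lasts for the one invocation.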
+
+# UPDATE: Goat
+
+I'm leaving this post for posterity, but go+ has some serious flaws in it. For
+one, it doesn't allow for specifying the version of a dependency you want to
+use. To this end, I wrote [goat][0] which does all the things go+ does, plus
+real dependency management, PLUS it is built in a way that if you've been
+following go's best-practices for code organization you shouldn't have to change
+any of your existing code AT ALL. It's cool, check it out.
+
+[0]: http://github.com/mediocregopher/goat
diff --git a/lagom-master.zip b/lagom-master.zip
deleted file mode 100644
index ef9cd79..0000000
--- a/lagom-master.zip
+++ /dev/null
Binary files differ
diff --git a/lagom-master/.gitignore b/lagom-master/.gitignore
deleted file mode 100644
index f11e635..0000000
--- a/lagom-master/.gitignore
+++ /dev/null
@@ -1,2 +0,0 @@
-_site/
-.DS_Store \ No newline at end of file
diff --git a/res/go+ b/res/go+
new file mode 100755
index 0000000..835a72d
--- /dev/null
+++ b/res/go+
@@ -0,0 +1,27 @@
+#!/bin/sh
+
+SEARCHING_FOR=GOPROJROOT
+ORIG_DIR=$(pwd)
+
+STOPSEARCH=0
+SEARCH_DIR=$ORIG_DIR
+while [ $STOPSEARCH = 0 ]; do
+
+ RES=$( find "$SEARCH_DIR" -maxdepth 1 -type f -name "$SEARCHING_FOR" | \
+ grep -P "$SEARCHING_FOR$" | \
+ head -n1 )
+
+ if [ "$RES" = "" ]; then
+ if [ "$SEARCH_DIR" = "/" ]; then
+ STOPSEARCH=1
+ fi
+ cd ..
+ SEARCH_DIR=$(pwd)
+ else
+ export GOPATH=$SEARCH_DIR:$GOPATH
+ STOPSEARCH=1
+ fi
+done
+
+cd "$ORIG_DIR"
+exec go "$@"