From 19eba1a5f25c1c27216fcef20ef16241a235e54f Mon Sep 17 00:00:00 2001
From: Brian Picciano
Date: Wed, 16 Oct 2013 20:29:35 -0400
Subject: fixed posts, got rid of sample posts

---
 generations.md | 95 ----------------------------------------------------------
 1 file changed, 95 deletions(-)
 delete mode 100644 generations.md

(limited to 'generations.md')

diff --git a/generations.md b/generations.md
deleted file mode 100644
index d36b175..0000000
--- a/generations.md
+++ /dev/null
@@ -1,95 +0,0 @@
-# Generations
-
-A simple file distribution strategy for very large scale, high-availability
-file-services.
-
-# The problem
-
-Working at a shop where we have millions of different files, any of which could
-be arbitrarily chosen to serve to a client at any given time. These files are
-uploaded by users of the app and retrieved by others.
-
-Scaling such a system is no easy task. The chosen solution involves shuffling
-files around on a nearly constant basis, making sure that files which are more
-"popular" are on fast drives, while at the same time making sure that no drives
-are at capacity and at the same time that all files, even newly uploaded ones,
-are stored redundantly.
-
-The problem with this solution is one of coordination. At any given moment the
-app needs to be able to "find" a file so it can give the client a link to
-download the file from one of the servers that it's on. Fulfilling this simple
-requirement means that all datastores/caches where information about where a
-file lives need to be up-to-date at all times, and even then there are
-race-conditions and network failures to contend with, while at all times the
-requirements of the app evolve and change.
-
-# A simpler solution
-
-Let's say you want all files which get uploaded to be replicated in triplicate
-in some capacity. You buy three identical hard-disks, and put each on a separate
-server. As files get uploaded by clients, each file gets put on each drive
-immediately. When the drives are filled (which should be at around the same
-time), you stop uploading to them.
-
-That was generation 0.
-
-You buy three more drives, and start putting all files on them instead. This is
-going to be generation 1. Repeat until you run out of money.
-
-That's it.
-
-## That's it?
-
-It seems simple and obvious, and maybe it's the standard thing which is done,
-but as far as I can tell no-one has written about it (though I'm probably not
-searching for the right thing, let me know if this is the case!).
-
-## Advantages
-
-* It's so simple to implement, you could probably do it in a day if you're
-starting a project from scratch.
-
-* By definition of the scheme all files are replicated in multiple places.
-
-* Minimal information about where a file "is" needs to be stored. When a file is
-uploaded all that's needed is to know what generation it is in, and then what
-nodes/drives are in that generation.
-
-* Drives don't need to "know" about each other. What I mean by this is that
-whatever is running as the receive point for file-uploads on each drive doesn't
-have to coordinate with its siblings running on the other drives in the
-generation. In fact it doesn't need to coordinate with anyone. You could
-literally rsync files onto your drives if you wanted to. I would recommend using
-[marlin][0] though :)
-
-* Scaling is easy. When you run out of space you can simply start a new
-generation. If you don't like playing that close to the chest there's nothing to
-say you can't have two generations active at the same time.
-
-* Upgrading is easy. As long as a generation is not marked-for-upload, you can
-easily copy all files in the generation into a new set of bigger, badder drives,
-add those drives into the generation in your code, remove the old ones, then
-mark the generation as uploadable again.
-
-* Distribution is easy. You just copy a generation's files onto a new drive in
-Europe or wherever you're getting an uptick in traffic from and you're good to
-go.
-
-* Management is easy. It's trivial to find out how many times a file has been
-replicated, or how many countries it's in, or what hardware it's being served
-from (given you have easy access to information about specific drives).
-
-## Caveats
-
-The big caveat here is that this is just an idea. It has NOT been tested in
-production. But we have enough faith in it that we're going to give it a shot at
-cryptic.io. I'll keep this page updated.
-
-The second caveat is that this scheme does not inherently support caching. If a
-file suddenly becomes super popular the world over your hard-disks might not be
-able to keep up, and it's probably not feasible to have an FIO drive in *every*
-generation. I think that [groupcache][1] may be the answer to this problem,
-assuming your files are reasonably small, but again I haven't tested it yet.
-
-[0]: https://github.com/cryptic-io/marlin
-[1]: https://github.com/golang/groupcache
-- 
cgit v1.2.3
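
The bookkeeping that the removed post describes (a file only needs to remember its generation, and a generation only needs to list its drives) can be sketched roughly as follows. This is a minimal illustration and not code from the post or from marlin: the `Generation` and `Cluster` classes, their fields, and the drive names are all hypothetical, and actually copying bytes to the drives is left to the caller.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Generation:
    """One set of identical replica drives that fill up together (hypothetical model)."""
    drives: list[str]        # identifiers of the servers/drives in this generation
    capacity: int            # bytes available on each drive
    used: int = 0            # bytes recorded so far (same on every drive)
    uploadable: bool = True  # cleared while a generation is being migrated


@dataclass
class Cluster:
    generations: list[Generation] = field(default_factory=list)
    file_gen: dict[str, int] = field(default_factory=dict)  # file name -> generation index

    def upload(self, name: str, size: int) -> list[str]:
        """Record the file in the first uploadable generation that has room."""
        for idx, gen in enumerate(self.generations):
            if gen.uploadable and gen.used + size <= gen.capacity:
                gen.used += size
                self.file_gen[name] = idx
                return gen.drives  # the caller copies the file to every drive listed
        raise RuntimeError("no generation has room; add a new one")

    def locate(self, name: str) -> str:
        """Return any one replica drive to build a download link from."""
        gen = self.generations[self.file_gen[name]]
        return random.choice(gen.drives)


# Assumed example: generation 0 nearly fills, so generation 1 is added.
cluster = Cluster([Generation(["a0", "b0", "c0"], capacity=100)])
cluster.upload("cat.jpg", 80)
cluster.generations.append(Generation(["a1", "b1", "c1"], capacity=100))
cluster.upload("dog.jpg", 50)  # too big for what is left of generation 0
```

Under these assumptions the only per-file state is a single integer, which is the point the post makes about needing minimal information to find a file.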