Blog Home  Home RSS 2.0 Atom 1.0 CDF  
hacker emblem
PalmerEk - SimpleDB, BigTable, HBase
using System.IO;
using System.Web;
using System.Mind;
 
 Friday, May 23, 2008

Over the past couple of days I've spent my free brain cycles trying to wrap my head around the various implementations of persistent multidimensional sparse arrays which seem to have cropped up recently.  In particular I've been focusing in on Amazon's SimpleDB service, and I have to say that the mind shift from an RDBMS mentality has been a bit difficult.

The underlining concept is pretty straight forward and is something most low-level developers have used at one point in their life or another.  Basically it boils down to a multidimensional sparse array that can be persisted, queried and retrieved.  A rough translation of SimpleDB terms to RDBMS would be

SimpleDB RDBMS
Domain Schema
Table Table
Row Row
Column Column
Attribute Column Value(s)

While at first glance it seems like a reasonable one-to-one mapping there are some very important differences.  First of the columns for one row may not be the same as the columns for another row.  This is where the "sparse" part comes in.  If a column doesn't have an attribute then that column doesn't exist in the row.  This also allows for columns to be easily added on the fly.  The next significant difference is that an attribute can be comprised of multiple values.  Unlike an RDBMS where multiple values for a given column are stored in multiple rows, in SimpleDB a single row may have a single column with dozens of values.  Once you wrap your head around this it makes sense and you realize the value of this approach.

Where the difficulty comes in is trying to picture how this would be used in a real world application.  I have an active project I'm currently working on (active = touched in the past 30 days) which is basically Yet Another Social Network.  It's comprised of Users, Posts, Comments and Grouping information.  In an RDBMS these would loosely translate into tables which would relate to each other through a set of mapping tables.  Those tables could then be queried using the standard JOIN nomenclature to do things like retrieve the list of Comments for a Post and the Users who authored each of them.

In the SimpleDB world things work a little differently.  First off it's not a good idea to store "large" objects in the DB.  Instead it is better to store the large objects on something more akin to a file system (aka: S3).  So in this example instead of storing the actual Post or Comment in the DB, each would be stored in a separate file in an S3 bucket.  What would be stored in the DB would be the information you would want to search and information about the relationships of the various files.

With my current thinking that looks like a single domain/table that looks minimally like the following:

Post
User
Group
Comment

An example of how the data would then appear would be:

Post User Group Comment
Post1.html User1.xml Group1.txt Comment1.html, Comment2.html, Comment3.html
Post2.html User1.xml Group1.txt Comment5.html
Post3.html User2.xml Group1.txt  

In order to pull out the list of comments for a given post we have to query the data for the post and then retrieve each of the attribute values for the Comment column.  We would then retrieve the Post file and each of the subsequent Comment files as indicated by the attributes values.  Each of the comment files would contain the User which submitted the comment and the associated User files would then need to be retrieved. 

Now you're saying to yourself "This seems like a lot of work for something that could be done via a single query in an RDBMS.  What's the deal?"  Here's my take on it:

  • I may not have my head wrapped around it well enough yet and may be going about it all wrong.
  • Since the DB is distributed and the file system is distributed reliability and scalability are very high.
  • Moving the semi-static data out of the DB makes it very easy to integrate into a CDN greatly boosting delivery speed.
  • It's good to think about different solutions to a problem.

Barring extremely high demand I'm not yet convinced that this is a better solution than a few tables in a SQLServer but I'm willing to give it a try in the name of education.  Besides, you can't recommend something without trying it out first and where better to do that than a non-critical side project ;-}

 

TTFN

Friday, May 23, 2008 4:19:11 AM (GMT Standard Time, UTC+00:00)  #    Comments [1]    | 
Monday, June 09, 2008 11:35:27 PM (GMT Standard Time, UTC+00:00)
thank you for this topic.
In order to pull out the list of comments for a given post we have to query the data for the post and then retrieve each of the attribute values for the Comment column. We would then retrieve the Post file and each of the subsequent Comment files as indicated by the attributes values. Each of the comment files would contain the User which submitted the comment and the associated User files would then need to be retrieved.

Now you're saying to yourself "This seems like a lot of work for something that could be done via a single query in an RDBMS. What's the deal?" Here's my take on it:

* I may not have my head wrapped around it well enough yet and may be going about it all wrong.
* Since the DB is distributed and the file system is distributed reliability and scalability are very high.
* Moving the semi-static data out of the DB makes it very easy to integrate into a CDN greatly boosting delivery speed.
* It's good to think about different solutions to a problem.

Barring extremely high demand I'm not yet convinced that this is a better solution than a few tables in a SQLServer but I'm willing to give it a try in the name of education. Besides, you can't recommend something without trying it out first and where better to do that than a non-critical side project ;-}
____________________________________________________________________________________
oks sonucu sbs sonuçları anadolu jet yurtlar düğün salonu
Comments are closed.

Disclaimer
The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.


Copyright © 2008 Ek Palmer. All rights reserved.
Pick a theme: