Over the past couple of days I've spent my free brain cycles trying to wrap my head around the various implementations of persistent multidimensional sparse arrays which seem to have cropped up recently. In particular I've been focusing in on Amazon's SimpleDB service, and I have to say that the mind shift from an RDBMS mentality has been a bit difficult.
The underlining concept is pretty straight forward and is something most low-level developers have used at one point in their life or another. Basically it boils down to a multidimensional sparse array that can be persisted, queried and retrieved. A rough translation of SimpleDB terms to RDBMS would be
While at first glance it seems like a reasonable one-to-one mapping there are some very important differences. First of the columns for one row may not be the same as the columns for another row. This is where the "sparse" part comes in. If a column doesn't have an attribute then that column doesn't exist in the row. This also allows for columns to be easily added on the fly. The next significant difference is that an attribute can be comprised of multiple values. Unlike an RDBMS where multiple values for a given column are stored in multiple rows, in SimpleDB a single row may have a single column with dozens of values. Once you wrap your head around this it makes sense and you realize the value of this approach.
Where the difficulty comes in is trying to picture how this would be used in a real world application. I have an active project I'm currently working on (active = touched in the past 30 days) which is basically Yet Another Social Network. It's comprised of Users, Posts, Comments and Grouping information. In an RDBMS these would loosely translate into tables which would relate to each other through a set of mapping tables. Those tables could then be queried using the standard JOIN nomenclature to do things like retrieve the list of Comments for a Post and the Users who authored each of them.
In the SimpleDB world things work a little differently. First off it's not a good idea to store "large" objects in the DB. Instead it is better to store the large objects on something more akin to a file system (aka: S3). So in this example instead of storing the actual Post or Comment in the DB, each would be stored in a separate file in an S3 bucket. What would be stored in the DB would be the information you would want to search and information about the relationships of the various files.
With my current thinking that looks like a single domain/table that looks minimally like the following:
An example of how the data would then appear would be:
In order to pull out the list of comments for a given post we have to query the data for the post and then retrieve each of the attribute values for the Comment column. We would then retrieve the Post file and each of the subsequent Comment files as indicated by the attributes values. Each of the comment files would contain the User which submitted the comment and the associated User files would then need to be retrieved.
Now you're saying to yourself "This seems like a lot of work for something that could be done via a single query in an RDBMS. What's the deal?" Here's my take on it:
Barring extremely high demand I'm not yet convinced that this is a better solution than a few tables in a SQLServer but I'm willing to give it a try in the name of education. Besides, you can't recommend something without trying it out first and where better to do that than a non-critical side project ;-}
TTFN
Disclaimer The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.