Friday, May 16, 2008

Data is no longer relational

OK, so the headline may be a bit of an overstatement for effect. Perhaps I could have said that the data people are interested in is no longer relational, but then it wouldn't have been nearly so pithy.

With much respect to Edgar Codd and his invention of the relational model for database storage, I think it is time to move forward. Relational databases are great for financial models, personnel data and other enterprise systems, as well as many other kinds of standard, repetitive data. The relational model has also given several generations of budding computer scientists a solid reference point for learning data structures and modeling.

When I say it is time to move forward, I am not saying we should immediately move all data systems to object-oriented databases or otherwise induce data chaos. What I do want to push is the idea of unstructured data. Computers are great with rules, with structure and with fundamentally binary relationships. Algorithms for unstructured data are starting to mature (for an example, go search Google), but they are still not widespread or well understood.

Exciting new systems such as Amazon's Dynamo data store are showing in real-world situations that distributed systems and distributed data are a reality. (Werner Vogels is one of my favorite speakers and bloggers on distributed tech; if you are not familiar with his work in this space, you should be.)

Because people are so familiar with relational data structures and systems, many system designs end up with object models that look like a relational database design. When asked why the model looks like this, people fairly consistently answer with things like "this is how the database stores it, for speed we need to do the same" or "it just made sense when we pulled the DBA in to help us with the model."

Objects are not relational! They are objects. Once you get into full structures of objects, or object trees, there are relationships, but they are not the same as those in a relational database. And as we start to build and mature distributed system algorithms, a centralized data store makes less and less sense. If the data can be broken up, distributed and stored alongside the algorithms that will use it, performance will improve.
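To make the difference concrete, here is a minimal sketch in Python. Every name in it (OrderRow, LineItemRow, Order, LineItem) is made up for illustration and does not come from any real system. The first half is an object model shaped like database tables; the second half lets the order own its line items directly.

```python
from dataclasses import dataclass, field


# Relational-shaped model: flat rows tied together by a foreign key.
# All class and field names here are hypothetical.
@dataclass
class OrderRow:
    order_id: int
    customer_name: str


@dataclass
class LineItemRow:
    order_id: int  # foreign key pointing back at OrderRow
    sku: str
    quantity: int
    unit_price_cents: int


def order_total_relational(order_id, line_item_rows):
    # Emulates a join: scan the whole "table" for rows matching the key.
    return sum(
        row.quantity * row.unit_price_cents
        for row in line_item_rows
        if row.order_id == order_id
    )


# Object-shaped model: the relationship is ownership, not a key lookup.
@dataclass
class LineItem:
    sku: str
    quantity: int
    unit_price_cents: int

    def subtotal(self):
        return self.quantity * self.unit_price_cents


@dataclass
class Order:
    customer_name: str
    items: list = field(default_factory=list)

    def total(self):
        # The order walks its own tree; no shared table is consulted.
        return sum(item.subtotal() for item in self.items)
```

Because an Order carries everything it needs to compute its total, the whole object can be moved to, and processed on, whichever machine holds it. The row-based version needs access to the shared line-item table first.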

In fact, I would argue that the elusive SLA for system response time can begin to be discussed once we tie the data to the processing. Granted, this model brings new complexities around synchronization, segmentation and consistency, but there are ways to solve them. Similarly, it is possible to route requests consistently to the same servers.
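One well-known way to get that kind of consistent routing is consistent hashing, the partitioning approach described for Dynamo. What follows is only a rough sketch of the idea, not Dynamo's actual implementation. The HashRing class, the server names and the choice of 100 virtual points per server are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring mapping keys to servers (illustrative only)."""

    def __init__(self, servers, replicas=100):
        # Each server is placed on the ring at `replicas` virtual points
        # so that keys spread more evenly across servers.
        ring = []
        for server in servers:
            for i in range(replicas):
                ring.append((self._hash(f"{server}#{i}"), server))
        ring.sort()
        self._ring = ring
        self._points = [point for point, _ in ring]

    @staticmethod
    def _hash(key):
        # Any stable hash works; SHA-256 is used here for convenience.
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def server_for(self, key):
        # Walk clockwise to the first point past the key's hash,
        # wrapping around to the start of the ring if needed.
        idx = bisect.bisect_right(self._points, self._hash(key))
        return self._ring[idx % len(self._ring)][1]
```

With a ring such as HashRing(["node-a", "node-b", "node-c"]), calling server_for("user:42") always returns the same server, so the data for that key and the code that processes it can live together. If one server is removed, only the keys that lived on it move; every other key keeps its server.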

What other great examples of distributed computing and distributed data storage have you seen?
