<home

The Relational Model Turns 40

August 18, 2009

The Relational Data Model is 40 years old. Edgar Codd first described the idea that databases could be constructed from mathematical relations in an internal IBM Research report in August 1969. The following year his more famous paper on the same topic was published for public consumption. So began a revolution. Data was liberated from its physical representation and thereafter the most complex data processing problems could be solved using only a few simple building blocks (the eight basic operators of Codd’s relational algebra). Few concepts in the history IT have made such an impact or endured for so long.

At this point I could try to write something profound about what the next 40 years might hold... I’m not foolish enough to predict anything like that far ahead though! For sure there is a lot more that could be done. The SQL systems that most of us use are still some way short of what Codd envisaged even four decades ago. It seems pretty clear that technology already exists to make possible much better database systems than we have today. Solid-state storage for example should make it possible to achieve far greater Physical Data Independence (database representation that is more abstracted from its internal storage structures). Richer type support in databases is something that database vendors have been improving but there is still some way to go – in particular type-inheritance isn’t well-supported by most DBMSs. Self-optimising databases is one area where I expect we’ll see a lot more innovation.  And then of course there is ever more scalability and distributed database technology, which brings me to the cloud and distributed databases…

My potentially subversive question is this: Do we actually need new data models for the cloud? It seems that we do need different and hopefully better implementations of database systems than we have in centralised data centres. We will go on using XML stores and other types of storage too. But the relational model still ought to have a central role to play in cloudy, distributed data stores as well as the [on-premises] ones.

Brewer conjectures that we may need to accept partial or eventual consistency of data if we desire a system that’s both available and partitioned. Two observations about this seem important. Firstly, consistency guaranteed at some point will continue to be critical for the vast majority of non-trivial systems. No business is ever giving up consistency and accuracy as desirable goals because trust is the basis of every business model out there. Secondly, eventual consistency means that fewer business rules are implemented in the declarative manner that we are used to, i.e. they are evaluated some time after the event rather than at transaction time.

This could actually suit the relational model very well. Because the RM doesn't have any navigational structures it can easily accommodate data that is late arriving. Provided no constraint is actually broken (as if we aren’t using declarative referential integrity for example) it is perfectly OK for some tuples to arrive “late” in our relational model. We can still query and join on the tuples that are present, we can derive place-holders for tuples that are missing (think outer joins) and we can apply integrity rules to data when it does arrive. This is in stark contrast to navigational models (like XML and other hierarchies) where an element must exist before its child elements can exist.

Another desirable property of large distributed systems is a flexible and dynamic structure. RM is good here too. In principle relational schemas can be highly dynamic but unfortunately this is one area where SQL systems are found wanting. In a relational system, changing the set of attributes of a relation ought to be as simple as writing the query that transforms the data and then assigning the result of that query to a new relation variable. Potentially this needs to involve little or no change to the underlying storage and can therefore be a near-instantaneous metadata operation. SQL DBMSs make extremely heavy work of such a simple task however, which is why SQL databases are often difficult to change and adapt.

Yet another thing on the distributed storage wish-list is rich type support – support for storing documents, images, XML and other objects in databases just as easily as strings and numbers. As already mentioned, SQL systems are doing some of that already and vendors will no doubt continue to improve type support.

The many familiar and valid criticisms of SQL (see for example the “NoSQL” line of thought) are just that – problems with the SQL model and SQL DBMS limitations rather than with the relational model. (See also NoSQL and the Relational Model: don’t throw the baby out with the bathwater). So even after 40 years I still think the relational model is the coolest game in data-land, and perhaps the best is still to come because RM’s potential has never yet been fully realised in mainstream database software. Kudos Mr Codd!