blog‎ > ‎Geek corner‎ > ‎

Why do I need a schema anyway?

posted Nov 10, 2013, 10:38 AM by Max Tardiveau   [ updated Dec 6, 2013, 2:48 PM ]
"Art is limitation; the essence of every picture is the frame." — G.K. Chesterton

Summary

This article  is not yet another argument in the tiresome SQL vs. NoSQL debate. I think both technologies have their place. This is an explanation of the benefits of using a schema when the data can benefit from it.

Most NoSQL databases store data either in key/value form, or as XML/JSON documents. In almost all cases, they lack the concept of a schema. This presents certain advantages: programmers can store any data they want, they can change how they store the data over time without migrating old data, etc... That make sense for unstructured data, but when it comes to structured data, these advantages are offset by significant, and (I think) under-reported, downsides regarding the value of the data, and its long-term viability.

In this article, I describe how a schema can be an important asset when dealing with many types of data, and how the concept of schema can be extended to make it even more useful. 

Why a schema?

When writing software, we usually think of what the system is supposed to do. We should also think about what the software is not supposed to do.

In many ways, that's what a schema does. It's a way to define how data should behave, and how it should not behave. It's a way to draw the line between the "good" space, where data is consistent, and the "bad" space, where data is not consistent.

That is the main purpose of a schema. It's not a crutch to help the database engine. It's not an arbitrary set of limits created solely for the purpose of frustrating the programmer's creativity. It's about carving out a well-defined area in an infinite space of possibilities.

Advantages of having a schema

As a communication tool
The first advantage of having a schema is that it brings structure. This may sound tautological but I don't think it is. Having a formally defined structure for your data means that all parts of the system will have at least that much in common. A schema diagram is a great tool for communicating in a team.

As an error-catching mechanism
Having a well-defined schema will catch errors that would go undetected otherwise: null values where there shouldn't be, incorrectly spelled attribute/column names, values out of range, referential integrity, etc...

 Problem Example
 Invalid data Product price = true (meaningless -- should be a number)
 Missing data Line item does not have a price
 Extraneous data Line item has an extra attribute named "Color" -- we don't know what it means
 Referential integrity Order does not belong to any customer 

Discoverability - reports, other apps, etc...
An under-appreciated benefit of having a schema is also the discoverability it brings to your data. A well-defined schema means that other systems may also be able to use your data: ELT tools, reporting tools, even app generators.

For performance
A schema will make indexing easier
A schema also informs how the database retrieves your data. 

For migration
Perhaps most importantly, a schema will make migrating the data much easier. Data tends to outlive applications. Your data will have to be transformed in any number of ways over its lifetime.

As Sarah Mei recently wrote in her remarkably clear and cogent piece:

"Schema flexibility sounds like a great idea, but the only time it’s actually useful is when the structure of your data has no value." -- Sarah Mei

Disadvantages of having a schema


It takes more time up front

You can't store whatever you feel like.

You have to learn some data modeling.


There is nothing wrong about storing schema-less data if that makes sense for your particular problem. But we should stop pretending that NoSQL is the best solution for everything.
Comments