
Geek corner


The C in ACID

posted Nov 12, 2013, 12:57 AM by Max Tardiveau

Everyone who works with databases is familiar with the acronym ACID, which lists the attributes of a proper transaction. It should be:
  • Atomic
  • Consistent
  • Isolated
  • Durable
We all know about the A and the D -- they're relatively intuitive. Far fewer people truly understand the I, but that's for another article. Today, I'd like to focus on the C. What exactly does it mean for data to be consistent?

Consistent means that the data is in a valid state; in other words, it follows the definition of the schema. For instance, if the column is defined as NOT NULL, it shouldn't ever be null. If it's defined as a foreign key, then the referred object should always exist. The list goes on.

Many databases allow you to go further and define domains. For instance, perhaps the customer's status should be one of Bronze, Silver or Gold, or the customer's age should be between 0 and 125.
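To make this concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customer table and its columns are made up for illustration; the point is only that NOT NULL and a domain (written here as a CHECK constraint) are declared once and then enforced by the database on every insert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        -- "domains" expressed as CHECK constraints
        status TEXT NOT NULL CHECK (status IN ('Bronze', 'Silver', 'Gold')),
        age    INTEGER CHECK (age BETWEEN 0 AND 125)
    );
""")

def try_insert(name, status, age):
    """Return 'ok' if the row is accepted, else the database's error message."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT INTO customer (name, status, age) VALUES (?, ?, ?)",
                (name, status, age),
            )
        return "ok"
    except sqlite3.IntegrityError as exc:
        return str(exc)

results = [
    try_insert("Alice", "Gold", 34),     # valid row
    try_insert(None, "Gold", 34),        # violates NOT NULL
    try_insert("Bob", "Platinum", 34),   # status outside the domain
    try_insert("Carol", "Silver", 200),  # age outside the domain
]
row_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

Only the first row should be stored. The exact wording of the error messages depends on the SQLite version, so the sketch does not rely on it.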

These definitions are good and useful because they are easy to declare, and once they are declared, you don't have to think about them. The database is going to do whatever it needs to do to make sure that these definitions remain true, no matter what happens to the data.

For anything more complicated, you typically have to use triggers and stored procedures -- not that there's anything wrong with that, mind you. Triggers and stored procedures participate in transactions, and therefore are part of consistency. In fact, they can be considered to be part of the schema, if you use the term loosely.
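As an illustration of a trigger taking part in a transaction, here is a small sketch with Python's sqlite3 module. The orders and line_item tables and the trigger name are assumptions made for this example. The trigger keeps an order total up to date, and when the surrounding transaction is rolled back, the trigger's work is rolled back with it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE line_item (
        id       INTEGER PRIMARY KEY,
        order_id INTEGER NOT NULL REFERENCES orders(id),
        amount   INTEGER NOT NULL
    );
    -- Derive orders.total from the line items as they are inserted.
    CREATE TRIGGER line_item_insert AFTER INSERT ON line_item
    BEGIN
        UPDATE orders SET total = total + NEW.amount WHERE id = NEW.order_id;
    END;
    INSERT INTO orders (id) VALUES (1);
""")

def order_total():
    return conn.execute("SELECT total FROM orders WHERE id = 1").fetchone()[0]

# First transaction: insert a line item and commit.
conn.execute("INSERT INTO line_item (order_id, amount) VALUES (1, 25)")
conn.commit()
total_after_commit = order_total()

# Second transaction: insert another line item, then change our mind.
conn.execute("INSERT INTO line_item (order_id, amount) VALUES (1, 100)")
total_inside_txn = order_total()      # the trigger has already fired
conn.rollback()
total_after_rollback = order_total()  # the trigger's update is undone too
```

This handles inserts only. A real version would also need triggers for updates and deletes of line items, which is exactly the kind of bookkeeping that procedural code leaves to you.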

But of course, triggers and stored procedures are going to be vendor-dependent, and are often difficult to write and debug. In addition, they add to the database load, which can lead to scalability issues. So the non-trivial logic is often defined in the middle tier, using a language such as C#, Java or Python.

There is a big gap between declaring a schema, and writing procedural code. Defining a constraint as part of a schema is (comparatively) easy, and you don't have to explain what it means to the database. For instance, a foreign key definition will automatically cover inserts, updates and deletes. Not only that, but it's also self-documenting: everyone will know what it means.
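Here is a rough sketch of that point, again with Python's sqlite3 module and made-up customer and orders tables. A single REFERENCES clause is declared once, and the database then rejects an insert, an update and a delete that would each leave an order without a customer. Note that SQLite only enforces foreign keys once the foreign_keys pragma is turned on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id)
    );
    INSERT INTO customer (id, name) VALUES (1, 'Alice');
    INSERT INTO orders (id, customer_id) VALUES (10, 1);
""")

def rejected(sql):
    """Run one statement; return True if the database refuses it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# One declaration, three kinds of change covered:
insert_rejected = rejected("INSERT INTO orders (id, customer_id) VALUES (11, 99)")
update_rejected = rejected("UPDATE orders SET customer_id = 99 WHERE id = 10")
delete_rejected = rejected("DELETE FROM customer WHERE id = 1")
```

Nothing in the application had to spell out the three cases. Writing the same guarantee by hand would take at least three separate pieces of code.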

As soon as you start writing procedural code (whether in triggers and stored procedures, or in other languages), you're leaving all that behind and taking responsibility for a lot of things. You have to make sure that your code does the right thing at the right time, and in particular, you're responsible for managing the dependencies between the various bits of code that you may have. This problem is exacerbated when the logic governing the data is expressed in more than one place. It's not unusual to have some of that logic defined in triggers and stored procedures, some in the middle tier, and (shudder) even some in the presentation layer. Getting a global view of how all this logic works is daunting. Changing any of it can be a frightening proposition, since a seemingly innocent change may trip any number of non-obvious dependencies.

Wouldn't it be nice to be able to do more than trivial definitions as part of the schema? What if we could extend schema definition to include higher-level constructs, like complex derivations, aggregates, and multi-table validations? That wouldn't solve all of our problems, but it would allow us to work at a higher level of abstraction.

That's what database reactive programming aims for. We're pushing the declarative aspect of database schemas to a whole new level. By doing so, we want to capture more of the logic as declarations, and less as code.

Why do I need a schema anyway?

posted Nov 10, 2013, 10:38 AM by Max Tardiveau   [ updated Dec 6, 2013, 2:48 PM ]

"Art is limitation; the essence of every picture is the frame." — G.K. Chesterton

Summary

This article is not yet another argument in the tiresome SQL vs. NoSQL debate. I think both technologies have their place. This is an explanation of the benefits of using a schema when the data can benefit from it.

Most NoSQL databases store data either in key/value form, or as XML/JSON documents. In almost all cases, they lack the concept of a schema. This presents certain advantages: programmers can store any data they want, they can change how they store the data over time without migrating old data, etc. That makes sense for unstructured data, but when it comes to structured data, these advantages are offset by significant, and (I think) under-reported, downsides regarding the value of the data, and its long-term viability.

In this article, I describe how a schema can be an important asset when dealing with many types of data, and how the concept of schema can be extended to make it even more useful. 

Why a schema?

When writing software, we usually think of what the system is supposed to do. We should also think about what the software is not supposed to do.

In many ways, that's what a schema does. It's a way to define how data should behave, and how it should not behave. It's a way to draw the line between the "good" space, where data is consistent, and the "bad" space, where data is not consistent.

That is the main purpose of a schema. It's not a crutch to help the database engine. It's not an arbitrary set of limits created solely for the purpose of frustrating the programmer's creativity. It's about carving out a well-defined area in an infinite space of possibilities.

Advantages of having a schema

As a communication tool
The first advantage of having a schema is that it brings structure. This may sound tautological but I don't think it is. Having a formally defined structure for your data means that all parts of the system will have at least that much in common. A schema diagram is a great tool for communicating in a team.

As an error-catching mechanism
Having a well-defined schema will catch errors that would otherwise go undetected: null values where there shouldn't be any, misspelled attribute/column names, out-of-range values, broken referential integrity, etc.

 Problem                 Example
 Invalid data            Product price = true (meaningless -- should be a number)
 Missing data            Line item does not have a price
 Extraneous data         Line item has an extra attribute named "Color" -- we don't know what it means
 Referential integrity   Order does not belong to any customer
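To get a feel for what the schema is doing on your behalf, here is a hand-written sketch in Python of checks for the four problems in the table. Everything in it is hypothetical: the LINE_ITEM_SCHEMA dictionary, the validate function and the sample record are invented for this example, and a real database does far more than this. Without a schema, every application touching the data would need something along these lines.

```python
# Hypothetical schema for a line item: attribute name -> required Python type.
LINE_ITEM_SCHEMA = {"order_id": int, "product": str, "price": float}

def validate(record, schema, known_order_ids):
    """Return a list of problems found in one record (empty if it is clean)."""
    problems = []
    for name, expected in schema.items():
        value = record.get(name)
        if value is None:
            problems.append(f"missing data: {name}")
        # bool is a subclass of int in Python, so rule it out explicitly
        elif isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"invalid data: {name}")
    for name in record:
        if name not in schema:
            problems.append(f"extraneous data: {name}")
    order_id = record.get("order_id")
    if (isinstance(order_id, int) and not isinstance(order_id, bool)
            and order_id not in known_order_ids):
        problems.append("referential integrity: order_id")
    return problems

# A schema-less store would happily accept this record.
bad_item = {"order_id": 7, "price": True, "Color": "red"}
problems = validate(bad_item, LINE_ITEM_SCHEMA, known_order_ids={1, 2})

good_item = {"order_id": 1, "product": "Widget", "price": 9.99}
no_problems = validate(good_item, LINE_ITEM_SCHEMA, known_order_ids={1, 2})
```

For bad_item, the function reports a missing product, an invalid price, the extraneous "Color" attribute, and an order_id that points at no known order.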

Discoverability - reports, other apps, etc...
An under-appreciated benefit of having a schema is the discoverability it brings to your data. A well-defined schema means that other systems may also be able to use your data: ELT tools, reporting tools, even app generators.

For performance
A schema will make indexing easier.
A schema also informs how the database retrieves your data. 

For migration
Perhaps most importantly, a schema will make migrating the data much easier. Data tends to outlive applications. Your data will have to be transformed in any number of ways over its lifetime.

As Sarah Mei recently wrote in her remarkably clear and cogent piece:

"Schema flexibility sounds like a great idea, but the only time it’s actually useful is when the structure of your data has no value." -- Sarah Mei

Disadvantages of having a schema


It takes more time up front.

You can't store whatever you feel like.

You have to learn some data modeling.


There is nothing wrong with storing schema-less data if that makes sense for your particular problem. But we should stop pretending that NoSQL is the best solution for everything.

The C in ACID

posted Jul 26, 2013, 3:45 AM by Max Tardiveau   [ updated Jul 26, 2013, 4:04 AM ]

What does it mean to be consistent?




I've been thinking about ACID recently. ACID describes how transactions should behave in databases. It means:
  • Atomic: everything in a transaction goes in, or nothing goes in, but nothing in-between
  • Consistent: more on that in a minute
  • Isolated: the user should have the illusion (as much as possible) that s/he is the only one using the database; other concurrent transactions should (ideally) not interfere.
  • Durable: whatever the transaction does, if it succeeds, should be permanent.
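The "everything or nothing" part can be sketched in a few lines with Python's sqlite3 module. The account table and the transfer function are assumptions made for this example. A transfer is two updates; if the second one fails, the first one must not survive either.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (
        id      INTEGER PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    );
    INSERT INTO account (id, balance) VALUES (1, 100), (2, 50);
""")

def transfer(src, dst, amount):
    """Move amount from src to dst; return False if the database refuses."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances():
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))

first_ok = transfer(1, 2, 30)     # fine: both updates are committed together
after_first = balances()
second_ok = transfer(1, 2, 500)   # the debit would make the balance negative
after_second = balances()         # unchanged: the credit was rolled back too
```

The credit is deliberately applied first, so that the failed transfer has something to undo. After the rollback, account 2 does not keep the 500.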

The A is what most programmers think of when they think of transactions. That's because most programmers have an appallingly limited knowledge of this stuff. Atomicity is important, obviously, but it's not the be-all and end-all of transactions, not by a long stretch.
The D is (rightfully) taken for granted by most people, so let's not worry about it.
The I is poorly understood by most programmers, but that's not relevant here, so let's ignore it.
The C is what I'd like us to think about, because it's always been a bit fuzzy.

What does it mean for data to be consistent? Consistent with what? So far, for almost all relational databases, it means that the data must conform to the schema:
  • data types must be respected -- e.g. no putting strings in numeric columns
  • required columns cannot be null
  • foreign keys must be valid
  • some very limited additional validation may be possible, depending on the database, e.g. age should be between 0 and 120, sex must be M, F or O, etc...

But that's about it [1]. If you want to go beyond that, you have to use triggers, stored procedures and similar mechanisms. Wikipedia says about consistency:

"This does not guarantee correctness of the transaction in all ways the application programmer might have wanted (that is the responsibility of application-level code) but merely that any programming errors do not violate any defined rules."

And that's where the industry has been for the past 20 years: you take a database system, and you put a layer of code on top of it. It's difficult, it's expensive, and goodness help you if your logic changes.

That is the disconnect that Espresso Logic solves. We offer a solution that provides all the advantages of relational databases, but with a much higher level of declarative consistency. We make it possible to think about data at a higher level.

We can be the C in ACID. We can bring deep consistency to the database, not the superficial consistency that is still the standard today. And we can make it easy.



1 - Note that NoSQL databases don't even have most of these; they're even more of a free-for-all, which is fine for many applications, but hopelessly inadequate for many others.
