You probably forget too much, too soon and way too definite

eraser_1If you felt spoken to by the blog title, you might relax a bit: I didn’t mean you, but your application. And I don’t suggest that your application forgets things, but rather removes them deliberately. My point is that it shouldn’t be able to do so.

A disaster waiting to happen

Try to imagine a child that is given a sharp scissors to play with by its parents. It runs around the house, scissors in hand, cutting away things here and there. Inevitably, as mandated by Murphy’s law, it will stumble and fall, probably hurting itself in the process. This scenario is a disaster waiting to happen. It is a perfect analogy of your application as long as it is able to perform the delete operation.

A safe environment

Now imagine an application that is forbidden to delete data. The database user used by the program is forbidden to issue the SQL DELETE command. This is like the child in the analogy before, but with the scissors taken away. It will leave a mess behind while playing that needs an adult to clean up periodically and it will fall down, but it won’t stab itself. If you can run your application in such a restricted environment, you can guarantee a “no-vanish” data safety: No data element that is stored will disappear ever.

Data safety

In case you wonder, this isn’t the maximum data safety level you can (and perhaps should) have. There is still the danger of accidental alteration, where existing data is replaced or overwritten by other data. To achieve the highest “no-loss” data safety level, you need to have a journaling database system that tracks every change ever made in an ever-growing transaction log. If that sounds just like a version control system to you, it’s probably because it essentially will be such a thing.
But “no-vanish” data safety is the first and most important step on the data safety ladder. And it’s easy to accomplish if you incorporate it into your system right from the start, implementing everything around the concept that no deletion will occur on behalf of the system.

Why data safety?

But why should you attempt to adhere to such a restriction? The short answer is: because the data in your system is worth it. We recently determined the immediate monetary worth of primary data entries in one of our systems and found out that every entry is worth several thousand euros. And we have several hundred entries in this system alone. So accidentally deleting two or three entries in this system is equivalent to wrecking your car. Who wouldn’t buy a car when the manufacturer guarantees that wreckages cannot happen, by design? That’s what data safety tries to achieve: Giving a guarantee that no matter how badly the developers wreck their code, the data will not be affected (at least to a degree, depending on the safety level).

No deletion

The best way to give a guarantee and hold onto it is to eliminate the root cause of all risks. In our digital world, this is surprisingly easy to accomplish in theory: If you don’t want to lose data (by accidental removal), prohibit usage of the delete operation on the lowest layer (probably the database). If your developers still try to delete things, they only get some kind of runtime error and their application will likely crash, but the data remains intact.

Implementation using a RDBMS

If you are using a relational database system, you should be aware that “no-vanish” safety comes with a cost. Every time you fetch a list of something from your database, you need to add the constraint that the result must only contain “non-obsolete” entries. Every row in your main tables will adopt some sort of “obsolete” column, housing a boolean flag that indicates that this row was marked as deleted by the application. Remember, you cannot delete a row, you can only mark it deleted using your own mechanisms.
The tables in your database holding “derived data”, like join tables or data only referenced by foreign keys, don’t necessarily need the obsolete flag, but will clutter up over time. That’s not a problem as long as nobody thinks that all entries in these tables are “living data”. The entries that are referenced by obsolete entries in the main tables will simply be forgotten because they are inaccessible by normal means of data retrieval.

The dedicated hitman

Using this approach, your database size will grow constantly and never shrink. There will come the moment when you really want to compact the whole thing and weed out the unused data. Don’t give the delete right back to your application’s database user! You basically don’t trust your application with this (remember the child analogy). You want this job done by a professional. You need a dedicated hitman. Create a second database user that doesn’t has the right to alter the data, but can delete it. Now run a separate job (another application) on your database within this new context and remove everything you want removed. The key here is to separate your normal application and the removal task as far as possible to prohibit accidental usage. If you think you’ve heard this concept somewhere before: It basically boils down to a “garbage collector”. The term “hitman” is just a dramatization of the bleak reality, trying to remind you to be very careful what data you really want to assasinate.

Implementation using a graph database

If this seems like a lot of effort to you, perhaps the approach used by graph databases suits you better. In a graph database, you group your data using “graph connections” or “edges”. If you have a node “persons” in your database, every person node in the database will be associated to this node using such a connection. If you want to remove a person node (without throwing the data away), you remove the edge to the “persons” node and add a new edge to the “deleted_persons” node. You can probably see how this will be easier to handle in your application code than ensuring that the obsolete flag is considered everywhere.

Conclusion

This isn’t a bashing of relational database systems, and it isn’t a praise of graph databases. In fact, the concept of “no deletion” is agnostic towards your actual persistence technology. It’s a requirement to ensure some basic data safety level and a great way to guarantee to your customer that his business assets are safe with your system.

If you have thoughts on this topic, don’t hesitate to share them!

2 thoughts on “You probably forget too much, too soon and way too definite

  1. Great post, didn’t think about it this way before, thanks! Maybe one could implement a periodical delete job which removes entries that are marked for deletion and older than x years? Have you ever had to think about information lifecycle management? There is a sap tool out there which lets you create a set of rules to determine the lifetime of data and if archivation is necessary. The data that does not need to be archied is simply deleted after the specified time. The data to be archieved is written on WORM-like storages with a time stamp when it has to be deleted (e.g. for employee data). The deletion on WORM-like storages is a special operation that needs to be supported by the hardware. Quite complex thing and (mostly) only relevant for bigger enterprises…
    May the force be with you,
    Chris

  2. Pingback: Thoughts on Design | Schneide Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s