Not long ago, I’ve struggled with a system that uses timestamps as entity identifiers. What can I say? Timestamps aren’t meant to identify anything else than a specific point in time. Don’t use them as entity identifiers, ever. If you want to know why, I invite you to read on. The blog post is written in Freytag’s dramatic structure for added effect.
We’ve designed a system that runs on multiple instances that communicate in all sorts of way. A central archive instance stores all data related to measurements. The whole network revolves around the notion of measurement. Measurement data is the most precious data and all instances will either produce or consume data based on these measurements.
Most important for human operators is an instance that lets you view all existing measurement data. Let’s call it the viewer. The viewer displays an overview list of all measurements in a given context and lets the operator choose to view ever more details of any of them. To be able to provide the overview list as fast as possible, we added a cache that holds the information.
This measurement list cache was the source of all kinds of peculiar behaviour of the system. Most, but not all measurement data was incomplete. The list cache entries were assembled from different sources that were available at different times, so it seemed that while one part of the data got written to the cache, another part couldn’t be written for whatever reasons. The operator could load detailed data of some few measurements, but the majority just produced an error message that the data couldn’t be found (despite it being present).
The most obviously broken functionality left the following trace in the log files (paraphrased):
- storing measurement at 2016-02-28T13:25:55.189+01:00 into the list cache - measurement stored [...] - loading measurement at 2016-02-28T13:25:55.189+01:00 from the list cache - error: measurement not found in list cache
So, the system is essentially telling me that it can’t load some data it just stored. As you can imagine, this may lead to some questions about the sanity of the database product underneath.
After some investigation and fruitless integration testing, it dawned me: The problem wasn’t timing or the database. All the bugs could be explained with only one circumstance: Measurements were ultimately identified by their timestamp, the moment the measurement was made. There’s also a location, type and some other information in the identifier for each measurement, but only the timestamp changes between two measurements in the same narrow context. And the timestamp was stored in different precisions, depending on the origin of the measurement identifier. Most identifiers were create at the measurement producing system instances (let’s call them measurers) and had millisecond precision. As soon as they got stored in the production database (but not our development database), they lost the milliseconds. And some of the most important measurement data got exported to third-party systems, using a minute-based precision. So we had one measurement identifier in the system, but with three different types, each mostly incompatible to each other.
That’s why the log excerpt above never occurred in development, but in production: The measurement is stored in the database, the used identifier gets passed around in the software, but a query on the exact same identifier in the database yields no result because the timestamps now differ in the millisecond range. And the strange effects that sometimes, everything worked just fine? That’s when the milliseconds are zero by chance. Given that most actions in the system are scheduled and performed automatically exactly on the zero mark, the zero milliseconds case happened more often than it would in an even distribution.
Our system dealt with three types of measurement identifiers: Millisecond-precise identifiers produced by the measurers, second-precise identifiers used by the measurement list cache and minute-precise identifiers used (and sometimes fed back into the system) by the data export. These identifiers were incompatible even for the same measurement most of the time, but not always. In unit tests, the timestamps were made up and didn’t reveal the problem properly (who thinks about odd milliseconds when making up a timestamp?).
My solution was to pull this incompatibility up into the type system. Instead of one measurement identifier, there are now three measurement identifiers: MillisecondPreciseIdentifier, SecondPreciseIdentifier and MinutePreciseIdentifier. An identifier of higher precision can be converted to an identifier of lower precision, but not the other way around. Everytime a measurement identifier is created, it needs to explicitely state its precision of the timestamp. This made the compiler highlight the problematic usages clearly as type conflicts and therefore dealing with the problem much easier.
Choosing a timestamp as a vital part of a (measurement) identifier was a mistake from the beginning. The greater problem was the omission of the timestamp’s precision. Timestamps perform more like floating-point numbers and less like integers, even if every timestamp can be represented by a long. As soon as I made the precision of each timestamp clear to the compiler, the bugs revealed themselves. The annoying difference between developer and production database would have been detected much sooner, because a millisecond-precise timestamp will now warn in the log files if its millisecond part is zero. As soon as this log entry is seen very often, its clear that something is wrong. The new datatypes not only serve as a clearer API contract definition tool, but also as a runtime sanity check.
If you don’t want to repeat this mistake, keep in mind that each timestamp, date or whatever time-related data type you use will inherently have a maximum precision. As soon as you mix different precisions into the same data type, you’re going to have a bad time. Explicitely state the required precision in your type system and your compiler will keep an eye on it, too.