A Time Series Database – how do I choose?
A Time Series Database (TSDB) is, simply put, a software system designed to handle time series data as efficiently as possible. The bigger question is, what is a time series? The answer is that it is a series of data points arranged in time order, usually captured at regular intervals.
The TSDB, then, collects, analyzes, plots and otherwise decides what to do with the data, based on the predefined algorithms of the database used. Outputs from such systems fall under a range of basic descriptive terms, depending generally on the type of data being collected over time, such as profiles, curves, traces, etc. For example, a time series for student grades is often referred to as a student grades time series table, while one used in Ireland for analyzing learning outcomes is called time series clustering. We often hear terms such as price curve, bell curve, load profile, etc., all of which may be used to describe time series in different areas.
In fact, the same multiplicity of operations is performed when analyzing a wide variety of these time series. TSDBs are optimized for these operations, where other systems may not be practical. TSDBs impose their model on the time series, rather than the other way around.
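As a rough illustration of the kind of operation a TSDB is optimized for, here is a minimal sketch in plain Python that downsamples a time series into fixed-width buckets and averages each bucket. The function name `downsample_mean` and the sample readings are hypothetical and are not taken from any of the databases discussed here; a real TSDB would do this inside its storage engine.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def downsample_mean(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    sums = defaultdict(lambda: [0.0, 0])  # bucket start (epoch s) -> [total, count]
    for ts, value in points:
        epoch = int(ts.timestamp())
        bucket_start = epoch - (epoch % bucket_seconds)
        acc = sums[bucket_start]
        acc[0] += value
        acc[1] += 1
    # Emit buckets in time order, one averaged point per bucket.
    return [
        (datetime.fromtimestamp(bucket, tz=timezone.utc), total / count)
        for bucket, (total, count) in sorted(sums.items())
    ]


start = datetime(2016, 4, 9, 12, 0, 0, tzinfo=timezone.utc)
# Hypothetical sensor readings, one every 20 seconds, with values 0.0 to 5.0.
readings = [(start + timedelta(seconds=20 * i), float(i)) for i in range(6)]

per_minute = downsample_mean(readings, 60)
for ts, avg in per_minute:
    print(ts.isoformat(), avg)
```

Six readings collapse into two one-minute averages. The same bucketing idea applies whether the series is a price curve or a load profile, which is why a purpose-built system can optimize for it once.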
With the above points in mind, what kind of choices do I have for my particular needs?
The short answer is that there are a lot of choices. Well, as of April 9th, 2016, there were… well… you can count them here! You will notice if you scroll down to the comment section there that even then, which is ancient history in Internet time, people were shouting out to include others. Compare that list to this one to see how much has changed in just months.
Therefore, it might be very handy to have a short list of the top time series databases to choose from, for your IoT project for example. Please remember that grading anything is somewhat subjective. To quote Steven Acreman, who has more than 12 years’ experience in Operations and DevOps,
“Databases are a crazy topic and it seems everyone has an opinion. The trouble is that opinions are like belly buttons. Just because everyone has one it doesn’t mean they are useful for anything.”
Everyone has their own idea of what makes something better or best, so please decide for yourself what will serve you most efficiently.
One point I would really like to raise right here is this: Of those who have taken the time to do the comparisons, it has been determined that time series databases built from scratch are much faster than those sitting on very popular non-purpose-built databases such as Hadoop, Riak KV or Cassandra. If you have an issue with this analysis, please share it in the comments! Remember that these are all open source time series databases, which means, in my opinion, that they are all developed by artists who do it for their love of programming.
This is a top ten list, but it isn’t an ordered list. It would be interesting to get some comments below to see how readers think they should be ordered. It would be even more interesting to see if anyone agrees with every entry on this list, and as Spock would say, “Fascinating,” if two people agree on the order…
InfluxDB scores right up near the very top on several software blogs, making it into the top ten multiple times.
Druid scores within the top 10 time series databases, again on multiple lists.
Riak TS is again in the top 10 on several sites. Interestingly enough, they advertise themselves as being engineered to be faster than Cassandra. Isn’t that interesting?
Prometheus has been around forever in Internet time, and it still manages to rank right up there among users. Users say it may need a few tweaks, due to the fact it wasn’t specifically designed as a time series database, but it’s still a powerful option.
Graphite, which Prometheus compares itself to (see above), is a top ten pick, obviously ranking very high in the opinion of the engineers at Prometheus! It also ranks high on several other very respected sites. The crew at Graphite must have a real sense of humor. The statement on their website reads, “Graphite does three things: Kick *ss. Chew bubblegum. Make it easy to store and graph metrics. (And it’s all out of bubblegum.)” It’s almost worth choosing them for the fun you could have with their crew! They also say that it runs equally well on cheap hardware or in the Cloud. Graphite has been around since 2006, making it almost prehistoric. The fact that it’s still here almost makes it a contender for the top ten list on that fact alone.
OpenTSDB presents itself as The Scalable Time Series Database, with the ability to “store and serve massive amounts of time series data without losing granularity.” Multiple users who have written about this rank it somewhere in their personal top ten.
Elasticsearch has also landed in the top ten on several software blogs, including, but not limited to, the Netsil Inc blog, which compared time series databases against Druid, the one Netsil uses. Netsil also gave high marks to Cassandra (see above).
DalmatinerDB gets top marks from me because of the dalmatian they have on their home page. Seriously though, if you want blazing speed in a reliable database, this one does deserve to be in the top ten. Plus they like dogs. They might be at or near the top on everyone’s list once they’ve been around a little longer.
Blueflood is built by the Rackspace engineers. They lovingly call it “a giant distributed calculator that loves numbers.” Blueflood actually uses Cassandra, among other things, because of its high write throughput (a peak of 60,000 points/sec on a single box) and its very reliable support, and it uses Elasticsearch as an index. It is billed by some as being a decent replacement for Graphite.
Cassandra is old and slow, but it’s still the standard by which many others measure themselves or are measured. While it is WAY slower than some of the TSDBs available now, a LOT of engineers still like using it for comparison purposes. It is also still used as a starting point for new databases.
Scylla is one to look at. There aren’t enough people talking about it, yet, but dang… it looks good. It’s billed as the world’s fastest NoSQL database. According to their website and also anyone I’ve found who has tried it, it’s “fully compatible with Apache Cassandra at 10x the throughput and jaw dropping low latency.” It’s also listed on MISFRAME as a much faster C++ implementation of Cassandra. Every comment I’ve read on Scylla has been positive, and they all say it really is exceptionally fast. It isn’t in the top ten on this list, but it might be soon!
Please leave your comments below, and tell us what your thoughts are on this list.
OpenTSDB is my first « go to » TSDB, probably because, as you mention in the article, it is created by a community of software « artists » who will never settle for good enough.
The unique characteristic of a time series database is that it provides various features based on time. That is, the DBMS essentially performs calculations and executes analytical queries at high speed. Clients are asking not just for time series data to be processed, but for data insertion and analysis to be fast as well.
Demands from clients:
1. High-performance data insertion in real-time
2. High-performance data analysis in real-time
These requests are fairly reasonable these days, as the number of IoT devices that create time series data is rapidly increasing. When the client asks for high-performance data insertion (from as few as tens of thousands to as many as hundreds of thousands of records per second), the issue is how to process that much data. In addition, data needs to be compressed in real time to conserve storage. Some solutions, such as Splunk, call this “machine data,” a special term referring to time series data. The reason behind the demand for high-performance data analysis is that clients want to store data at high speed and analyze it easily. It doesn’t take a genius to see that standard SQL is needed, and flexible SQL queries would be ideal for retrieving results while tens of thousands of records per second are being inserted. It is even better to have “partition pruning,” which greatly improves performance when analyzing data by time, because only the partitions that overlap the queried time range need to be read. In general, the amount of data for a month is well over tens of billions of records. Thus, a disk-based database must process a massive amount of data. Unfortunately, there is no conventional database on the market that can satisfy both requirements, and that is why new time series database solutions are entering the market.
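To make the “partition pruning” idea above concrete, here is a minimal sketch in plain Python, assuming data is partitioned by calendar day. The class `DailyPartitionedStore` and its methods are hypothetical names made up for illustration and do not correspond to any particular product’s API; real databases apply the same idea to on-disk partitions.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


class DailyPartitionedStore:
    """Toy in-memory store that keeps one partition per calendar day."""

    def __init__(self):
        self._partitions = defaultdict(list)  # date -> list of (ts, value)

    def insert(self, ts, value):
        self._partitions[ts.date()].append((ts, value))

    def query(self, start, end):
        """Return (points with start <= ts < end, number of partitions scanned)."""
        hits = []
        scanned = 0
        day = start.date()
        # The end bound is exclusive, so a range ending at midnight
        # does not need the following day's partition.
        last_day = (end - timedelta(microseconds=1)).date()
        while day <= last_day:
            part = self._partitions.get(day)
            if part is not None:
                scanned += 1  # only partitions overlapping the range are read
                hits.extend(p for p in part if start <= p[0] < end)
            day += timedelta(days=1)
        return sorted(hits), scanned


store = DailyPartitionedStore()
first = datetime(2016, 4, 1, tzinfo=timezone.utc)
# Hypothetical data: one point per hour for 30 days.
for i in range(30 * 24):
    store.insert(first + timedelta(hours=i), float(i))

day_start = datetime(2016, 4, 10, tzinfo=timezone.utc)
points, scanned = store.query(day_start, day_start + timedelta(days=1))
print(len(points), "points found;", scanned, "of", len(store._partitions), "partitions scanned")
```

A one-day query touches a single partition out of thirty, so the work scales with the size of the queried range rather than with the total amount of stored data.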