In may previous posting I compared three of the most popular NoSQL databases to queries against native Hadoop and HDFS. In this article, I'll provide a more detailed description of the strengths and limitations Cassandra.
If you haven't already, it may be worth reading my articles on Hadoop or NoSQL first.
Why use Cassandra?
Cassandra is a highly scalable, distributed database originally built at Facebook. It's now a top-level Apache open source project with widespread adoption, and is deployed at scale by Netflix, eBay, Spotify and Apple.
It's main strengths are:-
Ease of use: Accessed via a sub-set of SQL (CQL), it supports Select, Insert, Update and Delete operations against a Table and Column structure familiar to many database developers.
Built for Volume and Velocity: A Netflix Benchmark test demonstrated around 2.8 Billion operations per hour using a mixed read/write workload. It managed an average write latency of just 1.75 milliseconds, and an average throughput of 200,000 writes per second against a total of 70 Terabytes of data.
Built for Scalability on Premises or the Cloud: While you can built your own in-house cluster, Cassandra is well supported by the Amazon Elastic Compute platform. The entire Netflix benchmark ran on 285 nodes on the EC2 platform costing around $400 per hour. Then zero, once the operation completed.
Transparent Node Balancing: Data is automatically sharded (distributed) across nodes in the cluster for resilience. As demand grows, and new nodes are added, the system transparently distributes the data to the new servers. On a cloud based deployment this could potentially be automated providing elastic scalability.
Highly Fault Tolerant: Many NoSQL databases (eg. HBase) use a "master" node for write operations with an optional fail-over capability for recovery. Cassandra works in a "ring" topology with no single point of failure. Writes are automatically replicated to three (or more) nodes in a cluster, and data can be replicated to another data centre in real time providing built in disaster recovery.
Hadoop Integration: Cassandra works well with Hadoop and HDFS and is accessible via Apache Spark with support for multiple languages including Java, Python, Ruby and C++.
Geographically Distributed Systems
The diagram above illustrates a globally distributed system whereby users gain millisecond performance on a cluster of machines at each location, while results are automatically replicated globally. Using Amazon EC2, if the Singapore based system crashed, the workload could automatically resume using London and New York based servers at a reduced speed.
This benefit was illustrated in April 2011 when a major US-East outage on Amazon left many web sites down, but Netflix operations running Cassandra were unaffected because of cross-regional replication in place.
Why Avoid Cassandra?
Like any specialised, highly tuned tool, Cassandra is built to solve a specific problem. High volume, low latency read/write lookups from a massive user base. It does however have significant limitations:-
No Transaction Support: An RDBMS like Oracle provides full read consistency and transaction isolation which is absent from nearly all NoSQL databases. On Cassandra, every write operation is immediately visible to all users, and this has massive implications for many applications. Caveat Emptor - Buyer beware.
Joins not supported: Cassandra is not a relational database, and join operations are not supported. It does support simple parent-child relationships using compound keys, but database design must be focussed on the query access paths, using extensive data denormalisation to maximise performance.
Not suitable for OLAP: Designed for fast primary key lookups, it's not suitable for data warehouse type queries which aggregate data over millions of rows. While it supports secondary indexes, they're not recommended for high cardinality (unique) values as performance degrades rapidly. This compares poorly to a traditional RDBMS which often handle a mixed workload with ease.
Limited Sort operations: Most read operations focus on a single row, and there's no option to sort results based upon an arbitrary column. It is possible to store and retrieve data in a predefined key sequence, but again, this is inflexible compared to a typical RDBMS.
Cassandra and Eventual Consistency
An Oracle database supports immediate consistency as there's only one copy of the data. However, on a distributed database, the data is replicated to multiple independently running nodes, and to maximise throughput, we may relax the consistency rule, and settle for eventual consistency.
The diagram below illustrates the situation where an update is applied to one node, and eventually written to the two other replicas. In the interim (maybe a few microseconds), a reader can potentially read the old (stale) copy of the data. Eventually, all three replicas will have consistent results - hence the name.
While not the best solution in every case, the Netflix Benchmark demonstrates, the benefits of relaxing the consistency rule, as read throughput increased from 600,000 to nearly a million reads per second. This clearly illustrates the trade off of performance against consistency and resilience. If you don't need the guaranteed absolute latest entry, it's a significant performance gain.
Cassandra is also highly flexible, and supports 11 levels of consistency. The most commonly used are:-
ONE - Eventual Consistency: For maximum performance on reads this returns the first available version from the three (or more) replicas. On writes, the operation is complete once written to one node which provides write throughput at the expense of resilience.
QUORUM - Mid ground: A balance of performance and resilience, this marks a write operation complete as soon as a majority (two of the three) replicas are written, and combines data from multiple replicas on read, taking the latest data available.
ALL - Maximum Resilience: The slowest option, this favours resilience over performance as a write must be written to all replicas before it's marked complete.
Note: If an operation is written using ALL consistency, it can be safely read using ONE which maximises read throughput over writes. This gives huge flexibility (and opportunity for hard to identify bugs), but it's a powerful option to have available.
Warning: Transactions not supported
Although Cassandra has flexible support for consistency, this must not be confused with "read consistency" or support for transactions - a feature often taken for granted on traditional database platforms.
Effectively on Cassandra, every write operation is implicitly committed, and immediately visible to all readers without delay.
Take a simple example. An application which transfers money from a savings to a current account would typically involve two operations committed in a single transaction. One to deduct from the savings account, and another to credit the current account. In Cassandra, another user could read the balance of both accounts mid-way through with potentially catastrophic results.
Cassandra does support a lightweight test and set operation to set a value depending upon a value, and the ability to batch load multiple inserts, but this provides at most, a rather basic level of lightweight transaction support.
The NewSQL alternative
Since originally writing this article I've discovered a new class of database - the NewSQL database. This has all the low latency, high velocity advantages of a NoSQL database, but with the added advantage of full ACID compliance, and the flexibility of the relational data model. You can see more about this in the article comparing Oracle, NoSQL and NewSQL database technology.
In conclusion, if you have a web or sensor streaming based application with a massive or highly unpredictable workload needing very low latency read/write operations using mainly unique keys Cassandra is an excellent option. If you can deploy a cloud based solution, and need potentially massive scalability, all the better.
If however you're building a system where data accuracy and consistency is important, you may want to consider the range of NewSQL databases including VoltDB, MemSQL and NuoDB.
If however you have a Data Warehouse workload, summarising billions of rows, a database like Vertica might be a better option. Likewise, if you need full transactional support you should look elsewhere - perhaps at VoltDB.
There are over 150 NoSQL databases available, many of which support massive throughput and scalability. However, where Cassandra is unique is in providing tuneable consistency along with high level of resilience and native replication across a potentially globally distributed system.
Provided you can work within the lack of transaction support and eventual consistency, it's a potentially powerful tool to deploy.