The Traditional Solution
The diagram above illustrates a common architecture known as the Lambda Architecture, which pairs a Speed Layer that processes data in real time with a Batch Layer that produces an accurate historical record. In essence, this splits the problem into two distinct components, whose results are combined at query time in the Serving Layer and delivered to the user.
"Keeping code written in two different systems perfectly in sync was really, really hard." - Jay Kreps on Lambda (LinkedIn)
While the Lambda Architecture has many advantages, including decoupling and separation of responsibility, it also has the following disadvantages:
- Logic Duplication: Much of the logic to transform the data is duplicated across the Speed and Batch Layers. This adds to system complexity and makes maintenance harder, as the code must be kept in step in two places, often using two different technologies.
- Batch Processing Effort: The Batch Layer assumes all input data is re-processed on every run. This guarantees accuracy, because code changes are applied to the full data set each time, but it potentially places a huge batch processing burden on the system.
- Serving Layer Complexity: As data is independently processed by the Batch and Speed Layers, the Serving Layer must execute queries against two data sources and combine real-time and historical results into a single answer. This adds complexity to the solution, and may rule out direct access from some dashboard tools or require additional development effort to support them.
- NoSQL Data Storage: While batch processing typically uses Hadoop/HDFS for data storage, the Speed Layer needs fast random access to data, and typically uses a NoSQL database such as HBase. This comes with huge disadvantages, including no industry-standard SQL interface, a lack of join operations, and no support for ad-hoc analytic queries.
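To make the first and third points concrete, here is a minimal, in-memory sketch of the three layers. It is an illustration only, not how any production Lambda system is implemented: the names Event, batch_view, SpeedView and serve_total are hypothetical, and a fixed cutoff timestamp is assumed to mark the boundary between data already covered by the batch run and data seen only by the Speed Layer.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    user: str
    amount: int
    ts: int  # event time, in arbitrary units (assumption for this sketch)


def batch_view(master: list[Event], cutoff: int) -> Counter:
    """Batch Layer: recompute per-user totals from the full history before `cutoff`."""
    totals = Counter()
    for e in master:
        if e.ts < cutoff:
            totals[e.user] += e.amount
    return totals


class SpeedView:
    """Speed Layer: incrementally maintained totals for events at or after `cutoff`."""

    def __init__(self, cutoff: int) -> None:
        self.cutoff = cutoff
        self.totals = Counter()

    def on_event(self, e: Event) -> None:
        # The same "sum amount per user" rule as batch_view, written a second time.
        if e.ts >= self.cutoff:
            self.totals[e.user] += e.amount


def serve_total(user: str, batch: Counter, speed: SpeedView) -> int:
    """Serving Layer: read both views and merge them at query time."""
    return batch[user] + speed.totals[user]


events = [Event("ann", 10, 1), Event("bob", 5, 2), Event("ann", 7, 11), Event("ann", 3, 12)]
cutoff = 10

batch = batch_view(events, cutoff)   # full recomputation over history
speed = SpeedView(cutoff)
for e in events:                     # in practice these would arrive as a stream
    speed.on_event(e)

print(serve_total("ann", batch, speed))  # 20 (10 from batch + 7 + 3 from speed)
print(serve_total("bob", batch, speed))  # 5
```

Even at this toy scale, the aggregation rule appears twice (once in batch_view, once in SpeedView.on_event), and every query has to touch both views. In a real deployment those two copies would typically live in different technologies, which is where keeping them in sync becomes difficult.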
When the only transformation tool available was MapReduce, with NoSQL for data storage, the Lambda Architecture was a sensible solution, and it has been successfully deployed at scale at Twitter and LinkedIn. However, more advanced (and simpler) alternatives are now available.
The NewSQL Based Solution