Hadoop has several other potential advantages over a traditional RDBMS, most often summed up by the three (and growing) Vs:
- Volume – Its distributed, MPP-style architecture makes it ideal for handling large data volumes. Multi-terabyte data sets are automatically partitioned (spread) across many servers and processed in parallel.
- Variety – Unlike an RDBMS, where you must define the structure of your data before loading it, loading data into HDFS can be as simple as copying a file – which can be in any format. This means Hadoop can just as easily manage, store and integrate data from a database extract, a free-text document, XML files or even digital photos.
- Velocity – Again, the MPP architecture and the powerful in-memory and streaming tools of the wider Hadoop ecosystem (including Spark, Storm and Kafka) make it an ideal solution for real-time or near-real-time feeds that arrive at velocity. This means you can deliver analytics-based solutions in real time – for example, using predictive analytics to recommend options to a customer.
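The Volume point above can be illustrated with a toy MapReduce-style word count: the data set is split into partitions, each partition is counted independently (and in parallel), and the partial results are merged. This is a minimal Python sketch of the idea using threads on a single machine – it is not Hadoop code; real Hadoop spreads the partitions across HDFS blocks on many servers.

```python
# Toy illustration of partition-and-process-in-parallel (the "Volume" idea).
# Not Hadoop code: threads on one machine stand in for a cluster of servers.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_partition(lines):
    """The 'map' step: count words in one partition, independently of the rest."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def word_count(lines, partitions=4):
    # Split (partition) the data, much as HDFS splits a file into blocks.
    chunks = [lines[i::partitions] for i in range(partitions)]
    # Process each partition in parallel.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(map_partition, chunks)
    # The 'reduce' step: merge the per-partition results.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

data = ["big data is big", "data moves fast"] * 1000
counts = word_count(data)
print(counts["big"])   # 2000
print(counts["data"])  # 2000
```

The key property is that the map step touches each partition independently, which is exactly what lets a Hadoop cluster scale out simply by adding servers.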
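To make the Velocity point concrete, here is a pure-Python sketch of a tumbling-window aggregation – the kind of continuous computation engines such as Spark or Storm run over a live feed at far greater scale. The event stream and window size are invented purely for illustration.

```python
# Sketch of a tumbling-window count over an event stream (the "Velocity" idea).
# A stand-in for what a streaming engine does continuously against a live feed.
from collections import Counter

def tumbling_window_counts(events, window_size):
    """Yield per-window counts of event keys (e.g. page views per interval)."""
    window = Counter()
    for i, key in enumerate(events, 1):
        window[key] += 1
        if i % window_size == 0:      # window is full: emit and start a new one
            yield dict(window)
            window = Counter()
    if window:                        # emit any trailing partial window
        yield dict(window)

stream = ["home", "cart", "home", "buy", "home", "cart"]
for counts in tumbling_window_counts(stream, window_size=3):
    print(counts)
# → {'home': 2, 'cart': 1} then {'buy': 1, 'home': 1, 'cart': 1}
```

Each emitted window could feed a real-time dashboard or a recommendation model, which is the pattern behind the predictive-analytics example above.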
The advent of the cloud brings an even greater advantage (although not another “V” in this case): elasticity.
That’s the ability to scale on demand, using cloud-based servers to absorb unexpected or unpredictable workloads. Entire networks of machines can spin up as needed to tackle massive data-processing challenges, while hardware costs are contained by a pay-as-you-go model. Of course, in a highly regulated industry (e.g. financial services) with highly sensitive data, the public cloud may well be treated with suspicion, in which case you may want to consider an “on-premises” (private) cloud solution to keep your data in-house.