The Elastic Parallel Processing (EPP) technology deployed by Snowflake has all the performance advantages of MPP solutions, without the following drawbacks:
- Data Placement: Is no longer needed. Unlike MPP systems, which permanently store data in each cluster, Snowflake stores all data in a common Storage Service that is available to all clusters. When a query is executed, the Service Layer determines the necessary data placement at run time, leaving the designer and developer free to concentrate on the business challenges.
- Data Skew: Is also not an issue, because data is permanently stored separately from the compute clusters. This means that if one node in the cluster is overloaded, the other free nodes can simply pick up the work. Furthermore, there’s no need to transfer gigabytes of data between nodes; only pointers to the data need to be transferred, and data is cached in SSD storage in the compute clusters for performance.
- Secondary Indexes: Don’t place a cap on throughput, because Snowflake doesn’t have traditional B-Tree indexes. Instead, data is stored in variable-length Micro-Partitions of between 1 and 100 MB in size, and the min/max values of each column are held for every micro-partition, which supports automatic partition elimination on every column.
- Over-provisioning of storage: Is not an issue, as you simply pay for the storage used at a fixed monthly charge per terabyte, with compute processing charged by the second. This, along with the ability to natively handle structured and semi-structured data in the database, makes it a great solution for a Data Lake.
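The partition elimination mentioned under Secondary Indexes can be illustrated with a small Python sketch. This is a toy model of the idea, not Snowflake's implementation: the `MicroPartition` class, the `prune` function, and the sample rows are all hypothetical names and data invented for illustration. Each partition records the min/max value of every column, and a range filter skips any partition whose range cannot overlap the predicate.

```python
from dataclasses import dataclass, field


@dataclass
class MicroPartition:
    """Hypothetical stand-in for a micro-partition: rows plus per-column min/max."""
    name: str
    rows: list
    stats: dict = field(default_factory=dict)

    def __post_init__(self):
        # Record the min/max of every column once, when the partition is created.
        for column in self.rows[0]:
            values = [row[column] for row in self.rows]
            self.stats[column] = (min(values), max(values))


def prune(partitions, column, low, high):
    """Keep only partitions whose [min, max] for `column` overlaps [low, high]."""
    kept = []
    for partition in partitions:
        p_min, p_max = partition.stats[column]
        if p_max >= low and p_min <= high:
            kept.append(partition)
    return kept


# Illustrative data only.
partitions = [
    MicroPartition("p1", [{"order_id": 1, "amount": 10}, {"order_id": 40, "amount": 25}]),
    MicroPartition("p2", [{"order_id": 41, "amount": 5}, {"order_id": 90, "amount": 70}]),
    MicroPartition("p3", [{"order_id": 91, "amount": 300}, {"order_id": 120, "amount": 15}]),
]

# A filter such as "order_id BETWEEN 50 AND 60" only needs to read p2.
scanned_names = [p.name for p in prune(partitions, "order_id", 50, 60)]
print(scanned_names)
```

Because the min/max metadata exists for every column, the same check works for a filter on `amount` without any index having been defined up front. Pruning is most effective when the values in a column are well clustered across partitions; if every partition spans the full range of values, nothing can be skipped.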
The diagram below illustrates one of the greatest benefits of the EPP architecture: the ability to run multiple parallel compute clusters, each sized to a different workload, but potentially running against the same data. This means ELT loads can be run separately from data-intensive data science tasks, and avoid impacting the sub-second response times needed on corporate dashboards.