Snowflake Spilling to Storage in Plain English

This article will demonstrate how a simple Snowflake virtual warehouse change can lead to queries running over a hundred times faster and 40% cheaper.  Plot spoiler - it involves spilling to storage.

What is Spilling to Storage?

The diagram below illustrates the internals of a Snowflake X-Small Virtual Warehouse, which includes eight CPUs, memory, fast SSD local storage and remote cloud storage.

Snowflake Virtual Warehouse Internals

Spilling occurs when Snowflake needs to sort data, including executing an ORDER BY, but more often because of a GROUP BY.  Initially, Snowflake attempts to complete the sort operation in memory, but once that's exhausted, data is temporarily split to local storage.  For huge sorts, even the local SSD can be used up, in which case data will be spilt to remote cloud storage.

The key point is main memory access is much faster than SSD, which is considerably faster than cloud storage.  To put this in context, if a single CPU cycle took one second, access to main memory takes around 6 minutes, access to SSD around three days and access to cloud storage up to a year.

Spilling to storage can have a significant impact on query performance.

Benchmarking Spilling to Local Storage

To demonstrate the impact of spilling, I executed the same query against a range of virtual warehouse sizes from X-Small to X4-Large and recorded the results.  The query below fetches and sorts over 288 billion rows, sorting around a terabyte of data.

select ss_sales_price
from snowflake_sample_data.tpcds_sf100tcl.store_sales
order by ss_sales_price;

The table below shows the improvement in execution time as we increased the warehouse size.  It shows a speed improvement of over 140 times from 7.5 hours to 3 minutes while the cost remained around the $20 level.

The graph below illustrates the reduction in execution time as the warehouse size is increased.  It again demonstrates run time is reduced by 50% each time.

This is consistent with previous tests which show that, provided the query is large enough, increasing the warehouse size produces the same results twice as fast for the same cost.

Benchmarking Spilling to Remote Storage

As described above, spilling to remote storage is even slower but needs a monster query.  The SQL statement below was executed on varying warehouse sizes and the results recorded.

SELECT ss_sales_price
From snowflake_sample_data.tpcds_af100tcl.store_sales
order by SS_SOLD_DATE_SK, SS_SOLD_TIME_SK, SS_ITEM_SK, SS_CUSTOMER_SK, 
         SS_CDEMO_SK, SS_HDEMO_SK, SS_ADDR_SK, SS_STORE_SK, SS_PROMO_SK, SS_TICKET_NUMBER, SS_QUANTITY;

The table below shows the results of running the above query, which sorts ten terabytes of data on varying warehouse sizes.  It shows the query running over a hundred times faster on an X4LARGE while the query cost fell from over $150 to $95, a 40% reduction.

Snowflake Spilling to Storage by Warehouse Size

One interesting point worth noting, is that when the execution difference is greater than 100%, this indicates the query cost is cheaper at each step as the charge rate is doubled each time, but the query runs more than twice as fast.  Clearly spilling to remote storage adds to both run time and cost as the query cost levels out after an X2LARGE as remote spilling is eliminated.

Best Practices to Reduce Snowflake Spilling

Effectively, there are two approaches to reduce spilling to storage on Snowflake:

  1. Reduce the volume of data sorted.

  2. Scale up the warehouse size.

My article on tuning Snowflake query performance includes several techniques to reduce the data fetched, which will in turn, reduce the volume of data sorted and help eliminate spilling to storage.

One of the easiest ways of improving query performance and avoiding spilling is to include a LIMIT clause in the query.  To demonstrate this, the following query was executed without and then with a LIMIT clause, which returned the top 10 entries or all entries.

select * 
from store_returns 
order by sr_reason_sk 
limit 10;

The screenshot below shows that including a LIMIT improved query performance by over 60 times from over two hours to under two minutes.

Snowflake LIMIT to reduce Spilling to Storage

It's worth noting that the LIMIT clause was pushed down into the query, halved the partitions scanned, and eliminated both local and remote spilling.  Finally, both queries were executed on an X-Small warehouse, demonstrating the potential cost saving of this simple change

How do you identify Spilling to Storage?

The query below will quickly identify each warehouse size and extent of spilling to storage.  As this is based upon query-level statistics, it is possible to drill down to identify specific SQL.

select warehouse_name
,      warehouse_size
,      round(avg(total_elapsed_time)/1000) as elapsed_seconds
,      round(avg(execution_time)/1000) as execution_seconds
,      count(iff(bytes_spilled_to_local_storage/1024/1024/1024 > 1,1,null)) as count_spilled_queries
,      round(sum(bytes_spilled_to_local_storage/1024/1024/1024))  as local_gb
,      round(sum(bytes_spilled_to_remote_storage/1024/1024/1024)) as remote_gb
from Snowflake.account_usage.query_history
where warehouse_size is not null
group by 1, 2
having local_gb > 1
order by six desc;

Conclusion

Clearly on Snowflake, spilling to local storage can have a significant impact upon query performance and any query that spills more than a gigabyte of storage is a candidate to run on a larger warehouse.  Typically, if a query includes significant spilling to storage is will execute twice as fast for the same cost on the next size virtual warehouse.

Spilling to remote storage impacts both performance and cost as Snowflake charges per second for virtual warehouse utilization and spilling to remote storage extends the query elapsed time.  Eliminating remote storage therefore leads to a reduced cost when the query is executed on a larger warehouse.  While this may seem a paradox, the results shown here are consistent with my personal experience.

Be aware however, that spilling to local storage is only important when it exceeds the gigabyte level, although remote spilling should always be avoided.

Notice Anything Missing?

No annoying pop-ups or adverts. No bull, just facts, insights, and opinions. Sign up below, and I will ping you a mail when new content is available. I will never spam you or abuse your trust. Alternatively, you can leave a comment below or connect on LinkedIn and Twitter, and I’ll notify you of new articles.

Disclaimer: The opinions expressed on this site are entirely my own and will not necessarily reflect those of my employer.

John A. Ryan

John A. Ryan

Previous
Previous

Snowflake Search Optimization Service Best Practices

Next
Next

Snowflake Data Loading Best Practices: Bulk COPY operations