The Differences Between Amazon Redshift, BigQuery, and Snowflake
Deciding which cloud data warehouse to use involves many qualitative and quantitative factors. These factors are dependent upon your business and use case. For the purposes of this post, we are going to focus on what we think are the ideal use cases for each data warehouse and the pricing structure of those warehouses.
For each warehouse, we’ll examine the pricing structure of that warehouse and how it compares to the others on our list. We’ll conclude by explaining what Untitled thinks are the ideal scenarios for the use of each warehouse.
Amazon Redshift has been a major player in the data warehousing space for almost a decade. However, until recently, their pricing structure remained roughly the same. Currently, there are several different pricing models depending on the volume of data in question, as well as the use cases involved.
Workloads in Redshift run on nodes of varying types. The older nodes, DC2 (dense compute) and DS2 (dense storage) bundle compute and storage together. For best performance and a lower price, if your datasets are less than 1 terabyte, these nodes, specifically DC2, are an ideal choice. The newer RA3 nodes are loaded with high-performance SSDs and separate the pricing of storage and compute. RA3s are designed for much larger volumes of data and provide high performance with a pricing structure that provides more granularity into how your budget is being used.
In addition to node types, Redshift Spectrum is a feature of Redshift that allows for exabytes of data to be queried from Amazon S3. Redshift Spectrum runs on the nodes that you are currently using, and charges $5.00 per terabyte of scanned data within S3. This allows for anyone with access to Redshift to perform SQL queries on massive amounts of data stored in different buckets within S3.
Each of the node types that Redshift offers use on-demand pricing. This allows for capacity to be paid for by the hour without commitments or upfront costs. However, to obtain a lower per hour price, you can purchase reserved capacity from AWS in 1 or 3 year increments. In either case, Redshift can be paused, and later resumed, which suspends on-demand billing. This allows for the cost-effective management of multiple Redshift instances, or for development and test purposes.
However, with cloud data warehousing becoming more popular, and the demand for data from those same warehouses increasing, throttling usage by pausing and resuming Redshift instances can quickly become extremely inefficient.
When Does Untitled Use Redshift?
At Untitled, we find the best use case for Redshift to be when there is a need for around the clock access and usage of a specific instance. Whether it’s data pipelines constantly loading data into Redshift, or data teams and applications continually querying against it, Redshift is an ideal solution for these scenarios.
Although Redshift is a very powerful data warehouse, it comes bloated with a lot of features. If you or your team wants to spend less time managing a warehouse and spend more time gaining insight from the data that’s stored in the warehouse, Redshift might not be the optimal choice. Also, if you are not using a large majority of the capacity you are paying for from Redshift, there are other alternatives available.
BigQuery is Google Cloud Platform’s offering of a petabyte scale data warehouse. Unlike Amazon Redshift, BigQuery is completely serverless, meaning that there is no management of servers and no upfront provisioning. These features are coupled with economics that allow you to only pay for what you use. This allows for BigQuery to scale to meet the needs of your workloads on its own. BigQuery also decouples storage from compute, allowing it to scale easily, as well as giving more granularity to where your costs are coming from.
There are two pricing variants for BigQuery: on-demand and flat rate pricing. On-demand pricing allows users to pay a flat fee for the amount of data that is processed per query. Underneath the hood, BigQuery bases this on the number of bytes that are read. The pricing of bytes read or processed is the same regardless of whether the data is stored in BigQuery or an external source such as Cloud Storage or Bigtable.
For customers who want more stable costs, you can purchase flat-rate pricing. When you purchase flat-rate pricing, you are no longer paying for the amount of data being processed, but are rather paying for a dedicated amount of query processing capacity. This capacity is measured in BigQuery slots. As queries are run, they consume the slot capacity you have purchased. In a situation where your capacity demands exceed the amount of capacity you have purchased, BigQuery will queue up slots, and you will not be charged any additional fees.
When Does Untitled Use BigQuery?
At Untitled, we see some of the best use cases for BigQuery coming from data analysis and batch processing. With the pay for what you use pricing model, BigQuery may be the best warehouse to have your data analysts and data scientists work in. BigQuery can be available all the time, but you only incur costs when queries are run. With consistent performance marks due to the serverless nature of the warehouse, data workers will experience snappy performance while also allowing managers to project costs by knowing the data involved in the projects they are running. BigQuery also has an extension called BigQuery Omni that enables you to have a multi-cloud analytics solution. This may be a highly valuable option if you have data that you need to query stored in AWS and Azure (which is coming soon according to GCP).
On the other hand, Untitled does not recommend BigQuery if you want to have real time, or near real time data extraction and availability from your warehouse. If you are running on the on-demand pricing structure, this can quickly increase your data warehousing bill as queries will constantly be running and reading loads of data, which in turn drives your bill higher. We also recommend that users who consistently work with BigQuery, to be conscientious of the queries that they are writing. Because BigQuery charges users by the amount of data processed, writing a lot of SELECT * queries can be inefficient and unnecessarily increase costs.
Snowflake is a newer data warehouse offering that is independent from the major cloud providers but integrates seamlessly with the cloud provider of your choice. Like BigQuery, Snowflake decouples storage from compute for many of the same reasons. However, unlike BigQuery or Redshift, Snowflake focuses it’s pricing model on the compute consumed during workloads. Instead of charging based on availability or data processed, Snowflake only charges when a server, or what they consider a warehouse, is running. After a process on Snowflake completes and there are no other processes running, the warehouse will automatically pause itself based on a predefined amount of time that elapses after the last process completes. Once a new process initiates on that warehouse, it will immediately come back online, allowing for immediate availability and low latency, at a low cost.
Again, in case you didn’t catch it above, Snowflake refers to the actual servers, or compute used, as a warehouse. So when you read about warehouse pricing or use cases below, we are talking about how each server is priced or optimally used.
The transparent pricing model of Snowflake is simple. You pay by the hour for the size of the warehouse you want to run based on the edition of your account. Snowflake accounts come in three different account editions: Standard, Enterprise, and Business Critical. While we won’t discuss the details of these editions in this blog, they do get incrementally more expensive in the order that they are listed above. Warehouse sizes in Snowflake come in T-shirt sizes S, M, XL etc. Each size has an hourly price associated with it, so it’s easy to calculate the cost of running your warehouses by knowing how long they typically are online.
Now you may be wondering, why would I use Snowflake when the rest of my cloud resources are in AWS, GCP, or Azure? Well, Snowflake integrates seamlessly with each of the major cloud providers and allows you to choose which provider you would like to have the underlying storage of your data within Snowflake to be hosted with. This allows you to take advantage of all features on your current cloud platform while benefiting from using Snowflake as your primary data warehouse.
When Does Untitled Use Snowflake?
Untitled is cloud platform agnostic and has worked with all of the major warehouses, but we think Snowflake is the most versatile when it comes to business use cases. It’s similar to BigQuery in that you are only paying for what you use, but instead of that usage being data, you are paying for compute. For example, in BigQuery, if you were to run a query that processed 2 TBs of data, that would currently cost $10. In Snowflake, if you processed the same amount of data on a small (S) warehouse and the query ran for 60 seconds, it would currently cost you somewhere between $0.03-0.06 depending on the edition of Snowflake that you have. This is because we’re only paying for the 60 seconds that we used the server to run the query, not the data we processed.
Snowflake is also similar to Redshift in that it allows for you to consistently have compute available, but will automatically pause your warehouses when you are not using them. This eliminates the need to manage your servers or make sure Redshift pause/resume schedules are updated properly. Because of these two major advantages, Snowflake is a popular choice that we encourage our clients to consider.
While we think Snowflake is a great cloud data warehouse solution, the optimal warehouse for you is entirely based upon your business use-cases and stakeholder needs. We also recognize that pricing and functionality of the warehouse should not be the sole determining factors in your data warehouse decision. However, as we have already stated, we highly recommend clients and data warehouse shoppers to consider Snowflake as we think it has the broadest range of use cases. In our next blog, we will explain in further detail why Untitled likes Snowflake so much and why it might be a great fit for your needs.