Optimizing query and storage job costs in your BigQuery data warehouse can reduce your bill, in some cases significantly, especially if your business works with large datasets and tables.
It’s a relatively common request among the organizations I work with as a BigQuery consultant, and I often see SQL inside of views that is less than optimal. A few small adjustments can go a long way, so I’ll outline the issues I see most often.
Avoid `SELECT *` in your SELECTs. BigQuery charges you based on the amount of data you scan when running a query, so being deliberate about what you select can significantly reduce costs in two ways: by scanning fewer columns and by scanning fewer rows.

First, the columns. With `SELECT *`, BigQuery reads every single column from the table, even if you only need a few of them for your analysis. This can be wasteful, especially for large tables with many columns. By explicitly specifying the columns you need with `SELECT column1, column2, ...`, BigQuery only scans those specific columns, reducing the amount of data processed and lowering your costs.
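To make this concrete, here’s a minimal sketch; the `myproject.mydataset.orders` table and its columns are hypothetical placeholders:

```sql
-- Scans every column in the table, even ones the analysis never uses.
SELECT *
FROM `myproject.mydataset.orders`;

-- Scans only the three columns referenced, so you're billed for far less data.
SELECT order_id, customer_id, order_total
FROM `myproject.mydataset.orders`;
```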
Second, the rows. In general, it’s best to scan the least amount of data (and the fewest rows) necessary to retrieve the information you need. If a large table isn’t partitioned, a standard SELECT will likely scan the entire thing on every query you run. That isn’t a problem for smaller tables, but once your data exceeds roughly 500k records, I’d recommend partitioning your tables to break them down into bite-sized chunks.
Imagine a huge filing cabinet with tons of folders. A BigQuery table is like that cabinet, and each folder holds data entries (rows). A partition acts like a divider within the cabinet, separating folders based on a specific characteristic. By doing this you reduce the amount of data queried significantly, particularly for larger tables with long-term data.
The most common approach to partitioning time series data is to create day-by-day partitions; Google Analytics takes a similar approach with its GA4 BigQuery export, which splits event data into one table per day.
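As a sketch of what this looks like (the table and column names are hypothetical), you can create a daily-partitioned copy of an existing table with a `PARTITION BY` clause:

```sql
-- Creates a table split into one partition per day of order_date.
-- Queries that filter on order_date only read the matching partitions.
CREATE TABLE `myproject.mydataset.orders_partitioned`
PARTITION BY order_date AS
SELECT
  order_id,
  customer_id,
  order_total,
  order_date  -- a DATE column; DATE, TIMESTAMP, and DATETIME columns can all drive partitioning
FROM `myproject.mydataset.orders`;
```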
If this sounds like a benefit for you, I’d suggest exploring Google’s documentation on partitioned tables, which provides specific guidance on setting them up.
Data types define the type and format of the data you store in each table column. When BigQuery knows the specific kind of information to expect, it can store and process that data in the most efficient way possible.
All databases have a set of data types they support, so it’s important to understand BigQuery’s data types well to determine what best suits the information you’ll be storing.
This doesn’t just depend on the data itself; it also depends on the queries you’ll be making. For example, if you’re storing data in BigQuery that’s going to be used to generate a map, then the `GEOGRAPHY` type is worth considering. But if you’re only looking to pinpoint certain locations, it may be easier to store latitude and longitude as decimals, so `FLOAT64` may be best.
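As a rough sketch (the store-locations table and its columns are made up for illustration), the two options look like this:

```sql
-- Option 1: a GEOGRAPHY column, useful if you'll run spatial functions
-- such as ST_DISTANCE or ST_WITHIN, or feed the data into mapping tools.
CREATE TABLE `myproject.mydataset.store_locations_geo` (
  store_id INT64,
  location GEOGRAPHY
);

-- Option 2: plain FLOAT64 latitude/longitude, simpler if you only need
-- to pinpoint or display individual coordinates.
CREATE TABLE `myproject.mydataset.store_locations_latlng` (
  store_id INT64,
  latitude FLOAT64,
  longitude FLOAT64
);
```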
To choose the right data type for a BigQuery table, you need to understand both the data you’ll be storing and the queries you’ll run against it.
The following are some general recommendations on choosing the right data types, but it’s ultimately something you’ll need to decide on your own.
Data Type | BigQuery Type | Description | Use Cases |
---|---|---|---|
Integer | `INT64` | Whole numbers | Product IDs, counts, quantities |
Float/Double | `FLOAT64` | Approximate numbers with decimals | Measurements, scientific data |
String | `STRING` | Text data | Names, addresses, descriptions |
Boolean | `BOOL` | True/false values | Flags (active/inactive), yes/no options |
Date/Time | `DATE`, `DATETIME`, `TIMESTAMP` | Specific dates and times | Order dates, timestamps, birthdates |
Currency | `NUMERIC` | Exact decimal values | Prices and other monetary amounts |
Choosing the right data type has many benefits; in terms of optimization, it matters most for how much data your queries scan and how much storage your tables consume. When you use appropriate data types in your tables’ schemas, you’ll typically improve query efficiency and reduce processing costs.
Efficient `WHERE` clauses play a crucial role in optimizing the performance of BigQuery SQL queries, especially when dealing with large datasets. By filtering rows with a `WHERE` clause, you reduce the amount of data BigQuery (or any database) has to work through: only rows that match the criteria are considered, which significantly improves query speed. In BigQuery specifically, the biggest billing wins come when the filter hits a partitioned or clustered column, because that lets the engine skip entire partitions or blocks rather than scanning them.
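Here’s a minimal sketch against the hypothetical `orders_partitioned` table from earlier; the filter on the partition column is what lets BigQuery skip whole partitions:

```sql
-- Only the partitions covering the last 30 days are read; the rest of
-- the table is pruned before any bytes are billed.
SELECT order_id, order_total
FROM `myproject.mydataset.orders_partitioned`
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);
```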
Because efficient `WHERE` clauses reduce the data scanned, they can improve execution time and lead to significant cost savings, especially for large datasets.

For frequently used complex queries, consider creating materialized views (pre-computed results) to avoid repetitive processing and lower query costs.
Materialized views provide the performance benefits of a table, but they’re kept up to date automatically as the underlying data changes. That makes them ideal for a table that updates on a schedule, because you don’t need an additional process to handle the refresh.
But there are some important downsides and limitations to understand when using materialized views: BigQuery restricts the SQL you can use in the view definition (only certain aggregate functions and query shapes are supported), and you pay for both the storage the view consumes and the automatic refreshes that keep it current.
That said, when the use case is right, materialized views can lead to HUGE performance improvements, so even with those limitations they’re still worth exploring.
Generally speaking, I’d explore using materialized views in the following cases (there’s a sketch after this list):

- Queries that aggregate large datasets with functions like `SUM`, `AVG`, or `ARRAY_AGG`.
- Queries with `JOIN`s between large tables, to speed up query execution.
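As a sketch of the first case (all names are hypothetical), a materialized view that pre-aggregates daily revenue from the orders table might look like this:

```sql
-- Pre-computes daily revenue per customer. BigQuery refreshes the view
-- incrementally as new rows land in the base table, and compatible
-- queries can be answered from the view instead of rescanning orders.
CREATE MATERIALIZED VIEW `myproject.mydataset.daily_revenue_mv` AS
SELECT
  customer_id,
  order_date,
  SUM(order_total) AS total_revenue,
  COUNT(*) AS order_count
FROM `myproject.mydataset.orders_partitioned`
GROUP BY customer_id, order_date;
```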
Beyond that, use BigQuery’s built-in query cost estimate to identify expensive queries and optimize them for better cost efficiency.
It doesn’t report costs in dollars, though: the query editor estimates how much data a query will process (MB/GB/TB). Enter a query into the BigQuery console and look for the estimated data processed in the top right of the editor (next to the green check mark), then cross-reference that figure against the on-demand rates on the Google Cloud BigQuery pricing page.
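If you’d rather look at queries that have already run, one option is the `INFORMATION_SCHEMA.JOBS_BY_PROJECT` view. This is a sketch, assuming your jobs run in the US multi-region and that you have permission to view project-level job metadata; adjust the region qualifier to match your own setup:

```sql
-- Surfaces the 20 most expensive queries (by bytes processed) from the
-- last 7 days so you know where optimization effort will pay off.
SELECT
  user_email,
  creation_time,
  total_bytes_processed / POW(1024, 3) AS gb_processed,
  query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 20;
```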
Optimizing queries to reduce costs in Google BigQuery can lead to substantial savings, particularly for organizations dealing with large datasets. If you want to reduce your data warehouse costs, I’d suggest exploring what I’ve outlined here; it’s very likely you’ll find room to improve. And if you need any help, consider hiring me as a BigQuery consultant to join your team.