Optimizing query and storage job costs in your BigQuery data warehouse can reduce your bill, in some cases significantly, especially if your business works with large datasets and tables.
It’s a relatively common request among the organizations I work with as a BigQuery consultant, and I often see SQL inside of views that is less than optimal. A few small adjustments can go a long way, so I’ll outline the issues I see most often.
Avoid `SELECT *` in your SELECTs. BigQuery charges you based on the amount of data you scan when running a query, so being deliberate about what you select can significantly reduce costs in two ways: by scanning fewer columns and by scanning fewer rows.

First, the columns. With `SELECT *`, BigQuery reads every single column from the table, even if you only need a few of them for your analysis. This can be wasteful, especially for large tables with many columns. By explicitly specifying the columns you need with `SELECT column1, column2, ...`, BigQuery only scans those specific columns, reducing the amount of data processed and lowering your costs.
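To make this concrete, here’s a minimal sketch; the `myproject.mydataset.orders` table and its columns are hypothetical placeholders:

```sql
-- Scans every column in the table, even ones the analysis never uses.
SELECT *
FROM `myproject.mydataset.orders`;

-- Scans only the three columns referenced, so you're billed for far less data.
SELECT order_id, customer_id, order_total
FROM `myproject.mydataset.orders`;
```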
Second, the rows. In general, it’s best to scan the least amount of data (and the fewest rows) necessary to retrieve the information you need. If a large table isn’t partitioned, a standard SELECT will likely scan the entire thing on every query you run. That isn’t a problem for smaller tables, but once your data exceeds roughly 500k records, I’d recommend partitioning your tables to break them down into bite-sized chunks.
Imagine a huge filing cabinet with tons of folders. A BigQuery table is like that cabinet, and each folder holds data entries (rows). A partition acts like a divider within the cabinet, separating folders based on a specific characteristic. By doing this you reduce the amount of data queried significantly, particularly for larger tables with long-term data.
The most common approach to partitioning time series data is to create day-by-day partitions; Google Analytics takes a similar approach with its GA4 BigQuery export, which splits event data into one table per day.
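As a sketch of what this looks like (the table and column names are hypothetical), you can create a daily-partitioned copy of an existing table with a `PARTITION BY` clause:

```sql
-- Creates a table split into one partition per day of order_date.
-- Queries that filter on order_date only read the matching partitions.
CREATE TABLE `myproject.mydataset.orders_partitioned`
PARTITION BY order_date AS
SELECT
  order_id,
  customer_id,
  order_total,
  order_date  -- a DATE column; DATE, TIMESTAMP, and DATETIME columns can all drive partitioning
FROM `myproject.mydataset.orders`;
```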
If this sounds like a benefit for you, I’d suggest exploring Google’s documentation on partitioned tables, which provides specific guidance on setting them up.
Data types define the type and format of the data you store in each table column. When BigQuery knows the specific kind of information to expect, it can store and process that data in the most efficient way possible.
All databases have a set of data types they support, so it’s important to understand BigQuery’s data types well to determine what best suits the information you’ll be storing.
This doesn’t just depend on the data itself; it also depends on the queries you’ll be making. For example, if you’re storing data in BigQuery that’s going to be used to generate a map, then the `GEOGRAPHY` type is worth considering. But if you’re only looking to pinpoint certain locations, it may be easier to store latitude and longitude as decimals, so `FLOAT64` may be best.
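As a rough sketch (the store-locations table and its columns are made up for illustration), the two options look like this:

```sql
-- Option 1: a GEOGRAPHY column, useful if you'll run spatial functions
-- such as ST_DISTANCE or ST_WITHIN, or feed the data into mapping tools.
CREATE TABLE `myproject.mydataset.store_locations_geo` (
  store_id INT64,
  location GEOGRAPHY
);

-- Option 2: plain FLOAT64 latitude/longitude, simpler if you only need
-- to pinpoint or display individual coordinates.
CREATE TABLE `myproject.mydataset.store_locations_latlng` (
  store_id INT64,
  latitude FLOAT64,
  longitude FLOAT64
);
```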
To choose the right data type for a BigQuery table, you need to understand both the data you’ll be storing and the queries you’ll run against it.
The following are some general recommendations on choosing the right data types, but it’s ultimately something you’ll need to decide on your own.
Data Type | BigQuery Type | Description | Use Cases |
---|---|---|---|
Integer | `INT64` | Whole numbers | Product IDs, counts, quantities |
Float/Double | `FLOAT64` | Approximate numbers with decimals | Measurements, scientific data |
String | `STRING` | Text data | Names, addresses, descriptions |
Boolean | `BOOL` | True/false values | Flags (active/inactive), yes/no options |
Date/Time | `DATE`, `DATETIME`, `TIMESTAMP` | Specific dates and times | Order dates, timestamps, birthdates |
Currency | `NUMERIC` | Exact decimal values | Prices and other monetary amounts |
Choosing the right data type has many benefits; in terms of optimization, it matters most for how much data your queries scan and how much storage your tables consume. When you use appropriate data types in your tables’ schemas, you’ll typically improve query efficiency and reduce processing costs.
Efficient `WHERE` clauses play a crucial role in optimizing the performance of BigQuery SQL queries, especially when dealing with large datasets. By filtering rows with a `WHERE` clause, you reduce the amount of data BigQuery (or any database) has to work through: only rows that match the criteria are considered, which significantly improves query speed. In BigQuery specifically, the biggest billing wins come when the filter hits a partitioned or clustered column, because that lets the engine skip entire partitions or blocks rather than scanning them.
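Here’s a minimal sketch against the hypothetical `orders_partitioned` table from earlier; the filter on the partition column is what lets BigQuery skip whole partitions:

```sql
-- Only the partitions covering the last 30 days are read; the rest of
-- the table is pruned before any bytes are billed.
SELECT order_id, order_total
FROM `myproject.mydataset.orders_partitioned`
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);
```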
Because efficient `WHERE` clauses reduce the data scanned, they can improve execution time and lead to significant cost savings, especially for large datasets.

For frequently used complex queries, consider creating materialized views (pre-computed results) to avoid repetitive processing and lower query costs.
Materialized views provide the performance benefits of a table, but they’re kept up to date automatically as the underlying data changes. That makes them ideal for a table that updates on a schedule, because you don’t need an additional process to handle the refresh.
But there are some important downsides and limitations to understand when using materialized views: BigQuery restricts the SQL you can use in the view definition (only certain aggregate functions and query shapes are supported), and you pay for both the storage the view consumes and the automatic refreshes that keep it current.
That said, when the use case is right, materialized views can lead to HUGE performance improvements, so even with those limitations they’re still worth exploring.
Generally speaking, I’d explore using materialized views in the following cases (there’s a sketch after this list):

- Queries that aggregate large datasets with functions like `SUM`, `AVG`, or `ARRAY_AGG`.
- Queries with `JOIN`s between large tables, to speed up query execution.
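As a sketch of the first case (all names are hypothetical), a materialized view that pre-aggregates daily revenue from the orders table might look like this:

```sql
-- Pre-computes daily revenue per customer. BigQuery refreshes the view
-- incrementally as new rows land in the base table, and compatible
-- queries can be answered from the view instead of rescanning orders.
CREATE MATERIALIZED VIEW `myproject.mydataset.daily_revenue_mv` AS
SELECT
  customer_id,
  order_date,
  SUM(order_total) AS total_revenue,
  COUNT(*) AS order_count
FROM `myproject.mydataset.orders_partitioned`
GROUP BY customer_id, order_date;
```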
Beyond that, use BigQuery’s built-in query cost estimate to identify expensive queries and optimize them for better cost efficiency.
It doesn’t report costs in dollars, though: the query editor estimates how much data a query will process (MB/GB/TB). Enter a query into the BigQuery console and look for the estimated data processed in the top right of the editor (next to the green check mark), then cross-reference that figure against the on-demand rates on the Google Cloud BigQuery pricing page.
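If you’d rather look at queries that have already run, one option is the `INFORMATION_SCHEMA.JOBS_BY_PROJECT` view. This is a sketch, assuming your jobs run in the US multi-region and that you have permission to view project-level job metadata; adjust the region qualifier to match your own setup:

```sql
-- Surfaces the 20 most expensive queries (by bytes processed) from the
-- last 7 days so you know where optimization effort will pay off.
SELECT
  user_email,
  creation_time,
  total_bytes_processed / POW(1024, 3) AS gb_processed,
  query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 20;
```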
Optimizing queries to reduce costs in Google BigQuery can lead to substantial savings, particularly for organizations dealing with large datasets. If you want to reduce your data warehouse costs, I’d suggest exploring what I’ve outlined here; it’s very likely you’ll find room to improve. And if you need any help, consider hiring me as a BigQuery consultant to join your team.