Importing Large CSVs into BigQuery

Tips on cleaning and importing large CSVs into BigQuery

BigQuery has a built-in tool for importing CSVs, but it often struggles with larger files because they typically have data inconsistencies and format variations.

That said, once you understand what’s possible, and how you can avoid the issues by design, importing large CSVs is a breeze. Here are a few tips I’ve picked up over the years that might help.


Using Google Cloud Storage

Importing flat file data (CSVs and other supported file types) from Google Cloud Storage is always the easiest approach.

You can connect to other file storage systems like AWS S3, but in my experience GCS is the simplest and least error-prone. I always recommend using it whenever possible.

What’s considered large?

How big is large? Files that are 25 MB and up.

These are best handled in GCS, especially when you’re generating or scrubbing the CSVs programmatically. This file size is generally the tipping point at which I find myself switching tactics to a streaming approach rather than a direct upload.

Streaming

When a file is above 25 MB, you’ll need to stream rows into or out of your CSV without hitting memory issues, and GCS has a nice API for handling streams.

Using streaming has many distinct advantages beyond this though:

  1. Saves Memory: Only small chunks of data are handled at a time, so you don’t overload your system
  2. Faster Output: You can start writing the file right away instead of waiting for all data to be ready
  3. Prevents Crashes: Processing smaller parts at a time reduces the risk of running out of memory
  4. Scales Easily: It works for any file size without needing special tweaks
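
For example, here’s a minimal sketch of streaming rows straight into a GCS object with the google-cloud-storage Python client. The bucket name, object name, and row source are hypothetical; blob.open("w") returns a file-like writer that uploads in chunks, so only a small buffer sits in memory at any time.

import csv
from google.cloud import storage  # pip install google-cloud-storage

def stream_rows_to_gcs(rows, bucket_name="bigquery-etl", blob_name="daily-erp-export.csv"):
    """Write an iterable of rows to GCS as a pipe-delimited CSV, in chunks."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)

    # The writer uploads as it goes, so the full CSV is never held in memory.
    with blob.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow(["order_id", "customer", "total"])  # hypothetical header
        for row in rows:
            writer.writerow(row)

Feeding this a database cursor or a paginated API response lets you build a very large CSV without ever materializing it locally.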

Google vs. AWS/Azure/etc.

Keeping everything within Google Cloud means things tend to run more smoothly, with fewer issues. You can certainly pull files in from other services like Amazon S3 or Azure Files, but using Google Cloud Storage has distinct advantages:

  1. Better Integration: GCS works seamlessly with BigQuery since both are Google Cloud services, reducing setup time and complexity
  2. No Transfer Fees: Moving data between GCS and BigQuery doesn’t cost extra, unlike transferring data from S3, which incurs fees
  3. Faster Access: Data transfers between GCS and BigQuery are faster due to both being on the same infrastructure
  4. Unified Security: Manage security with a single IAM system across both BigQuery and GCS, instead of juggling AWS and Google Cloud permissions
  5. Simpler Workflows: Automation and workflows are easier with GCS and BigQuery since they use the same tools and APIs
  6. Automatic Encryption: GCS automatically encrypts data at rest, so you don’t have to worry about configuring encryption yourself

Automated Scheduled Imports

Once you have your CSV in Google Cloud Storage, you’ll need to set up a recurring import that pulls the data into a BigQuery table. This can be done by creating a scheduled query.

Scheduled queries can be set up in BigQuery to import CSV data directly from a GCS bucket using the LOAD DATA syntax.

LOAD DATA OVERWRITE `bigquery-project-id.reporting.daily-erp-export`
FROM FILES (
  format = 'CSV',
  uris = ['gs://bigquery-etl/daily-erp-export-20231208.csv'],
  field_delimiter = '|',
  skip_leading_rows = 1
);

In this example, bigquery-etl is the GCS bucket, the data lands in the reporting.daily-erp-export table, and the CSV uses the pipe symbol as a delimiter.
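
If you’d rather create the scheduled query programmatically than through the console, a rough sketch with the BigQuery Data Transfer Service Python client might look like the following. The project ID, location, display name, and schedule are assumptions; the query is the LOAD DATA statement from above.

from google.cloud import bigquery_datatransfer  # pip install google-cloud-bigquery-datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Hypothetical project and location; match these to where your dataset lives.
parent = transfer_client.common_location_path("bigquery-project-id", "us")

load_data_sql = """
LOAD DATA OVERWRITE `bigquery-project-id.reporting.daily-erp-export`
FROM FILES (
  format = 'CSV',
  uris = ['gs://bigquery-etl/daily-erp-export-20231208.csv'],
  field_delimiter = '|',
  skip_leading_rows = 1
);
"""

transfer_config = bigquery_datatransfer.TransferConfig(
    display_name="Daily ERP CSV load",   # hypothetical name
    data_source_id="scheduled_query",    # marks this as a scheduled query
    params={"query": load_data_sql},
    schedule="every 24 hours",           # assumed cadence
)

created = transfer_client.create_transfer_config(
    parent=parent,
    transfer_config=transfer_config,
)
print(f"Created scheduled query: {created.name}")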

Using Pipe Delimiters

Use a pipe symbol (|) as your field_delimiter; it results in far fewer conflicts and issues with typical CSV content. If your data is likely to contain pipe characters, choose any other character that would never appear within a string.

Pipes are better than commas for many reasons beyond this; it’s something I’d always recommend as a general tip to avoid conversion errors.

Load Data Configuration Options

The LOAD DATA call in BigQuery can be configured with a number of options, which are important to know and understand when working with big CSVs. Beyond the format, uris, field_delimiter, and skip_leading_rows options used in the example above, there are options such as encoding, null_marker, max_bad_records, allow_quoted_newlines, and allow_jagged_rows.

For more information on loading CSVs into BigQuery, use Google’s docs.
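
The same knobs are available if you load with the BigQuery Python client instead of a LOAD DATA statement. Here’s a hedged sketch using LoadJobConfig; the bucket, file, and table names reuse the earlier example, and the option values are illustrative rather than recommendations.

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="|",          # pipe-delimited, as discussed above
    skip_leading_rows=1,          # skip the header row
    encoding="UTF-8",
    allow_quoted_newlines=True,   # tolerate newlines inside quoted fields
    max_bad_records=0,            # fail fast on malformed rows
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://bigquery-etl/daily-erp-export-20231208.csv",
    "bigquery-project-id.reporting.daily-erp-export",
    job_config=job_config,
)
load_job.result()  # block until the load finishes (raises on failure)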

Character Encoding Issues

The most common issue I see with CSV imports, both large and small, is character encoding. Specifically, CSVs that contain a Byte Order Mark (BOM) or characters that aren’t valid UTF-8.

When those characters are present, the import will fail. To avoid the issue altogether, make sure all text is encoded as UTF-8, without a BOM, when generating the CSV.
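
One low-effort way to guarantee that is to re-encode the file before uploading. A minimal sketch in Python: reading with utf-8-sig consumes a leading BOM if one exists, and writing back out as plain UTF-8 leaves you with a clean file (the paths here are hypothetical).

# Re-encode a CSV as clean UTF-8, stripping a leading BOM if present.
src = "daily-erp-export-raw.csv"   # hypothetical input
dst = "daily-erp-export.csv"       # hypothetical cleaned output

with open(src, "r", encoding="utf-8-sig") as infile, \
        open(dst, "w", encoding="utf-8", newline="") as outfile:
    for line in infile:            # line-by-line keeps memory flat for large files
        outfile.write(line)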

Adhere to the Hidden Rules

BigQuery has a set of naming rules that might affect your data on import. For example, any spaces in your column headers will be replaced with underscores automatically.

Don’t use any of BigQuery’s reserved prefixes in your column names, or things will break: _TABLE_, _FILE_, and _PARTITION are the main offenders.

Schema Auto-detection

BigQuery has a schema auto-detection feature that is useful in some cases, but it’s important to understand how it works to avoid the never-ending errors and issues that are common with larger CSV source files.

When auto-detection is enabled, BigQuery selects a random file in the data source and scans up to the first 500 rows as a representative sample, then examines each field and infers a data type from the values it sees. If every sampled value in a column is empty, the column defaults to the STRING data type. This means that if the first 500 rows of your data contain no null values but subsequent rows do, you’ll end up with painful errors that are pretty difficult to track down and understand.

In many circumstances, it’s beneficial to define your own schema, in which case I recommend using a JSON schema file.
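
A schema file is just a JSON array of column definitions. As a sketch (the column names here are hypothetical), you can generate one in Python and read it back with the BigQuery client’s schema_from_json helper, then pass the result to a load job with auto-detection turned off:

import json
from google.cloud import bigquery  # pip install google-cloud-bigquery

# Hypothetical column definitions for the ERP export.
schema_json = [
    {"name": "order_id", "type": "INT64", "mode": "REQUIRED"},
    {"name": "order_date", "type": "DATE", "mode": "NULLABLE"},
    {"name": "customer", "type": "STRING", "mode": "NULLABLE"},
    {"name": "total", "type": "NUMERIC", "mode": "NULLABLE"},
]

with open("erp_schema.json", "w", encoding="utf-8") as f:
    json.dump(schema_json, f, indent=2)

# Read the file back as SchemaField objects and pin the schema on the load.
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="|",
    skip_leading_rows=1,
    schema=client.schema_from_json("erp_schema.json"),
    autodetect=False,  # never let a 500-row sample decide the types
)

With the schema pinned this way, a null that first appears on row 50,000 can’t silently change how a column is typed.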

More details on the options and methods available for LOAD DATA can be found in BigQuery’s official documentation.

Conclusion

Hopefully, this gives you some solid advice for importing large CSVs into BigQuery.

At the very least, it should provide some clarity about what BigQuery can do for your organization and analytics workflows.