BigQuery has a built-in tool for importing CSVs, but it often struggles with larger files because they typically have data inconsistencies and format variations.
That said, once you understand what’s possible and how to avoid these issues by design, importing large CSVs is a breeze. Here are a few tips I’ve picked up over the years that might help.
Importing flat-file data from CSVs and other supported file types is easiest when the files are in Google Cloud Storage.
You can connect to other file storage systems like AWS S3, but in my experience GCS is the simplest and least error-prone, and I recommend using it whenever possible.
How large is big? Files that are 25 MB and up.
These are best handled in GCS, especially when you’re generating or scrubbing the CSVs programmatically. That size is generally the tipping point at which I switch from a direct upload to a streaming approach.
When a file is above 25 MB, you’ll need to stream rows into or out of your CSV without hitting memory issues, and GCS has a nice API for handling streams.
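As a rough sketch, here’s what streaming a pipe-delimited CSV into a bucket can look like with the google-cloud-storage Python client. The bucket and object names match the LOAD DATA example later in this post, and generate_rows() is a hypothetical stand-in for whatever produces your data:

import csv
from google.cloud import storage

# Hypothetical generator that yields one row at a time from your source
# system, so the full dataset never has to sit in memory.
def generate_rows():
    yield ["1", "Widget", "19.99"]
    yield ["2", "Gadget", "5.00"]

client = storage.Client()
blob = client.bucket("bigquery-etl").blob("daily-erp-export-20231208.csv")

# blob.open("w") streams the upload to GCS in chunks instead of buffering
# the whole file locally.
with blob.open("w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="|")
    writer.writerow(["id", "name", "amount"])  # header row
    for row in generate_rows():
        writer.writerow(row)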
Beyond memory management, keeping everything within Google Cloud has its own advantages; things tend to run more smoothly and with fewer issues. You can certainly pull files in from other services like Amazon S3 or Azure Files, but staying in Google Cloud Storage keeps the pipeline simpler and less error-prone.
Once your CSV is in Google Cloud Storage, you’ll need to set up a recurring import that pulls the data into a BigQuery table. This can be done by creating a scheduled query.
Scheduled queries can be set up in BigQuery to import CSV data directly from a GCS bucket using the LOAD DATA syntax:
LOAD DATA OVERWRITE `bigquery-project-id.reporting.daily_erp_export`  -- destination is a table, not a file
FROM FILES (
  format = 'CSV',
  uris = ['gs://bigquery-etl/daily-erp-export-20231208.csv'],
  field_delimiter = '|',
  skip_leading_rows = 1
);
In this example, bigquery-etl is the GCS bucket, and the CSV is using the pipe symbol as its field delimiter.
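If you’d rather create the scheduled query programmatically than click through the console, something along these lines with the BigQuery Data Transfer Python client should work; the project ID, display name, and schedule here are placeholders:

from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# The LOAD DATA statement above, run on a recurring schedule.
query = """
LOAD DATA OVERWRITE `bigquery-project-id.reporting.daily_erp_export`
FROM FILES (
  format = 'CSV',
  uris = ['gs://bigquery-etl/daily-erp-export-20231208.csv'],
  field_delimiter = '|',
  skip_leading_rows = 1
);
"""

transfer_config = bigquery_datatransfer.TransferConfig(
    display_name="Daily ERP export import",
    data_source_id="scheduled_query",  # scheduled queries use this data source
    params={"query": query},
    schedule="every 24 hours",
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path("bigquery-project-id"),
    transfer_config=transfer_config,
)
print(f"Created scheduled query: {transfer_config.name}")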
Use a pipe symbol (|) as your field_delimiter; it results in far fewer conflicts and issues with typical CSV content than a comma does. If your data is likely to contain pipe characters, choose another character that would never appear inside a value.
Switching from commas to pipes is a general tip I’d always recommend to avoid conversion errors.
The LOAD DATA call in BigQuery can be configured with a number of options, which are important to know and understand when working with big CSVs. Here’s a brief description of each:
skipLeadingRows – Skips the first N rows of the CSV file
fieldDelimiter – Specifies the delimiter used to separate fields in the CSV
quote – Specifies the character used for quoting values
allowQuotedNewlines – Allows quoted data sections to contain newline characters
allowJaggedRows – Allows rows with missing trailing fields to be read
nullMarker – Specifies the string that represents a NULL value in the file
encoding – Specifies the character encoding of the file
schema – Defines the schema of the table
writeDisposition – Specifies whether to overwrite or append to the table
createDisposition – Specifies whether to create the table if it doesn’t exist
maxBadRecords – Specifies the maximum number of bad records allowed before the import fails
ignoreUnknownValues – Ignores extra values that are not present in the schema
sourceFormat – Specifies the format of the input data

For more information on loading CSVs into BigQuery, see Google’s docs.
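Those camelCase names come from the load job configuration; the LOAD DATA statement exposes most of them as snake_case options (skip_leading_rows, field_delimiter, and so on), and they also map directly onto LoadJobConfig if you load through the BigQuery Python client instead. A minimal sketch, reusing the table and bucket from the earlier example:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,  # sourceFormat
    skip_leading_rows=1,                      # skipLeadingRows
    field_delimiter="|",                      # fieldDelimiter
    quote_character='"',                      # quote
    allow_quoted_newlines=True,               # allowQuotedNewlines
    allow_jagged_rows=True,                   # allowJaggedRows
    null_marker="NULL",                       # nullMarker
    encoding="UTF-8",                         # encoding
    max_bad_records=10,                       # maxBadRecords
    ignore_unknown_values=True,               # ignoreUnknownValues
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,      # writeDisposition
    create_disposition=bigquery.CreateDisposition.CREATE_IF_NEEDED,  # createDisposition
    autodetect=True,  # or supply an explicit schema instead
)

load_job = client.load_table_from_uri(
    "gs://bigquery-etl/daily-erp-export-20231208.csv",
    "bigquery-project-id.reporting.daily_erp_export",
    job_config=job_config,
)
load_job.result()  # wait for the load to complete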
The most common issue I see with CSV imports, both large and small, is character encoding. Specifically, CSVs that contain byte order mark (BOM) characters from an encoding other than UTF-8.
When those characters are found, the import will fail. To avoid the issue altogether, make sure to convert all text to UTF-8 when generating the CSV.
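If you don’t control how the file is generated, a quick re-encode before upload usually takes care of it. A minimal sketch, assuming the file is local; the filenames are placeholders:

# Re-encode a CSV as plain UTF-8 and drop any leading byte order mark.
# "utf-8-sig" reads UTF-8 with or without a BOM; swap in the actual source
# encoding (e.g. "utf-16" or "cp1252") if the file was exported as something else.
with open("daily-erp-export-raw.csv", "r", encoding="utf-8-sig") as src, \
     open("daily-erp-export-20231208.csv", "w", encoding="utf-8", newline="") as dst:
    for line in src:
        dst.write(line)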
BigQuery has a set of naming rules that might affect your data on import. For example, any spaces in your column headers will be replaced with underscores automatically.
Don’t use any of the following prefixes in your column names, or things will break:
_TABLE_
_FILE_
_PARTITION
_ROW_TIMESTAMP
__ROOT__
_COLIDENTIFIER
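A small pre-flight check over your header row can catch both problems before the load job does. This helper is just illustrative, not something BigQuery provides:

# Prefixes BigQuery reserves for its own pseudo-columns.
RESERVED_PREFIXES = (
    "_TABLE_", "_FILE_", "_PARTITION", "_ROW_TIMESTAMP", "__ROOT__", "_COLIDENTIFIER",
)

def clean_header(name: str) -> str:
    """Mirror BigQuery's space handling and reject reserved prefixes."""
    cleaned = name.strip().replace(" ", "_")  # BigQuery swaps spaces for underscores
    if cleaned.upper().startswith(RESERVED_PREFIXES):
        raise ValueError(f"Column name '{name}' uses a reserved BigQuery prefix")
    return cleaned

headers = [clean_header(h) for h in ["Order ID", "Customer Name", "Order Total"]]
print(headers)  # ['Order_ID', 'Customer_Name', 'Order_Total']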
BigQuery has an auto-schema detection feature that is useful in some cases, but it’s important to understand how it works to avoid never-ending errors and issues that are common with larger CSV source files.
Schema auto-detection analyzes the first 500 rows of a CSV, using the data it scans to determine an ideal data type for each column. This means that if the first 500 rows of your data contain no null values but subsequent rows do, you’ll end up with painful errors that are difficult to track down and understand.
When auto-detection is enabled, BigQuery selects a random file in the data source and scans up to the first 500 rows as a representative sample, then examines each field and attempts to assign it a data type based on the values in that sample. If every row in a column is empty, auto-detection defaults that column to the STRING data type.
In many circumstances, it’s beneficial to define your own schema, in which case I recommend using a JSON schema file.
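With the Python client, for instance, you can read a JSON schema file and pin the types yourself instead of relying on auto-detection; schema.json and its contents here are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# schema.json uses the standard BigQuery JSON schema format, e.g.
# [{"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
#  {"name": "amount", "type": "NUMERIC", "mode": "NULLABLE"}]
schema = client.schema_from_json("schema.json")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    field_delimiter="|",
    schema=schema,
    autodetect=False,  # use the explicit schema, not auto-detection
)

load_job = client.load_table_from_uri(
    "gs://bigquery-etl/daily-erp-export-20231208.csv",
    "bigquery-project-id.reporting.daily_erp_export",
    job_config=job_config,
)
load_job.result()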
More details on the options and methods available to LOAD DATA can be found in BigQuery’s official documentation.
Hopefully this gives you some solid advice for importing large CSVs into BigQuery.
At the very least, it should provide some clarity about what BigQuery can do for your organization and analytics workflows.