BigQuery Best Practices

Estimated time to read: 12 minutes

Here are 20 advanced tips and tricks to simplify your work with BigQuery:

1. Partitioned Architecture Strategy: Partition your tables on a DATE or TIMESTAMP column to drastically reduce the amount of data scanned and improve query performance while lowering costs.

2. Strategic Clustering Consistency: Cluster your tables on frequently filtered or joined columns to further optimise query performance and reduce compute overhead.

3. Optimised Date Extraction: Use the EXTRACT function and other date functions to work with date parts instead of expensive string functions, which improves both readability and execution time.

4. Precision Columnar Scoping: Select only the necessary columns in your queries to minimise data processing and avoid the performance penalties of SELECT *.

5. High-Velocity Approximate Aggregation: Use APPROX_COUNT_DISTINCT and other approximate functions for ultra-fast results when working with massive datasets where 100% precision is not the primary requirement.

6. Caching Persistence: BigQuery automatically caches query results for 24 hours. Rerunning identical queries within this window is free and near-instantaneous.

7. Common Table Expressions (CTEs): Use WITH clauses to modularise your SQL, making complex queries significantly more readable and maintainable.

8. Analytical Window Functions: Utilise window functions (e.g., ROW_NUMBER, RANK) to perform complex analytical calculations like running totals without using self-joins.

9. Array Manipulation and Flattening: Employ ARRAY_AGG and UNNEST to efficiently work with nested data structures and simplify complex join logic.

10. Architectural Structuring: Use STRUCT to group related fields into logical objects, improving the organisation of your data schema.

11. Extensible UDF Integration: Create User-Defined Functions (UDFs) in JavaScript or SQL for custom business logic that is not supported by standard BigQuery functions.

12. Automated Query Scheduling: Schedule recurring queries to automate data pipelines, ETL processes, and reporting tasks directly within the console.

13. Predictive Cost Estimation (Dry Run): Use the --dry_run flag in the CLI to estimate query costs and data volume before executing, preventing unexpected expenses.

14. Performance and Observability Monitoring: Analyse execution plans and query history using the INFORMATION_SCHEMA.JOBS_BY_* views to identify and resolve bottlenecks.

15. Intermediate Result Persistence: Save complex query results into permanent destination tables to facilitate further analysis without re-computing the entire dataset.

16. Efficient Batch Ingestion: Load data in large batches to minimise API latency and improve the overall efficiency of your ingestion pipelines.

17. External Federated Querying: Use federated sources to query data directly in Google Sheets or Cloud Storage without the overhead of importing it into BigQuery.

18. Storage Density Optimisation: Leverage BigQuery's native columnar storage format to minimise your storage footprint and reduce costs.

19. Granular Access Control (IAM): Use Identity and Access Management (IAM) to enforce the principle of least privilege across datasets, tables, and views.

20. Real-Time Data Ingestion: Use streaming inserts for ultra-low latency data analysis, ensuring your dashboards reflect the most current state of the business.

These tips and tricks will help you simplify your work with BigQuery, optimise your queries, and get the most out of your data analysis. By incorporating these best practices into your workflow, you'll improve performance, reduce costs, and streamline your data processing tasks.

Below, I will provide example code and steps to apply some of the tips mentioned above. We'll start by creating a dataset and a few tables to work with.

Create a dataset

To create a dataset in BigQuery, you can use the Google Cloud Console or the bq command-line tool. Here's the command to create a dataset using the bq tool:

bq mk mydataset

Replace mydataset with the desired name for your dataset.

Create tables

Let's create two tables, orders and order_items. We'll partition and cluster the orders table for better performance.

Create the orders table

CREATE TABLE mydataset.orders (
  order_id INT64,
  customer_id INT64,
  order_date DATE,
  status STRING
)
PARTITION BY order_date
CLUSTER BY customer_id;

Create the order_items table

CREATE TABLE mydataset.order_items (
  order_item_id INT64,
  order_id INT64,
  product_id INT64,
  quantity INT64,
  price FLOAT64
);

Load data

Load sample data into the tables. You can upload your data to Google Cloud Storage and then load it into BigQuery using the bq tool or the BigQuery Console.
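For local experimentation, you first need a file to load. A minimal sketch, using only the standard library, that generates a CSV matching the orders schema (the file name and sample values are illustrative, not prescribed by BigQuery):

```python
import csv

# Sample rows matching the mydataset.orders schema:
# order_id, customer_id, order_date, status
rows = [
    (1, 123, "2023-01-01", "completed"),
    (2, 456, "2023-01-02", "processing"),
    (3, 123, "2023-01-02", "completed"),
]

with open("orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "customer_id", "order_date", "status"])
    writer.writerows(rows)
```

You can then copy the file to a Cloud Storage bucket and load it with the bq tool, remembering to skip the header row.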

Example query using tips

Now, let's write an example query that utilises some of the tips we discussed earlier.

WITH daily_order_stats AS (
  SELECT
    order_date,
    COUNT(DISTINCT o.order_id) AS num_orders,
    APPROX_COUNT_DISTINCT(customer_id) AS unique_customers,
    SUM(oi.total_price) AS total_revenue
  FROM
    mydataset.orders o
  JOIN (
    SELECT
      order_id,
      SUM(quantity * price) AS total_price
    FROM
      mydataset.order_items
    GROUP BY
      order_id
  ) oi
  ON o.order_id = oi.order_id
  WHERE
    order_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  GROUP BY
    order_date
),
moving_average AS (
  SELECT
    order_date,
    total_revenue,
    AVG(total_revenue) OVER (ORDER BY order_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS revenue_moving_average
  FROM
    daily_order_stats
)
SELECT
  order_date,
  total_revenue,
  revenue_moving_average
FROM
  moving_average
ORDER BY
  order_date;

This query calculates the daily revenue and 7-day moving average for the last 30 days, using the following tips:

  • WITH clauses (CTEs) to modularise the query (Tip 7)
  • APPROX_COUNT_DISTINCT for an approximate unique customer count (Tip 5)
  • A JOIN against a pre-aggregated subquery to calculate order totals
  • DATE_SUB and CURRENT_DATE to filter the partitioned order_date column to the last 30 days (Tips 1 and 3)
  • A window function for the moving average calculation (Tip 8)
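
For intuition, the trailing window `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` can be mimicked in plain Python (the revenue figures below are made up for illustration):

```python
# Hypothetical daily revenue values, ordered by date.
daily_revenue = [100.0, 200.0, 150.0, 300.0, 250.0, 180.0, 220.0, 400.0]

# For each row, average the current value and up to 6 preceding values,
# mirroring AVG(...) OVER (ORDER BY ... ROWS BETWEEN 6 PRECEDING AND CURRENT ROW).
moving_avg = []
for i in range(len(daily_revenue)):
    window = daily_revenue[max(0, i - 6): i + 1]
    moving_avg.append(sum(window) / len(window))

print(moving_avg[0])  # first row averages only itself
print(moving_avg[7])  # eighth row averages the 7 rows ending at it
```

Note that the first six rows average fewer than seven values, exactly as the SQL window does before a full window is available.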

This example demonstrates how to apply several advanced tips and tricks when working with BigQuery. By following these guidelines, you can simplify your work, optimise query performance, and get the most out of your data analysis.

Here's a continuation of the example that demonstrates more advanced tips in BigQuery:

Create a User-Defined Function (UDF)

Let's create a simple JavaScript UDF to calculate the price with a discount applied:

CREATE TEMPORARY FUNCTION discounted_price(price FLOAT64, discount FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
  return price * (1 - discount);
""";

Query using the UDF

Now, let's use this UDF in a query to calculate the discounted price for each order item. Because the function is temporary, the CREATE TEMPORARY FUNCTION statement and the query below must be run together in the same script:

SELECT
  order_item_id,
  product_id,
  quantity,
  price,
  discounted_price(price, 0.1) AS discounted_price
FROM
  mydataset.order_items;

This query calculates the discounted price for each order item with a 10% discount, utilising the UDF we created (Tip 11).
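The UDF body is a one-liner; for reference, the same logic expressed in Python:

```python
def discounted_price(price: float, discount: float) -> float:
    # Mirrors the JavaScript UDF body: price * (1 - discount)
    return price * (1 - discount)

print(discounted_price(100.0, 0.1))   # 10% off 100.0
print(discounted_price(200.0, 0.25))  # 25% off 200.0
```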

Create a View with STRUCT and ARRAY_AGG

Let's create a view that aggregates the order items into an array, using STRUCT for complex data structures and ARRAY_AGG to build an array:

CREATE VIEW mydataset.orders_with_items AS
SELECT
  o.order_id,
  o.customer_id,
  o.order_date,
  o.status,
  ARRAY_AGG(STRUCT(oi.order_item_id, oi.product_id, oi.quantity, oi.price)) AS items
FROM
  mydataset.orders o
JOIN
  mydataset.order_items oi
ON o.order_id = oi.order_id
GROUP BY
  o.order_id, o.customer_id, o.order_date, o.status;

This view combines the orders and order_items tables using STRUCT and ARRAY_AGG (Tips 10 and 9).

Query the View with UNNEST

Now, let's query the view and use UNNEST to flatten the items array:

SELECT
  order_id,
  customer_id,
  order_date,
  status,
  item.order_item_id,
  item.product_id,
  item.quantity,
  item.price
FROM
  mydataset.orders_with_items,
  UNNEST(items) AS item;

This query uses UNNEST to work with the array data in the view (Tip 9).
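Conceptually, UNNEST behaves like a join between each row and the elements of its own array. The same flattening in Python, using hypothetical sample rows shaped like the view's output:

```python
# Each order carries a nested list of items, like the view's items ARRAY<STRUCT>.
orders_with_items = [
    {"order_id": 1, "items": [{"product_id": 10, "quantity": 2},
                              {"product_id": 11, "quantity": 1}]},
    {"order_id": 2, "items": [{"product_id": 10, "quantity": 5}]},
]

# Mirrors: FROM orders_with_items, UNNEST(items) AS item
flat = [
    {"order_id": o["order_id"], **item}
    for o in orders_with_items
    for item in o["items"]
]

for row in flat:
    print(row)
```

Each output row pairs the parent order's columns with one array element, which is exactly what the SQL above returns.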

Schedule a Query

You can schedule a recurring query in BigQuery to automate data processing and reporting tasks (Tip 12). To do this, use the BigQuery Console:

  • Navigate to the BigQuery Console.
  • Click on "Scheduled queries" in the left-hand menu.
  • Click "New scheduled query" and configure the query, schedule, and destination table for the query results.

Monitor Query Performance

Analyse your query performance using BigQuery's built-in monitoring tools and the INFORMATION_SCHEMA views (Tip 14). To view query history and performance:

  • Navigate to the BigQuery Console.
  • Click on "Query history" in the left-hand menu.
  • Select a query to view details such as execution time, bytes processed, and slot usage.

You can also query the INFORMATION_SCHEMA views to analyse query performance. For example, to retrieve the 10 most recent queries and their execution times:

SELECT
  query,
  start_time,
  end_time,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_seconds
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  state = 'DONE'
  AND query IS NOT NULL
ORDER BY
  end_time DESC
LIMIT 10;

Replace region-us with your specific region.

The following examples cover a few more of the tips, helping you further optimise your work with BigQuery:

Dry Run Queries

You can use the --dry_run flag with the bq command-line tool to estimate query costs without actually running the query (Tip 13):

bq query --dry_run --nouse_legacy_sql "SELECT * FROM mydataset.orders WHERE order_date = '2023-01-01'"

This command will return the estimated bytes processed without incurring any cost or executing the query.
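The dry run reports estimated bytes processed; converting that figure to a dollar estimate is simple arithmetic. A sketch assuming a hypothetical on-demand price of $6.25 per TiB (verify the current price for your region before relying on this):

```python
ASSUMED_PRICE_PER_TIB = 6.25  # USD; an assumption, check current on-demand pricing
TIB = 1024 ** 4

def estimate_query_cost(bytes_processed: int) -> float:
    """Rough on-demand cost for a query, given dry-run bytes processed."""
    return bytes_processed / TIB * ASSUMED_PRICE_PER_TIB

# e.g. a dry run reporting 500 GiB processed:
print(round(estimate_query_cost(500 * 1024 ** 3), 2))
```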

Control Access

Use Identity and Access Management (IAM) to manage access to datasets, tables, and views, ensuring the right people have the right permissions (Tip 19). To manage access, follow these steps in the Google Cloud Console:

  • Navigate to the BigQuery Console.
  • Click on the dataset or table you want to manage access for.
  • Click on the "Share dataset" or "Share table" button.
  • Add members and assign roles according to your access requirements.

Stream Data into BigQuery

Use streaming inserts for real-time data ingestion and analysis (Tip 20). Streaming data into BigQuery can be done using the BigQuery API, client libraries, or third-party tools. Here's an example using the Python client library:

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("mydataset.orders")  # fetches the table and its schema

rows_to_insert = [
    {"order_id": 1, "customer_id": 123, "order_date": "2023-01-01", "status": "completed"},
    {"order_id": 2, "customer_id": 456, "order_date": "2023-01-02", "status": "processing"},
]

errors = client.insert_rows(table, rows_to_insert)
if not errors:
    print("Rows inserted successfully.")
else:
    print(f"Encountered errors: {errors}")

Replace 'mydataset' and 'orders' with your dataset and table names. Note that streaming inserts have some limitations and costs associated with them. Please review the documentation before using this feature.

Federated Data Sources

Query data stored in external sources like Google Sheets or Cloud Storage without loading it into BigQuery (Tip 17). To query data stored in Google Cloud Storage, create an external table:

CREATE EXTERNAL TABLE mydataset.external_orders_csv (
  order_id INT64,
  customer_id INT64,
  order_date DATE,
  status STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://mybucket/orders.csv']
);

Now, you can query the external table just like any other BigQuery table:

SELECT * FROM mydataset.external_orders_csv;

These additional examples demonstrate how you can make use of advanced tips and tricks in BigQuery to simplify your work, optimise query performance, and effectively analyse your data.

We have already covered the majority of the 20 tips and tricks for BigQuery. However, I will provide a few more examples related to the remaining tips that haven't been fully explored:

Storage Density Strategy: Leverage BigQuery's columnar storage format to minimise your storage footprint and reduce costs. When loading data, prioritise compressed formats (Avro or Parquet) and ensure that column data types are precisely defined (e.g., using INT64 instead of STRING for numeric values) to maximise compression efficiency.
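
To see why precise types matter, compare the per-value sizes from BigQuery's documented data size calculation, where an INT64 is a flat 8 bytes and a STRING is 2 bytes plus its UTF-8 length:

```python
def int64_size() -> int:
    # INT64 values occupy a fixed 8 bytes (per BigQuery's data size rules).
    return 8

def string_size(value: str) -> int:
    # STRING values occupy 2 bytes plus the UTF-8 encoded length.
    return 2 + len(value.encode("utf-8"))

# Storing a 13-digit numeric id as STRING vs INT64:
print(string_size("1234567890123"))  # 15 bytes
print(int64_size())                  # 8 bytes
```

Over billions of rows, nearly doubling the per-value footprint adds up quickly in both storage and scan costs.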

Columnar Scoping (avoid SELECT *): Minimise data processing costs by selecting only the necessary columns. Avoiding SELECT * significantly reduces the amount of data read from disk, improving query speed on massive datasets.

Efficient Bulk Loading: Load data in large batches to reduce API call overhead and improve ingestion reliability. The bq command-line tool allows for simultaneous loading of multiple files using glob patterns.
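
The batching idea can be sketched generically: group rows into large chunks and issue one load per chunk rather than one call per row (the chunk size here is arbitrary and would be tuned to your pipeline):

```python
from typing import Iterable, Iterator, List

def batched(rows: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield rows in fixed-size batches to amortise per-call overhead."""
    batch: List[dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch

rows = [{"order_id": i} for i in range(2500)]
batches = list(batched(rows, 1000))
print([len(b) for b in batches])  # [1000, 1000, 500]
```

Three load calls instead of 2,500 row-level calls is the whole point of batch ingestion.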

Strategic Result Persistence: Save intermediate query results into permanent destination tables to facilitate iterative analysis without re-computing expensive joins or aggregations.

By optimising your data storage, selecting only the necessary columns, loading data in batches, and persisting query results in destination tables, you'll improve performance, reduce costs, and streamline your data processing tasks.

As we have covered most of the advanced tips and tricks for working with BigQuery, I will now provide some general advice to further improve your experience with the platform.

Automated Result Caching: Take advantage of BigQuery's automatic caching to eliminate redundant compute costs for frequently executed reports and dashboards.

To force BigQuery to bypass the cache and execute the query, you can use the bq command-line tool:

bq query --nouse_legacy_sql --nouse_cache "SELECT * FROM mydataset.orders WHERE order_date >= '2023-01-01'"

Alternatively, you can disable caching for a specific query by unchecking the "Use cached results" option in the BigQuery Console.

BigQuery Consumption Monitoring: Proactively track usage and costs by configuring alerts within Google Cloud Monitoring. Monitoring slot usage and byte consumption ensures that your data pipelines remain within budget.

Partition Pruning and Adaptive Clustering: Use partitioning and clustering to minimise the scan volume. This architectural pattern ensures that BigQuery only processes relevant data segments, leading to near-instantaneous query performance on multi-terabyte datasets.
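
The saving from partition pruning is easy to quantify: with daily partitions, a query filtered to 30 days scans 30 partitions instead of the whole table. A back-of-the-envelope sketch with made-up table sizes:

```python
# Hypothetical table: 2 years of daily partitions, ~5 GiB each.
partition_size_gib = 5
total_partitions = 730
days_queried = 30

full_scan_gib = total_partitions * partition_size_gib  # no filter on the partition column
pruned_scan_gib = days_queried * partition_size_gib    # WHERE on the partition column

print(full_scan_gib, pruned_scan_gib)
print(f"{pruned_scan_gib / full_scan_gib:.1%} of the table scanned")
```

Under on-demand pricing, scanned bytes are billed bytes, so the same ratio applies directly to query cost.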

Incorporating these additional best practices into your BigQuery workflow will improve performance, reduce costs, and streamline your data processing tasks. Continuously explore the new features and improvements BigQuery offers to stay up to date with the platform's capabilities, and always look for opportunities to optimise your work.

As we have already covered numerous advanced tips and tricks for BigQuery, let's focus on some general strategies to stay informed about new features, best practices, and case studies related to BigQuery and other Google Cloud services.

Continuous Release Monitoring: Regularly review BigQuery release notes to identify new performance optimisations and cost-saving features as they are deployed.

Community Intelligence Gathering: Follow primary Google Cloud communication channels to gain insights into emerging best practices and real-world implementation strategies.

Professional Enablement Sessions: Participate in webinars and architectural workshops to learn from BigQuery engineers and industry peers.

Interactive Community Engagement: Join specialised BigQuery forums and Slack communities to share engineering knowledge and solve complex implementation challenges.

Strategic Case Study Analysis: Study organisational case studies to understand how industry leaders overcome scalability and cost challenges using BigQuery.

Implementing these strategies will help you stay informed about the latest developments in BigQuery, improve your skills, and adapt to new features and best practices. Continuously learning and engaging with the BigQuery community will enable you to get the most out of the platform and enhance your data analysis capabilities.