Optimizing query performance in BigQuery presents large cost-saving opportunities for the engineering team, as well as a speed boost that benefits the business, stakeholders, and developers alike.
Following on from our meeting with the Google team, I was inspired to do some further learning and challenge myself to write a short blog post with my top 5 simple tips for writing better queries.
This is by no means an exhaustive list, or even the best tips. My hope is that this blog post will inspire you (and me) to do further research and practice.
Rather than provide example code which is useful but boring, I've gone with pictures instead which are less useful but also way less boring.
I will also provide a couple of links at the end of the post for those who want to learn more.
Tip 1. Avoid SELECT *
OK, I admit it: I have been guilty of using SELECT * on massive tables, believing that applying a LIMIT clause would reduce the bytes processed. I was wrong — BigQuery bills for every column it scans, and LIMIT does not reduce that.
Best Practice
- Want to explore the data? Use the data preview function instead of SELECT *.
- Query only the columns you need, eg
- SELECT dim1, metric1 FROM … or
- SELECT * EXCEPT (dim2, metric2) FROM …
If you do require every column, consider:
- Materialising results in a destination table and querying that table instead.
- Partitioning your tables and querying the relevant partition.
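To make that concrete, here's a minimal sketch (the table name mydataset.events and its columns are made up for illustration):

```sql
-- Scans only the two named columns, not the whole table
SELECT dim1, metric1
FROM mydataset.events;

-- Returns every column except the two listed (note the parentheses)
SELECT * EXCEPT (dim2, metric2)
FROM mydataset.events;
```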
Tip 2. Table JOIN Order Matters
The aim here is to reduce the dataset as early as possible. I assumed that BigQuery's optimizer would reorder joins automatically, but apparently this is not always the case.
Best Practice
- Manually place the largest table first, followed by the smallest, then the remaining tables by decreasing size.
- You want to reduce your resultset the most at the first join.
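As a sketch of that ordering (all three table names are hypothetical — big with billions of rows, tiny with hundreds, mid with millions):

```sql
-- Largest table first, then the smallest, then the rest by decreasing size,
-- so the first join shrinks the resultset as much as possible
SELECT b.dim1, t.label, m.metric1
FROM mydataset.big AS b
JOIN mydataset.tiny AS t ON b.key = t.key
JOIN mydataset.mid  AS m ON b.key = m.key;
```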
Tip 3. Know when to aggregate early and late.
Aggregation is very costly, so care needs to be taken over when you perform it.
Best Practice
- As a general rule aggregate and perform complex operations as late and as seldom as possible.
- UNLESS a CTE (Common Table Expression) can be reduced drastically by aggregating before joining, then aggregate early! You want to prune data as much as possible before a join.
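Here's a sketch of the "aggregate early" case (table and column names are made up): the CTE collapses the big table to one row per key before the join, so the join touches far fewer rows.

```sql
-- Aggregate the large events table down to one row per user *before* joining
WITH user_totals AS (
  SELECT user_id, SUM(metric1) AS total
  FROM mydataset.events
  GROUP BY user_id
)
SELECT u.name, t.total
FROM mydataset.users AS u
JOIN user_totals AS t ON u.user_id = t.user_id;
```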
Tip 4. When combining WHERE conditions with AND, put the most selective first.
Again, the point is to reduce the data as early as possible.
Best Practice
- Order expressions in your WHERE clause with the most selective expression first. (i.e. the one which will eliminate the most rows)
- Then use expensive expressions like LIKE ‘%text%’ only after you have reduced your dataset.
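A quick sketch of that ordering (the optimizer may still reorder predicates itself, and the table and columns here are invented):

```sql
SELECT dim1
FROM mydataset.events
WHERE event_date = '2023-01-01'   -- most selective: eliminates most rows
  AND description LIKE '%text%';  -- expensive: applied to the survivors
```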
Tip 5. Use Approximate Functions where possible.
I had no idea these even existed. If you don’t need precision greater than ~99%, approximate functions yield much faster query performance.
Best Practice
- Where appropriate, use approximate functions. (e.g. in data exploration)
- Eg use APPROX_COUNT_DISTINCT(dim1) instead of COUNT(DISTINCT dim1)
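Side by side (mydataset.events is a made-up table):

```sql
-- Exact: precise, but slow on very large tables
SELECT COUNT(DISTINCT dim1) FROM mydataset.events;

-- Approximate: much faster, with a small (typically ~1%) error
SELECT APPROX_COUNT_DISTINCT(dim1) FROM mydataset.events;
```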