Partitioning versus Clustering in BigQuery
Both Clustering and Partitioning are used to improve storage layouts and hence, overall query performance.
BigQuery - Partitioning versus Clustering
High Cardinality (> 1000) --> Use Clustering
Low Cardinality (<1000) --> Use Partitioning
Cardinality refers to the number of unique values for a column. (e.g. Months of the year - cardinality 12)
Prefer clustering over partitioning under the following circumstances:
- Partitioning results in a small amount of data per partition (approximately less than 1 GB).
- Partitioning results in a large number of partitions beyond the limits on partitioned tables.
- Partitioning results in your mutation operations modifying most partitions in the table frequently (for example, every few minutes).
Combine the two
You can also combine partitioning with clustering. Data is first partitioned and then data in each partition is clustered by the clustering columns.
When the table is queried, partitioning sets an upper bound of the query cost based on partition pruning.
There might be other query cost savings when the query actually runs, because of cluster pruning.
Leave a Reply