
I have 32 years of data that I want to put into a partitioned table. However BigQuery says that I'm going over the limit (4000 partitions).

For a query like:

CREATE TABLE `deleting.day_partition`
PARTITION BY FlightDate 
AS 
SELECT *
FROM `flights.original` 

I'm getting an error like:

Too many partitions produced by query, allowed 2000, query produces at least 11384 partitions

How can I get over this limit?

Felipe Hoffa

2 Answers


Instead of partitioning by day, you could partition by week/month/year.
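For example, the month-granularity variant of the same technique (a sketch using the question's table names; the target table name `deleting.month_partition` is illustrative). With 32 years of daily data, this produces roughly 384 month partitions, well under the limit:

```sql
CREATE TABLE `deleting.month_partition`
PARTITION BY FlightDate_month
AS
SELECT *, DATE_TRUNC(FlightDate, MONTH) AS FlightDate_month
FROM `flights.original`
```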

In my case each year contains around 3 GB of data, so partitioning by year (32 partitions in total) and relying on clustering for finer pruning gives me the most benefit.

For this, I'll create a year date column, and partition by it:

CREATE TABLE `fh-bigquery.flights.ontime_201903`
PARTITION BY FlightDate_year
CLUSTER BY Origin, Dest 
AS
SELECT *, DATE_TRUNC(FlightDate, YEAR) AS FlightDate_year
FROM `fh-bigquery.flights.raw_load_fixed`

Note that I created the extra column DATE_TRUNC(FlightDate, YEAR) AS FlightDate_year in the process.

Table stats:

[screenshot: table stats]

Since the table is clustered, I'll get the benefits of pruning even when I filter on FlightDate directly instead of on the partitioning column (FlightDate_year):

SELECT *
FROM `fh-bigquery.flights.ontime_201903`
WHERE FlightDate BETWEEN '2008-01-01' AND '2008-01-10'

Predicted cost: 83.4 GB
Actual cost: 3.2 GB
  • Too bad it doesn't help when filtering on a flight date or range e.g. drill-down from BI tool, as partition pruning is not supported on DATE_TRUNC() - see https://issuetracker.google.com/issues/139939454 – yan-hic Sep 10 '19 at 12:07
  • Would you write a new stack overflow question for this? – Felipe Hoffa Sep 10 '19 at 13:34
  • This does not seem to work anymore in 2022. When I tried to do the same, it did not allow me to partition by a new extra column (similar to FlightDate_year) that I was adding part of the new table creation. It kept asking for DATE_TRUNC to be after PARTITION BY – Avi Jun 17 '22 at 09:56
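Regarding the last comment: BigQuery later added time-unit column partitioning, which expects the DATE_TRUNC expression directly in the PARTITION BY clause, so the extra column is no longer needed. A sketch of the equivalent statement under that newer syntax (untested against current quotas):

```sql
CREATE TABLE `fh-bigquery.flights.ontime_201903`
PARTITION BY DATE_TRUNC(FlightDate, YEAR)
CLUSTER BY Origin, Dest
AS
SELECT *
FROM `fh-bigquery.flights.raw_load_fixed`
```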

As an alternative example, I created a NOAA GSOD summary table clustered by station name; instead of partitioning by day, I didn't partition it at all.

Let's say I want to find the hottest days since 1980 for all stations with a name like SAN FRAN%:

SELECT name, state,
  ARRAY_AGG(STRUCT(date, temp) ORDER BY temp DESC LIMIT 5) top_hot,
  MAX(date) active_until
FROM `fh-bigquery.weather_gsod.all` 
WHERE name LIKE 'SAN FRANC%'
AND date > '1980-01-01'
GROUP BY 1,2
ORDER BY active_until DESC


Note that I got the results after processing only 55.2MB of data.

The equivalent query on the source tables (without clustering) processes 4GB instead:

# query on non-clustered tables - too much data compared to the other one
SELECT name, state,
  ARRAY_AGG(STRUCT(CONCAT(a.year,a.mo,a.da), temp) ORDER BY temp DESC LIMIT 5) top_hot,
  MAX(CONCAT(a.year,a.mo,a.da)) active_until
FROM `bigquery-public-data.noaa_gsod.gsod*` a
JOIN `bigquery-public-data.noaa_gsod.stations` b
ON a.wban=b.wban AND a.stn=b.usaf
WHERE name LIKE 'SAN FRANC%'
AND _table_suffix >= '1980'
GROUP BY 1,2
ORDER BY active_until DESC

I also added a geo clustered table, to search by location instead of station name. See details here: https://stackoverflow.com/a/34804655/132438
