0

I have asked a previous question about doing a multi-level aggregation query on the X-axis here: Get the top patent countries, codes in a BQ public dataset.

Here is how the query (copied from the accepted answer works) to get:

Top 2 Countries by Count, and within those countries, top 2 Codes by Count

WITH A AS (
  SELECT country_code
  FROM `patents-public-data.patents.publications`
  GROUP BY country_code
  ORDER BY COUNT(1) DESC
  LIMIT 2
), B AS (
  SELECT
    country_code,
    application_kind,
    COUNT(1) application_kind_count
  FROM `patents-public-data.patents.publications`
  WHERE country_code IN (SELECT country_code FROM A)
  GROUP BY country_code, application_kind
), C AS (
  SELECT
    country_code,
    application_kind,
    application_kind_count,
    DENSE_RANK() OVER(PARTITION BY country_code ORDER BY application_kind_count DESC) AS application_kind_rank
  FROM B
)
SELECT
  country_code,
  application_kind,
  application_kind_count
FROM C
WHERE application_kind_rank <= 2  

And I get something like:

country_code     application_kind   count
JP               A                  125
JP               U                  124
CN               A                  118
CN               U                  101

Now I would like to add the following pivot on the y-axis: to get the following:

  • X: Top 2 Countries by Count, and within those countries, top 2 Codes by Count
  • Y: Top 2 family_id by Count, Top 2 priority_date by Count

The final results would then look like:

enter image description here

I am able to build the Y-query in a second query --

WITH A AS (
  SELECT family_id
  FROM `patents-public-data.patents.publications`
  GROUP BY family_id
  ORDER BY COUNT(1) DESC
  LIMIT 2
), B AS (
  SELECT
    family_id,
    priority_date,
    COUNT(1) priority_date_count
  FROM `patents-public-data.patents.publications`
  WHERE family_id IN (SELECT family_id FROM A)
  GROUP BY family_id, priority_date
), C AS (
  SELECT
    family_id,
    priority_date,
    priority_date_count,
    DENSE_RANK() OVER(PARTITION BY family_id ORDER BY priority_date_count DESC) AS priority_date_rank
  FROM B
)
SELECT
  family_id,
  priority_date,
  priority_date_count
FROM C
WHERE priority_date_rank <= 2 

However, I am not quite sure how to merge them together, in a single query or in two.

halfer
  • 19,824
  • 17
  • 99
  • 186
David542
  • 104,438
  • 178
  • 489
  • 842
  • so it will be sort of intersection of X and Y - correct? if so - should be simple case – Mikhail Berlyant May 30 '19 at 01:58
  • @MikhailBerlyant yes, that's correct, without knowing the pivot values beforehand (i.e., you have to do an aggregation query to get the "TOP" values) – David542 May 30 '19 at 02:04

1 Answers1

0

Below is for BigQuery Standard SQL and is just demo of the approach and not pretending to be 100% representing requested logic

WITH A_X AS (
  SELECT country_code FROM `patents-public-data.patents.publications`
  GROUP BY country_code ORDER BY COUNT(1) DESC LIMIT 2
), B_X AS (
  SELECT country_code, application_kind, COUNT(1) application_kind_count
  FROM `patents-public-data.patents.publications` WHERE country_code IN (SELECT country_code FROM A_X)
  GROUP BY country_code, application_kind
), C_X AS (
  SELECT country_code, application_kind, application_kind_count,
    DENSE_RANK() OVER(PARTITION BY country_code ORDER BY application_kind_count DESC) AS application_kind_rank
  FROM B_X
), X AS (
  SELECT country_code, application_kind, application_kind_count
  FROM C_X WHERE application_kind_rank <= 2  
), A_Y AS (
  SELECT family_id FROM `patents-public-data.patents.publications` 
  JOIN X USING(country_code, application_kind)
  GROUP BY family_id 
  ORDER BY COUNT(1) DESC LIMIT 2
), B_Y AS (
  SELECT family_id, priority_date, COUNT(1) priority_date_count
  FROM `patents-public-data.patents.publications` WHERE family_id IN (SELECT family_id FROM A_Y) 
  GROUP BY family_id, priority_date
), C_Y AS (
  SELECT family_id, priority_date, priority_date_count,
    DENSE_RANK() OVER(PARTITION BY family_id ORDER BY priority_date_count DESC) AS pos_date
  FROM B_Y
), Y AS (
  SELECT family_id, priority_date, pos_date, DENSE_RANK() OVER(ORDER BY family_id) pos_family
  FROM C_Y WHERE pos_date <= 2 
)
SELECT country_code, application_kind,
  COUNTIF(pos_family = 1 AND pos_date = 1) `family1_date1`,
  COUNTIF(pos_family = 1 AND pos_date = 2) `family1_date2`,
  COUNTIF(pos_family = 2 AND pos_date = 1) `family2_date1`,
  COUNTIF(pos_family = 2 AND pos_date = 2) `family2_date2`
FROM `patents-public-data.patents.publications` 
JOIN Y USING(family_id, priority_date)
WHERE country_code IN (SELECT country_code FROM X)
AND application_kind IN (SELECT application_kind FROM x) 
GROUP BY country_code, application_kind 

the result is

enter image description here

Obviously, there are number of zeroes above because of intersection logic

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • looks good. One alteration, however, instead of it saying "family1_date1", could we have it be the actual "family1" value and "date1" value? I can go ahead and award the bounty afterwards. Nice work! – David542 Jun 01 '19 at 02:36
  • This is possible only with dynamic SQL where in first step you generate whole query with needed aliases in output and then in second step you actually execute such generated query. There are many examples of such approach here on SO. You can check my recent answers - I remember I had similar answer day or fea days ago – Mikhail Berlyant Jun 01 '19 at 02:41
  • I see, maybe just post a link to that answer and show very briefly how the query would be different if we needed to get the dynamic column names, and then I'll accept the answer? – David542 Jun 02 '19 at 00:12
  • @David542 - this is one of the latest [answer](https://stackoverflow.com/a/55861331/5221944) - there is link there to yet another previous answer – Mikhail Berlyant Jun 04 '19 at 00:42