
I have a simple table with 2 columns: UserID and Category, and each UserID can repeat with a few categories, like so:

UserID   Category
------   --------
1         A
1         B
2         C
3         A
3         C
3         B

I want to "dummify" this table: i.e. to create an output table that has a unique column for each Category consisting of dummy variables (0/1 depending on whether the UserID belongs to that particular Category):

UserID    A  B  C
------    -- -- --
1         1  1  0
2         0  0  1
3         1  1  1

My problem is that I have THOUSANDS of categories (not just 3 as in this example), so this cannot be efficiently accomplished with CASE WHEN statements.

So my questions are:

1) Is there a way to "dummify" the Category column in Google BigQuery without writing thousands of CASE WHEN statements?

2) Is this a situation where the UDF functionality works well? It seems like it would be, but I am not familiar enough with UDFs in BigQuery to solve this problem. Would someone be able to help out?

Thanks.

wubr2000
  • It might not be a good idea to have thousands of columns in a single BigQuery table. The hard limit is 10,000 columns, but it's best not to get into the thousands. What is the original problem you are solving? – Mosha Pasumansky Dec 01 '15 at 00:05
  • @MoshaPasumansky I am fine with 10,000 categories. I'm dealing with demographic categories here (~5000 unique categories). I want/need to do it in BigQuery because if I were to dummify (after downloading the table from BigQuery) in Python/R, with millions of rows (UserIDs) it would either take an intolerably long time or crash. So would you happen to know how to solve this in BigQuery? – wubr2000 Dec 01 '15 at 00:16

1 Answer

You can use the technique below.

First, run query #1. It produces the query (query #2) that you need to run to get the result you want. Please still consider Mosha's comments before going "wild" with thousands of categories :o)

Query #1:

SELECT 'select UserID, ' + 
   GROUP_CONCAT_UNQUOTED(
    'sum(if(category = "' + STRING(category) + '", 1, 0)) as ' + STRING(category)
   ) 
   + ' from YourTable group by UserID'
FROM (
  SELECT category 
  FROM YourTable  
  GROUP BY category
)

The result will be a query like the one below (query #2):

SELECT
  UserID,
  SUM(IF(category = "A", 1, 0)) AS A,
  SUM(IF(category = "B", 1, 0)) AS B,
  SUM(IF(category = "C", 1, 0)) AS C
FROM
  YourTable
GROUP BY
  UserID

Of course, for three categories you could do it manually, but for thousands it will definitely make your day!

The result of query #2 will look as you expect:

UserID  A   B   C    
1       1   1   0    
2       0   0   1    
3       1   1   1    
Mikhail Berlyant
  • This is exactly what I'm trying to avoid - having to write thousands of these lines - either CASE WHEN or SUM IF! – wubr2000 Dec 01 '15 at 01:05
  • read my answer again - you DO NOT need to write it manually - just run query #1 and it will produce it for you! – Mikhail Berlyant Dec 01 '15 at 01:06
  • Sorry about that! Let me see if I can use this to make it work. @MikhailBerlyant. – wubr2000 Dec 01 '15 at 01:09
  • Sorry for the misunderstanding earlier. I see what you are doing here. Quite clever. But basically I'd still need to paste in a query (result from your query #1) with thousands of lines. I was hoping for a more "elegant"/efficient way so that it can be done in a single query (with UDFs for example) without having to paste in thousands of lines which can be easier for someone else to read and reproduce. @MikhailBerlyant – wubr2000 Dec 01 '15 at 01:38
  • i doubt there is a single query approach for this particular one and still think that in your case above approach is relatively clean and simple for anyone to use - run-copy-paste-run. it does exactly what you asked for in your question :o) – Mikhail Berlyant Dec 01 '15 at 01:42
  • Thanks again for your answer Mikhail. Again, it's quite clever. I hazard a guess that with UDF, it might be possible to do dummify using a single query. But I'm still waiting for someone who knows UDF in BigQuery to either confirm or disabuse me of this notion. @MikhailBerlyant. – wubr2000 Dec 01 '15 at 02:34
  • Sure. I am using and quite familiar with BigQuery UDF so will try also to check it again for this task. tomorrow :o) – Mikhail Berlyant Dec 01 '15 at 02:37
  • Thanks alot Mikhail!. @MikhailBerylant. – wubr2000 Dec 01 '15 at 02:41
  • @wubr2000 - from your comment in http://stackoverflow.com/questions/34798244/how-to-transpose-rows-to-columns-with-large-amount-of-the-data-in-bigquery-sql I have realised you were not able to implement above suggestion because of "Resources exceeded during query execution" error. Please see http://stackoverflow.com/questions/34845697/how-to-scale-pivoting-in-bigquery for my detailed recommendations – Mikhail Berlyant Jan 18 '16 at 01:29
  • Hi Mikhail Berlyant, in your example, the value of each entry is either 1 or 0, in my case, I have to convert a list of weighted edges (user-item rating) into a adjacency matrix, can you give me an example of how to do that? – Charles Chow Jun 15 '16 at 01:05
  • Hi @CharlesChow, this is actually not my example but rather the example presented in the question. I would recommend you post your specific question along with your example and I (or someone else) will try to answer it – Mikhail Berlyant Jun 15 '16 at 01:11
  • Hi Mikhail, I posted my question here: https://stackoverflow.com/questions/37825142/build-adjacency-matrix-from-list-of-weighted-edges-in-bigquery thanks – Charles Chow Jun 15 '16 at 02:32
  • Is this deprecated in Bigquery now? I can't use + between strings. – Arel Lin Dec 02 '19 at 05:57
  • @ArelLin - this answer was for BigQuery Legacy SQL. For Standard SQL you should use CONCAT() – Mikhail Berlyant Dec 02 '19 at 14:18
  • @MikhailBerlyant For StandardSQL, what do you mean by you should use CONCAT(), is it as simple as replacing GROUP_CONCAT_UNQUOTED with CONCAT and replacing all the + with a comma, and removing STRING() functions. Would be great if you could update for StandardSQL or at least clarify this. – Jas Jan 09 '21 at 09:47
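Since the last comments ask about Standard SQL: below is a hedged sketch of the same query-generation idea translated to Standard SQL, assuming the same `YourTable` with `UserID` and `category` columns. In Standard SQL, string concatenation with `+` becomes `CONCAT()`, and the `GROUP_CONCAT_UNQUOTED` aggregate is replaced by `STRING_AGG`. With BigQuery scripting, the generated query can even be executed in the same script via `EXECUTE IMMEDIATE`, avoiding the manual copy-paste step:

```sql
-- Sketch only: assumes YourTable has columns UserID and category,
-- and that category values are valid column names (like A, B, C above).
-- Generates query #2 in Standard SQL and runs it in one script.
EXECUTE IMMEDIATE (
  SELECT CONCAT(
    'SELECT UserID, ',
    STRING_AGG(
      CONCAT('SUM(IF(category = "', category, '", 1, 0)) AS ', category)
    ),
    ' FROM YourTable GROUP BY UserID'
  )
  FROM (SELECT DISTINCT category FROM YourTable)
);
```

If the category values contain characters that are not legal in column identifiers, they would need to be sanitized (or wrapped in backticks) before being used as aliases.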