Shuffle column in Google's BigQuery based on groupby

Question

I want to randomly shuffle the values for one single column of a table based on a groupby. E.g., I have two columns A and B. Now, I want to randomly shuffle column B based on a groupby on A.

For an example, suppose that there are three distinct values in A. Now for each distinct value of A, I want to shuffle the values in B, but just with values having the same A.

Example input:

A       B       C          
-------------------
1       1       x   
1       3       a
2       4       c
3       6       d
1       2       a
3       5       v

Example output:

A       B       C         
------------------
1       3       x   
1       2       a
2       4       c
3       6       d
1       1       a
3       5       v

In this case, for A=1 the values for B got shuffled. The same happened for A=2, but as there is only one row it stayed like it was. For A=3 by chance the values for B also stayed like they were. The values for column C stay as they are.

Maybe this can be solved by using window functions, but I am unsure how exactly.

As a side note: This should be achieved in Google's BigQuery.

The problem is that it should be a shuffle, so a sampling without replacement. The random ordering would be random for each iteration though; or am I wrong here? — fsociety, Sep 21 '15 at 18:24
I forgot to mention that I am using Google's BigQuery. As I operate on a lot of data and do some direct follow-up queries on the shuffled data, I want to directly do it in the db. — fsociety, Sep 21 '15 at 18:38
So, which is it? Oracle or Google BigQuery? You have both tags, and your title says Oracle. — sstan, Sep 21 '15 at 19:09
Well Google BigQuery uses Oracle based syntax, anyways someone removed the oracle tag anyways already. — fsociety, Sep 21 '15 at 19:15
I've never heard of Google BigQuery having Oracle-based syntax. It is based on SQL though. — sstan, Sep 21 '15 at 19:19

Ditto · Accepted Answer · 2015-09-21T18:51:06.337

2

Is this what you're after ? (you tagged with both Mysql and Oracle .. so I answer here using Oracle) [edit] corrected based on confirmed logic [/edit]

  with w_data as (
        select 1 a, 1 b from dual union all
        select 1 a, 3 b from dual union all
        select 2 a, 4 b from dual union all
        select 3 a, 6 b from dual union all
        select 1 a, 2 b from dual union all
        select 3 a, 5 b from dual
     ),
     w_suba as (
        select a, row_number() over (partition by a order by dbms_random.value) aid
          from w_data
        ),
     w_subb as (
        select a, b, row_number() over (partition by a order by dbms_random.value) bid
          from w_data
        )
  select sa.a, sb.b
    from w_suba  sa,
         w_subb  sb
   where sa.aid = sb.bid
     and sa.a = sb.a
  /


           A          B
  ---------- ----------
           1          3
           1          1
           1          2
           2          4
           3          6
           3          5

  6 rows selected.

  SQL> /

           A          B
  ---------- ----------
           1          3
           1          1
           1          2
           2          4
           3          5
           3          6

  6 rows selected.

  SQL>

Logic breakdown:

1) w_data is just your sample data set ...

2) randomize column a (not really needed, you could just rownum this, and let b randomize ... but I do so love (over)using dbms_random :) heh )

3) randomize column b - (using partition by analytic creates "groups" .. order by random radomizes the items within each group)

4) join them ... using both the group (a) and the randomized id to find a random item within each group.

by doing the randomize this way you can ensure that you get the same # .. ie you start with one "3" .. you end with one "3" .. etc.

edited Sep 21 '15 at 18:51

answered Sep 21 '15 at 18:31

Ditto

3,256
1
14
28

1

The values should only be shuffled for a single column based on a groupby. In your results the values of B are shuffled across groups. – fsociety Sep 21 '15 at 18:44
oh ok .. let me tweak that ;) – Ditto Sep 21 '15 at 18:48
Added a bit more explanation as well - hope it works better now ? ;) – Ditto Sep 21 '15 at 18:53
Okay thanks for the tweak. Unfortunately, ```dbms_random.value``` does not work in Google's BigQuery in which I operate (I edited my original post mentioning that). Your solution seems correct though. I hope someone else has another idea. – fsociety Sep 21 '15 at 18:55
Yeah saw your update :) not familiar with Google's BigQuery, so see what else might work in it's place ? sorry can't help outside of Oracle ;) – Ditto Sep 21 '15 at 18:57
This is a very good answer, and may not require too much tweaking to get it to work with Google BigQuery. Docs suggest that Googe BigQuery uses `RAND()` instead of Oracle's `dbms_random.value`. May be worth a try. – sstan Sep 21 '15 at 19:17
Gravy, @sstan ... post a new answer with the correction and get credit for it :) I'm cool ... I'll leave mine here as a base/reference ;) – Ditto Sep 21 '15 at 19:19
```RAND()``` does not work inside a window function though. A workaround could be to specify random as a column first as suggested in http://stackoverflow.com/questions/23375456/random-sampling-in-google-bigquery – fsociety Sep 21 '15 at 19:21
Basically, I think Ditto's query solved the hard part, and he deserves full credit for it. I'm sure that you'll find the way to adjust it for Google BigQuery, but those should be minor details. Good luck. – sstan Sep 21 '15 at 19:23

score 1 · Answer 2 · answered Sep 21 '15 at 20:23

I feel below should work in BigQuery

SELECT
 x.A as A, x.B as Old_B, x.c as C, y.B as New_B
FROM (
  SELECT A, B, C, 
  ROW_NUMBER() OVER(PARTITION BY A ORDER BY B, C) as pos
  FROM [your_table]
) as x
JOIN (
  SELECT 
    A, B, ROW_NUMBER() OVER(PARTITION BY A ORDER BY rnd) as pos
  FROM (
      SELECT A, B, RAND() as rnd
      FROM [your_table]
  )
) as y
ON x.A = y.A AND x.pos = y.pos

Shuffle column in Google's BigQuery based on groupby

2 Answers2