2

I have a table in BigQuery. I have a certain string column which represents a unique id (uid). I want to filter only a sample of this table, by taking only a portion of the uids (let's say 1/100). So my idea is to sample the data by doing something like this:

if(ABS(HASH(uid)) % 100 == 0) ...

The problem is this will actually filter in 1/100 ratio only if the distribution of the hash values is uniform. So, in order to check that, I would like to generate the following table:

(n goes from 0 to 99)

0    <number of rows in which uid % 100 == 0>
1    <number of rows in which uid % 100 == 1>
2    <number of rows in which uid % 100 == 2>
3    <number of rows in which uid % 100 == 3>

.. etc.

If I see the numbers in each row are of the same magnitude, then my assumption is correct.

Any idea how to create such a query, or alternatively do the sampling another way?

Yaniv Donenfeld
  • 565
  • 2
  • 8
  • 26

2 Answers2

4

Something like

Select ABS(HASH(uid)) % 100 as cluster , count(*) as cnt 
From yourtable 
Group each by cluster 

the UID is of different cases (upper, lower) and types you can use some string manipulation within the hash. something like:

 Select ABS(HASH(upper(string(uid)))) % 100 as cluster , count(*) as cnt 
From yourtable 
Group each by cluster 
N.N.
  • 3,094
  • 21
  • 41
3

As an alternative to HASH(), you can try RAND() - it doesn't depend on the ids being uniformly distributed.

For example, this would give you 10 roughly equally sized partitions:

SELECT word, INTEGER(10*RAND()) part
FROM [publicdata:samples.shakespeare]

Verification:

SELECT part, COUNT(*) FROM (
    SELECT word, INTEGER(10*RAND()) part
    FROM [publicdata:samples.shakespeare]
)
GROUP BY part
ORDER BY part

Each group ends up with about 16465 elements.

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
  • And you can even put the RAND part in the WHERE clause: SELECT * FROM table WHERE RAND() < 0.01 – Jeremy Condit Oct 21 '14 at 06:00
  • RAND is great for a random portion of the data. But if I look for consistency (Having the same record in the same partition every time I query the same table), would it still work? – N.N. Oct 21 '14 at 06:36
  • Thanks all! However, as a followup on this, I found that I will not be able to use hash() or rand(). The reason is that the same session id appears in multiple types, and I need to filter a certain id from all types. This means that I need a function that given the same session id will produce the same integer number (on which I will sample by doing % 100) in order for the filtering to be deterministic for a given id in any table. – Yaniv Donenfeld Oct 21 '14 at 06:56
  • @FelipeHoffa I tried the first query and got `Error: Encountered " "PARTITION" "partition "" at line 1, column 35. Was expecting: ` – ffflabs Jan 27 '15 at 18:10
  • PARTITION is now a reserved keyword - I fixed the query now! – Felipe Hoffa Jan 27 '15 at 22:30