7

For example, if I have a column called companyId and many other columns I want to keep, and in companyId I have values like 100, 101, 102, ..., basically a list of Ids and each Id appear different number of times. How do I randomly sample data based on the companyId column so that it's according to the proportion of each Id?

eg: If I have 500 rows and 100 companyA, 100 companyB and 300 companyC and I want to sample 100 rows from this table. How do I make my data have 20 companyA, 20 companyB and 60 companyC?

Alexis Wilke
  • 19,179
  • 10
  • 84
  • 156
Keyang Zhang
  • 123
  • 2
  • 10
  • already answered in stackoverflow please refer https://stackoverflow.com/questions/1124603/grouped-limit-in-postgresql-show-the-first-n-rows-for-each-group – Pradeep Kumar Mar 02 '18 at 04:01

1 Answers1

16

You should use the TABLESAMPLE clause available since PostgreSQL 9.5, described in the SELECT documentation.

There are two built-in table sampling methods: SYSTEM and BERNOULLI. Both try to get a random sample, but the SYSTEM method picks table blocks and uses the data therein, while BERNOULLI uses a random sample from the whole table. SYSTEM is faster, but BERNOULLI offers a more evenly distributed sample.

You have to specify the percentage of the rows that you want to get.
If you want a fifth of your table, you would use:

SELECT * FROM mytable TABLESAMPLE BERNOULLI (20);

If your requirement is to get a certain number of result rows, you might consider using the tsm_system_rows contrib module, which offers the SYSTEM_ROWS table sample method that works similar to SYSTEM, but you can specify the number of result rows.

Laurenz Albe
  • 209,280
  • 17
  • 206
  • 263
  • does this answer the question? The OP wants to know how to sample "20 companyA, 20 companyB and 60 companyC?" – Vlad Nov 02 '20 at 14:25
  • 1
    You are right, it is an approximation. I assumed that these numbers were just for illustration and represent an estimate rather than hard requirements (note that the numbers are proportional to the frequency with which the values occur in the table). – Laurenz Albe Nov 02 '20 at 14:29
  • thanks for that note, it does help understanding how tablesample works. – Vlad Nov 02 '20 at 15:54