0

I'd like to get create a random sub-sample of my data.

  1. Spark's sample function (link) is the API I'd like to use. Particularly, because it allows me to toggle if the sampling is done with or without replacement. However, executing this function takes a long time. Based on the answers from this question Spark sample is too slow, it seems like sample requires a full table scan.
  2. TABLESAMPLE seems like a faster alternative, albeit, the ability to toggle with and without replacement is lost.

I'd like to understand how sample and TABLESAMPLE are different, and why does TABLESAMPLE execute faster than sample. Could it be that TABLESAMPLE does not require a full table scan?

Daniel Tan
  • 135
  • 1
  • 2
  • 10

1 Answers1

1

TABLESAMPLE has three ways of sampling:

  • percentage (or fraction): under the hood, does the same thing as sample. It is used to create uniform sampling.
  • num_rows: under the hood, does the same thing as LIMIT, which is why this API is very fast.
  • bucket OUT OF: specifies the portion out of the total to sample.

This is stated on the documentation:

Always use TABLESAMPLE (percent PERCENT) if randomness is important. TABLESAMPLE (num_rows ROWS) is not a simple random sample but instead is implemented using LIMIT.

So the answer whether sample and TABLESAMPLE are the same thing is no, but TABLESAMPLE used with percentage (fraction) and sample are the same thing.

If you want to read more about this, Databricks has some good information about this.

vilalabinot
  • 1,420
  • 4
  • 17
  • Thank you for your answer. I acknowledge that `sample` is different from `TABLESAMPLE` but my question is trying to understand what exactly is the difference, especially as it relates to query plans/stuff under the hood. I observe that TABLESAMPLE is much faster and I have no clue why. – Daniel Tan Sep 19 '22 at 09:38
  • `TABLESAMPLE` with fraction is not faster than `sample`, but `TABLESAMPLE` with a positive integer as argument is, but the sampling is created by taking first X rows of the dataset, meaning that no full table scan is necessary, as there is no need to create a uniform sampling. From the documentation: `LIMIT`: Returns a new Dataset by taking the first n rows. -> taking first N rows requires no table scan, but the sampling is not proper (we can take first 100 rows without checking the whole dataset). `SAMPLE`: Returns a new Dataset by sampling a fraction of rows – vilalabinot Sep 19 '22 at 11:05
  • @vilabinot thanks, i think your answer sufficiently addresses my question, if you'd like to submit that as an "Answer" I would be happy to mark it as the solution. – Daniel Tan Sep 20 '22 at 08:54