I'd like to create a random sub-sample of my data.
- Spark's `sample` function (link) is the API I'd like to use, particularly because it lets me toggle whether sampling is done with or without replacement. However, executing this function takes a long time. Based on the answers to the question "Spark sample is too slow", it seems that `sample` requires a full table scan.
- TABLESAMPLE seems like a faster alternative, although the ability to toggle between sampling with and without replacement is lost.
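
For concreteness, here is a minimal sketch of the two approaches I'm comparing, in PySpark. The helper names, the `events` view, and the 10% fraction are made up for illustration; the only real Spark pieces are `DataFrame.sample(withReplacement, fraction, seed)` and the SQL `TABLESAMPLE (n PERCENT)` clause.

```python
def sample_with_dataframe_api(df, fraction, with_replacement=False, seed=None):
    """Approach 1: DataFrame.sample, which exposes the withReplacement toggle.

    `df` is expected to be a pyspark.sql.DataFrame (hypothetical helper).
    """
    return df.sample(withReplacement=with_replacement, fraction=fraction, seed=seed)


def tablesample_query(table_name, percent):
    """Approach 2: build a Spark SQL query using TABLESAMPLE.

    There is no with/without replacement option in this clause (hypothetical helper).
    """
    return f"SELECT * FROM {table_name} TABLESAMPLE ({percent} PERCENT)"


def demo():
    """Run both approaches on a toy table; needs a local PySpark install."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("sampling-demo").getOrCreate()
    df = spark.range(1000)                    # stand-in for my real data
    df.createOrReplaceTempView("events")      # "events" is an assumed view name

    via_sample = sample_with_dataframe_api(df, 0.1, with_replacement=True, seed=42)
    via_tablesample = spark.sql(tablesample_query("events", 10))

    # Both return roughly 10% of the rows; exact counts vary between runs.
    print(via_sample.count(), via_tablesample.count())
    spark.stop()
```

Calling `demo()` runs both variants on a toy table; in my actual job the input is a large table rather than `spark.range(1000)`.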
I'd like to understand how `sample` and TABLESAMPLE differ, and why TABLESAMPLE executes faster than `sample`. Could it be that TABLESAMPLE does not require a full table scan?