I am asking about this feature:

df.sample(frac=0.5, replace=True, random_state=1)

available as an option upon sampling a DataFrame.

On the pandas reference, it says it is to:

Sample with or without replacement.

What does this mean and what are some uses for this?

JJJ
Akaisteph7
    If you google "Sampling with and without replacement", you'll get many explanations from statistics. For example: https://web.ma.utexas.edu/users/parker/sampling/repl.htm. It's the same idea in pandas. With replacement means you can get the same row repeated in the output. – Dan Jul 18 '19 at 14:02

1 Answer

It controls whether an input row can be drawn more than once, and so appear more than once in the output.

Sample:

import pandas as pd

df = pd.DataFrame({'a': range(10)})

# Here, row 5 is duplicated
print(df.sample(frac=0.5, replace=True, random_state=1))
   a
5  5
8  8
9  9
5  5
0  0

# Here, every row appears at most once
print(df.sample(frac=0.5, replace=False, random_state=1))
   a
2  2
9  9
6  6
4  4
0  0

You can check this related answer:

It controls whether the sample is returned to the sample pool. If you want only unique samples then this should be false.
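
One practical consequence of the "returned to the pool" idea can be sketched as follows: with replacement the sample may be larger than the DataFrame, while without replacement it cannot. This is only a minimal illustration, assuming the same 10-row `df` as above; the names `oversampled` and `raised` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'a': range(10)})

# With replacement, each drawn row goes back into the pool,
# so the sample can be larger than the DataFrame itself.
oversampled = df.sample(n=15, replace=True, random_state=1)
print(len(oversampled))                  # 15
print(oversampled.index.has_duplicates)  # True: 15 draws from 10 rows must repeat

# Without replacement, each row can be drawn at most once,
# so asking for more rows than exist raises a ValueError.
try:
    df.sample(n=15, replace=False, random_state=1)
    raised = False
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")
```

Drawing a sample the same size as the data with `replace=True` is the basic step of bootstrap resampling, which is one common use of this option.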

Akaisteph7
jezrael
    That's not necessarily true, if the dataframe itself contains duplicate rows, you could still get duplicates without replacement. – Dan Jul 18 '19 at 14:01
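
The point in the comment above can be checked with a small sketch. The DataFrame `dup_df` is a hypothetical example that already contains repeated values: sampling without replacement still never picks the same row (index label) twice, but the values in the sample can repeat.

```python
import pandas as pd

# Hypothetical frame whose rows already contain duplicate values.
dup_df = pd.DataFrame({'a': [1, 1, 1, 2]})

# Draw 3 of the 4 rows without replacement.
sample = dup_df.sample(n=3, replace=False, random_state=0)

# No row was drawn twice, so the index labels are unique.
print(sample.index.is_unique)          # True

# Only two distinct values exist, so 3 rows must contain a repeated value.
print(sample['a'].duplicated().any())  # True
```

If unique values are what you need, one option is to call `drop_duplicates()` on the DataFrame before sampling.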