6

Below is code where I read a csv file and take a random sample of 700 rows from it. I need to do this on multiple files, but if I iterate over the files, the sample (as it is random) will be different for each file, whereas I want to keep it the same once it's randomly generated.

import pandas as pd

df = pd.read_csv("file.csv", delim_whitespace=True)
df_s = df.sample(n=700)

My idea is to take the sampled row numbers and pass them to the next file, but this does not seem very elegant.
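For reference, the row-number idea could look roughly like the sketch below. The two DataFrames are hypothetical stand-ins for the files, and drawing the positions with `np.random.default_rng` from the shortest length is an assumption of this sketch, not something from the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for two files of different lengths (both >= 750 rows).
df1 = pd.DataFrame({"a": np.arange(1000)})
df2 = pd.DataFrame({"b": np.arange(800)})

# Draw 700 distinct row positions once, bounded by the shortest frame,
# so every position is valid in every file.
min_len = min(len(df1), len(df2))
rng = np.random.default_rng(123)
positions = rng.choice(min_len, size=700, replace=False)

# Reuse the same positions for each frame.
df_s1 = df1.iloc[positions]
df_s2 = df2.iloc[positions]
```

Note that rows beyond the shortest file's length can never be picked this way.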

Do you know any good solutions to this issue?

CLARIFICATION

The file lengths are different, but there is a minimum file length: 750.

DESIRED OUTCOME EXAMPLE

df1 = pd.read_csv("file1.csv", delim_whitespace=True)
df_s1 = df1.sample(n=700)  # choose random sample

df2 = pd.read_csv("file2.csv", delim_whitespace=True)
df_s2 = df2.sample(n=700)  # use same random sample as above
Newskooler

2 Answers

8

I think you can use the random_state parameter of sample, but it only picks the same rows if all the files have the same number of rows, so add the nrows parameter to read_csv:

df = pd.read_csv("file.csv", delim_whitespace=True, nrows=750)
df_s = df.sample(n=700, random_state=123)
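A quick way to check this is the sketch below. It uses two made-up DataFrames instead of real files, and `.iloc[:750]` stands in for what `nrows=750` does on read; both are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for two files of different lengths.
df1 = pd.DataFrame({"a": np.arange(1000)})
df2 = pd.DataFrame({"b": np.arange(900) * 10})

# Keep only the first 750 rows of each, mimicking nrows=750 in read_csv.
df1_head = df1.iloc[:750]
df2_head = df2.iloc[:750]

# Same length and same random_state, so the same row positions are drawn.
df_s1 = df1_head.sample(n=700, random_state=123)
df_s2 = df2_head.sample(n=700, random_state=123)

same_rows = df_s1.index.equals(df_s2.index)
print(same_rows)
```

If the two frames had different lengths, the same random_state would generally not select the same positions.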
jezrael
  • How can the `np.random.seed(123)` help, as I don't see it being assigned anywhere? Could you please elaborate? Also, `.sample` has an option `random_state`, but I am not sure what it does. – Newskooler Jul 19 '17 at 13:04
  • Okay, could you please show how to use it? I tried both, but I cannot get the same sample when I generate samples from the same data frame or from different files. – Newskooler Jul 19 '17 at 13:09
  • I added a clarification to the question: the lengths, albeit different, have a minimum. These questions make me think it's actually more complex than I thought. – Newskooler Jul 19 '17 at 13:17
0

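Another option would be to set the global NumPy seed with np.random.seed(123).

This has the advantage that it sets the random seed for all pandas functions at once. Note that the seed has to be set again before each sample call, otherwise consecutive calls continue the same random stream and return different samples.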

A more detailed answer can be found here
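A minimal sketch of the global-seed approach, assuming two made-up DataFrames of equal length in place of the real files (the same-length requirement applies here as well):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for two files, already cut to the same length.
df1 = pd.DataFrame({"a": np.arange(750)})
df2 = pd.DataFrame({"b": np.arange(750)})

np.random.seed(123)        # seed the global NumPy generator
df_s1 = df1.sample(n=700)  # no random_state, so the global state is used

np.random.seed(123)        # seed again before the next draw
df_s2 = df2.sample(n=700)

reseeded_match = df_s1.index.equals(df_s2.index)
print(reseeded_match)
```

Because the seed is global, any other NumPy random call made between seeding and sampling would shift the result, which is one reason to prefer passing random_state directly.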

clfaster