
I was calling .sample with random_state set to a constant, and after using set_index it started selecting different rows: a member that was previously included in the subset was dropped. I'm unsure how seeding selects rows. Is this expected, or did something go wrong?

Here is what was done:

df.set_index('id',inplace=True, verify_integrity=True)

df_small_F = df.loc[df['gender']=='F'].apply(lambda x: x.sample(n=30000, random_state=47))

df_small_M = df.loc[df['gender']=='M'].apply(lambda x: x.sample(n=30000, random_state=46))

df_small=pd.concat([df_small_F,df_small_M],verify_integrity=True)

When I sort df_small by index and print it, I get different results on each run.

desertnaut
Jon
  • Can you share parts of your code please? – markuscosinus Mar 26 '19 at 15:04
  • Yes, the dataframe is being read in and not created anywhere else. I have prints of the shapes of df as a check along the way. – Jon Mar 26 '19 at 15:14
  • I'm not sure I understand. With set_index I'm changing the index to use the 'id' column as the value. The .sort_index should be sorting by 'id', right? The issue I'm running into is that .sample is choosing different rows every time I rerun the data, including pulling it from the source. Nothing is changing. My question is, is sample not using the index but some other measure to choose the rows based on the seed? – Jon Mar 26 '19 at 15:28
  • In your example, the only arguments sample uses are the length of the sampled `df`, `random_state` and `n`. If those don't change, the rows it selects will not change, regardless of index. The behavior you find is not how it should behave and I cannot reproduce your issue, so there is likely an error unrelated to `sample`. Please provide a [mcve] with sample data that reproduces the issue, and likely when trying to do so you may uncover the issue in your code. https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples shows how to give good examples with data. – ALollz Mar 26 '19 at 15:42
  • Issue was solved by another party. The ordering of the data being read in was changing with each run and sorting it by the index before performing the sampling fixed that. Question was about how seeding worked related to .sample(), not just sampling arguments. Changing the ordering affected .sample() and I assume it's how random_state selects rows, which was my main question. The rows it selected did change, despite the arguments of .sample() not changing, like I was asking about. – Jon Mar 26 '19 at 15:59
  • Jon, there's absolutely no way for us to know that you had changed the ordering of your data beforehand. I still stand by my point, that this issue is caused by something outside of your provided data, and that **sample** chooses the exact same row (by integer array index), regardless of the data. However, because your row organization changes, the same integer array index selects different data. – ALollz Mar 26 '19 at 16:08
  • My question was how random_state with a seed works. Answering that it chooses the xth row, and not by index, suffices to solve the issue. I provided where the issue occurred in the context of my code. Not knowing how random_state works made it harder to provide full context, which is why that was the bulk of my question and the sort issue was not part of it. – Jon Mar 26 '19 at 16:15

2 Answers

0

Applying .sort_index() after reading in the data and before performing .sample() corrected the issue. As long as the data remains the same, this will produce the same sample every time.
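The fix above can be sketched as follows. This is a minimal illustration, not the original pipeline: the frame `base`, the two shuffled copies `run_a` and `run_b`, and the sample size of 50 are all made up here to stand in for the same data arriving in a different row order on each run.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the same 1,000 records, delivered in two different row orders.
base = pd.DataFrame({"id": np.arange(1000), "value": np.arange(1000) * 2})
run_a = base.sample(frac=1, random_state=1).set_index("id")  # row order of "run A"
run_b = base.sample(frac=1, random_state=2).set_index("id")  # row order of "run B"

# Without sorting, the same seed is applied to frames whose rows are ordered differently.
unsorted_a = run_a.sample(n=50, random_state=47)
unsorted_b = run_b.sample(n=50, random_state=47)

# Sorting by index first gives both runs an identical row order before sampling.
sorted_a = run_a.sort_index().sample(n=50, random_state=47)
sorted_b = run_b.sort_index().sample(n=50, random_state=47)

same_without_sort = unsorted_a.index.equals(unsorted_b.index)
same_with_sort = sorted_a.index.equals(sorted_b.index)
print(same_without_sort, same_with_sort)
```

Sorting only helps when the index is unique and the data itself is unchanged between runs; with duplicate index values the order within ties could still differ.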

Jon
0

When sampling rows (without weights), the only things that matter are the seed, n, the number of rows in the DataFrame, and whether you sample with replacement. Together these generate the .iloc positions to take, regardless of the data or the index.

For rows, sampling occurs as:

axis_length = self.shape[0]  # DataFrame length

rs = pd.core.common.random_state(random_state)
locs = rs.choice(axis_length, size=n, replace=replace, p=weights)  # np.random.choice
return self.take(locs, axis=axis, is_copy=False)

Just to illustrate the point

Sample Data

import pandas as pd
import numpy as np

n = 100000
np.random.seed(123)
df = pd.DataFrame({'id': list(range(n)), 'gender': np.random.choice(['M', 'F'], n)})
# A scalar 'M' is broadcast to every row; a one-element list would raise a length error
df1 = pd.DataFrame({'id': list(range(n)), 'gender': 'M'},
                    index=np.random.choice(['foo', 'bar', np.nan], n)).assign(blah=1)

Sampling will always choose row 42083 (integer array index), i.e. df.iloc[42083], for this seed and length:

df.sample(n=1, random_state=123)
#          id gender
#42083  42083      M

df1.sample(n=1, random_state=123)
#        id gender  blah
#foo  42083      M     1

df1.reset_index().shift(10).sample(n=1, random_state=123)
#      index       id gender  blah
#42083   nan  42073.0      M   1.0

Even numpy:

np.random.seed(123)
np.random.choice(df.shape[0], size=1, replace=False)
#array([42083])
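The same point can be checked directly by recording each row's integer position before sampling. This is a small sketch under made-up assumptions: `frame_a`, `frame_b`, the helper column `pos`, and the sizes are hypothetical and only serve to show that two frames of equal length yield the same positions but different data.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: same length, but the ids are stored in opposite order.
n_rows = 1000
frame_a = pd.DataFrame({"id": np.arange(n_rows)})
frame_b = pd.DataFrame({"id": np.arange(n_rows)[::-1]})

# Record each row's integer position so it can be recovered after sampling.
frame_a["pos"] = np.arange(n_rows)
frame_b["pos"] = np.arange(n_rows)

picked_a = frame_a.sample(n=5, random_state=47)
picked_b = frame_b.sample(n=5, random_state=47)

# Same seed, n, and length: identical positions, but the ids found there differ.
same_positions = picked_a["pos"].tolist() == picked_b["pos"].tolist()
same_ids = picked_a["id"].tolist() == picked_b["id"].tolist()
print(same_positions, same_ids)
```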
ALollz
  • Sampling with a random seed also depends on the order. Like you said, it will always choose the 42083rd row. This is what my question was about. The data's order changed when read in so the 42083rd row changed. Sorting fixed the issue. I was unsure how random_state seeding worked in the context. – Jon Mar 26 '19 at 16:10
  • @Jon Yes, the sample is based on the underlying array indices as I have shown. It has nothing to do with the actual DataFrame index (which would be problematic if it were duplicated, for instance). So when your data isn't consistently sorted it still samples the same row by `.iloc`, but this row has potentially different information than the prior sample. – ALollz Mar 26 '19 at 16:18