
I have a DataFrame that I need to split into two samples, where neither sample contains rows from the other. I can run two separate sample operations, but this does not guarantee that the two samples are disjoint:

df.sample(frac=0.8)
df.sample(frac=0.2)

I have also tried the following, but it raises ValueError: cannot compute isin with a duplicate axis.

df1 = df.sample(frac=0.8)
df[~df.isin(df1).all(1)]

What can be done to achieve this split?
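
For context, the ValueError suggests that the index (or the columns) of the sampled frame contains duplicate labels, since DataFrame.isin with a DataFrame argument requires unique axes. One possible workaround, sketched below under that assumption, is to give every row a unique label with reset_index and then drop the sampled labels. The example frame, its duplicated index, and the random_state value are made up for illustration and are not from the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a deliberately duplicated index, the situation
# in which DataFrame.isin(other_frame) raises the ValueError above.
df = pd.DataFrame(
    np.arange(20).reshape(10, 2),
    columns=list('AB'),
    index=[0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
)

# Give every row a unique label so that dropping by label is unambiguous.
df_unique = df.reset_index(drop=True)

# Sample 80% of the rows; the seed is an arbitrary choice for repeatability.
df_80 = df_unique.sample(frac=0.8, random_state=0)

# The remaining 20% is everything whose label was not sampled.
df_20 = df_unique.drop(df_80.index)

print(len(df_80), len(df_20))
```

Note that reset_index(drop=True) discards the original labels; if they matter, reset_index() without drop keeps them as a column.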


piRSquared's edit

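import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(200).reshape(100, 2), columns=list('AB'))

# Shuffle all rows once, then slice by position so the two parts cannot overlap.
n_80pct = df.shape[0] // 5 * 4
df_sampled = df.sample(frac=1)
df_80 = df_sampled.iloc[:n_80pct]
df_20 = df_sampled.iloc[n_80pct:]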
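
The same shuffle-then-slice idea can be wrapped in a small helper. This is only a sketch: the function name split_by_position and its frac and seed parameters are hypothetical and not part of pandas. Because it slices by position with iloc, it should not depend on the index labels being unique.

```python
import numpy as np
import pandas as pd


def split_by_position(df, frac=0.8, seed=None):
    """Shuffle the rows once, then cut the shuffled frame at `frac` of its length.

    Hypothetical helper for illustration; `seed` is passed to
    DataFrame.sample as random_state to make the shuffle repeatable.
    """
    n_first = int(round(len(df) * frac))
    shuffled = df.sample(frac=1, random_state=seed)
    # Positional slices of one shuffled frame can never share a row.
    return shuffled.iloc[:n_first], shuffled.iloc[n_first:]


df = pd.DataFrame(np.arange(200).reshape(100, 2), columns=list('AB'))
df_80, df_20 = split_by_position(df, frac=0.8, seed=42)

print(df_80.shape, df_20.shape)
```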
  • @piRSquared - Thanks for pointing out the possible duplicate, which will solve my problem. But isn't there a pure pandas solution? – Amrith Krishna Aug 28 '16 at 05:39
  • My apologies for the inconvenience. However, in my defense, this question is exactly answered by the referenced question and answers. If you wanted a pure pandas answer, showing more of a research effort could have uncovered the other answer, and you could then have been clearer about why that question and answer did not suit your needs. All that said, I still feel a little bit bad, so I've edited your question and included an answer that is pure pandas. – piRSquared Aug 28 '16 at 06:31
  • @piRSquared - Thanks a lot for the effort and passion you put into answering questions. I should have done a bit more homework before posting the question. – Amrith Krishna Aug 28 '16 at 08:51
