I am trying to get the difference between two DataFrames. I want to sample some records, remove them from the original DataFrame, and keep them as a separate DataFrame. I followed the approach explained in "Comparing two dataframes and getting the differences":
import pandas as pd

train_abusive = pd.read_csv('train_abusive.csv', low_memory=False)
train_non_abusive = pd.read_csv('train_non_abusive.csv', low_memory=False)
print(len(train_abusive), len(train_non_abusive))
# Sample the validation rows from each set
val_abusive = train_abusive.sample(frac=0.1)
val_non_abusive = train_non_abusive.sample(frac=0.2)
# Concatenate the sample back on and drop every row that occurs more than once
train_abusive = pd.concat([val_abusive, train_abusive], ignore_index=True)
train_abusive = train_abusive.drop_duplicates(keep=False)
train_non_abusive = pd.concat([val_non_abusive, train_non_abusive], ignore_index=True)
train_non_abusive = train_non_abusive.drop_duplicates(keep=False)
print(len(train_abusive), len(train_non_abusive))
It gives the following output:
50000 200000
44596 155010
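But the math doesn't work out: removing 10% of 50000 rows and 20% of 200000 rows should leave 45000 and 160000 rows, not 44596 and 155010. I am not sure why.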
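
To illustrate what I expected, here is a minimal sketch on toy data. The column name `text`, its values, and the `random_state` are made up for illustration and are not from my real files. It compares the concat/`drop_duplicates(keep=False)` approach with simply dropping the sampled rows by index. I suspect that rows which are already duplicated in the original data may matter here, but I have not confirmed that on my files.

```python
import pandas as pd

# Hypothetical toy data: 10 rows, where the last two rows are exact duplicates.
df = pd.DataFrame({"text": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "i"]})

# Hold out 20% of the rows (2 rows); random_state is only here for repeatability.
val = df.sample(frac=0.2, random_state=0)

# Approach from the question: concat the sample back on, then drop every row
# whose values occur more than once (keep=False removes all copies).
train_concat = pd.concat([val, df], ignore_index=True).drop_duplicates(keep=False)

# Index-based alternative: remove the sampled rows by their index labels,
# which does not depend on the row values being unique.
train_by_index = df.drop(val.index)

print(len(df), len(val), len(train_concat), len(train_by_index))
```

With the index-based version the remaining count is always the original length minus the sample size, while the concat version can come out smaller whenever the original frame contains repeated rows.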