I am writing some tests and I am using Pandas DataFrames to house a large dataset, roughly 600,000 x 10. I have extracted 10 random rows from the source data (using Stata), and now I want to write a test to see if those rows are in the DataFrame in my test suite.
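As a small example:
import numpy as np
import pandas as pd

np.random.seed(2)
raw_data = pd.DataFrame(np.random.rand(5, 3), columns=['one', 'two', 'three'])
# Take the row labelled 1 (.loc replaces the removed .ix indexer)
random_sample = raw_data.loc[1]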
Here raw_data is a 5 x 3 DataFrame of random floats, and random_sample is one of its rows (row 1), so a match is guaranteed.
Currently I have written:
# Compare the sample against every row until one matches
for idx, row in raw_data.iterrows():
    if random_sample.equals(row):
        print("match")
        break
This works, but it is very slow on the large dataset. Is there a more efficient way to check if an entire row is contained in the DataFrame?
BTW: My example also needs to treat np.NaN values as equal when comparing, which is why I am using the equals() method.
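To illustrate the NaN point, here is a minimal sketch. The Series `a` and `b` are made up for illustration and are not taken from my data. It shows why a plain element-wise `==` comparison is not enough for rows that contain NaN, while `equals()` does what I need:

```python
import numpy as np
import pandas as pd

# Two hypothetical rows with identical values, including a NaN in the same position
a = pd.Series([1.0, np.nan, 3.0], index=['one', 'two', 'three'])
b = pd.Series([1.0, np.nan, 3.0], index=['one', 'two', 'three'])

# Element-wise == follows IEEE rules: NaN is never equal to NaN,
# so the rows look different even though they hold the same data.
elementwise_match = bool((a == b).all())

# Series.equals() treats NaNs in the same location as equal.
equals_match = a.equals(b)

print(elementwise_match, equals_match)
```

So any faster approach would need to keep this NaN-aware behaviour rather than rely on `==` alone.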