
I am writing some tests and I am using Pandas DataFrames to house a large dataset (~600,000 x 10). I have extracted 10 random rows from the source data (using Stata) and now I want to write a test to check whether those rows are in the DataFrame in my test suite.

As a small example:

import numpy as np
import pandas as pd

np.random.seed(2)
raw_data = pd.DataFrame(np.random.rand(5, 3), columns=['one', 'two', 'three'])
random_sample = raw_data.iloc[1]

Here raw_data is the 5 x 3 frame of random floats created above, and random_sample is row 1 of that frame, taken so that a match is guaranteed.

Currently I have written:

for idx, row in raw_data.iterrows():
    if random_sample.equals(row):
        print("match")
        break

This works, but it is very slow on the large dataset. Is there a more efficient way to check whether an entire row is contained in the DataFrame?

BTW: my test also needs to treat np.NaN values as equal, which is why I am using the equals() method.
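
For example, NaN never compares equal to itself with ==, while equals() treats NaNs in the same positions as equal (a small self-contained sketch):

import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan])
b = pd.Series([1.0, np.nan])
print((a == b).all())  # False: the NaN positions compare unequal
print(a.equals(b))     # True: equals() treats aligned NaNs as equal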

sanguineturtle

2 Answers

4

equals doesn't seem to broadcast, but we can always do the equality comparison manually:

>>> df = pd.DataFrame(np.random.rand(600000, 10))
>>> sample = df.iloc[-1]
>>> %timeit df[((df == sample) | (df.isnull() & sample.isnull())).all(1)]
1 loops, best of 3: 231 ms per loop
>>> df[((df == sample) | (df.isnull() & sample.isnull())).all(1)]
              0         1         2         3         4         5         6  \
599999  0.07832  0.064828  0.502513  0.851816  0.976464  0.761231  0.275242   

               7        8         9  
599999  0.426393  0.91632  0.569807  

which is much faster than the iterative version for me (which takes > 30s.)

But since we have lots of rows and relatively few columns, we could loop over the columns, and in the typical case probably cut down substantially on the number of rows to be looked at. For example, something like

def finder(df, row):
    for col in df:
        df = df.loc[(df[col] == row[col]) | (df[col].isnull() & pd.isnull(row[col]))]
    return df

gives me

>>> %timeit finder(df, sample)
10 loops, best of 3: 35.2 ms per loop

which is roughly an order of magnitude faster, because after the first column there's only one row left.

(I think I once had a much slicker way to do this but for the life of me I can't remember it now.)

DSM
  • the finder column iteration is the fastest. On my data I get ~1.63s per loop as I am testing for many random samples. My filtering approach came in at ~1.72s per loop. Simple Broadcasting comes in at 4.14s per loop – sanguineturtle Jul 30 '14 at 04:32
  • @sanguineturtle: if you're doing multiple searches you could experiment with using `isin` to cull or sorting the frame and then searching inside that. (Sorting might be too costly if you're only doing a handful of searches, though.) – DSM Jul 30 '14 at 04:37 (see the isin sketch after these comments)
  • I was revisiting this code and spotted a small error I think regarding the finder() function above. While it checks each value - this actually returns a "matched" row which is based on a match with the last column values only? – sanguineturtle Mar 12 '15 at 06:49
  • @sanguineturtle: I'm not sure what you mean. If you change the row values you're searching for so one of them doesn't match, `finder` will return an empty frame. – DSM Mar 12 '15 at 14:17
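
For reference, a rough sketch of the isin-culling idea from the comment above; it assumes a hypothetical samples DataFrame holding all the rows being searched for, a culling column that contains no NaNs, and it reuses finder() from this answer:

def cull_then_find(df, samples, col=0):
    # keep only the rows whose value in column `col` appears among the samples,
    # then run the column-by-column finder on the much smaller candidate set
    candidates = df[df[col].isin(samples[col])]
    return [finder(candidates, row) for _, row in samples.iterrows()]

Whether the extra cull pays off depends on how selective the chosen column is.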
0

The best I have come up with is a filtering approach, which seems to work quite well and avoids a lot of comparisons when the dataset is large:

tmp = raw_data
for idx, val in random_sample.items():
    try:
        if np.isnan(val):
            continue  # skip NaN columns so NaN counts as a match
    except TypeError:
        pass  # non-numeric value; fall through and compare normally
    tmp = tmp[tmp[idx] == val]
if len(tmp) == 1:
    print("match")

Note: this is actually slower for the small example above, but on the large dataset it is ~9 times faster than the basic iteration.

sanguineturtle