
I have two different dataframes which I am trying to compare. My first dataframe has 10 rows and the second has 2,000. What I'm trying to do is compare the first row of my first df to all 2,000 rows in the other df, then do the same for the next row in my first df.

Here is the code I currently have. It works fine through the first 2,000 comparisons, then crashes when i should be incremented.

i = 1
j = 1
for u in userFrame.iterrows():
    for d in dbFrame.iterrows():
        if userFrame['tag'][i] == dbFrame['tag1'][j]:
            print('Found one!:' + userFrame['tag'][i])
        j += 1
    i += 1

Edit: here are the errors I'm getting:

  File "C:\Users\david\Desktop\CC Project\test.py", line 158, in Analyze
    if userFrame['tag'][i] == dbFrame['tag1'][j]:
  File "C:\Python34\lib\site-packages\pandas\core\series.py", line 557, in __getitem__
    result = self.index.get_value(self, key)
  File "C:\Python34\lib\site-packages\pandas\core\index.py", line 1790, in get_value
    return self._engine.get_value(s, k)
  File "pandas\index.pyx", line 103, in pandas.index.IndexEngine.get_value (pandas\index.c:3204)
  File "pandas\index.pyx", line 111, in pandas.index.IndexEngine.get_value (pandas\index.c:2903)
  File "pandas\index.pyx", line 157, in pandas.index.IndexEngine.get_loc (pandas\index.c:3843)
  File "pandas\hashtable.pyx", line 303, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6525)
  File "pandas\hashtable.pyx", line 309, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6463)
KeyError: 1644
David Schuler

2 Answers


Consider a cross join pandas merge between both data frames, resulting in 10 x 2,000 records where each record of the smaller dataset is matched with every record of the larger dataset. The cross join is a special SQL query that returns the cartesian product (all possible combinations of sets) between two tables; essentially, a query with no join clauses.

However, in pandas you will first need to create a key column in both dataframes and rename the tag columns so the two are easy to tell apart in the merged result. From there you can return a data frame of matching records and avoid the nested for loop. And since you only need the first row, you can slice the dataframe during the cross join operation for a 1 x 2,000 result.

import pandas as pd

userFrame['key'] = 1
dbFrame['key'] = 1

# RENAME THE TAG COLUMNS SO THEY ARE EASY TO TELL APART AFTER THE MERGE
userFrame = userFrame.rename(columns = {'tag':'tag_u'})
dbFrame = dbFrame.rename(columns = {'tag1':'tag_d'})

# CROSS JOIN MERGE (FIRST ROW OF USER DF, ALL ROWS OF DB DF)
crossjoindf = pd.merge(userFrame[:1], dbFrame, on='key')[['tag_u', 'tag_d']]

# MATCHING RECORDS
matchingdf = crossjoindf[crossjoindf['tag_u'] == crossjoindf['tag_d']]
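
If you need every one of the 10 user rows compared rather than just the first, a minimal sketch of the same idea (reusing the key and renamed columns above; fullcrossdf and allmatchesdf are just illustrative names) simply drops the slice:

# CROSS JOIN MERGE (ALL ROWS OF USER DF, ALL ROWS OF DB DF) -> 10 x 2,000 RECORDS
fullcrossdf = pd.merge(userFrame, dbFrame, on='key')[['tag_u', 'tag_d']]

# MATCHING RECORDS ACROSS EVERY PAIRING
allmatchesdf = fullcrossdf[fullcrossdf['tag_u'] == fullcrossdf['tag_d']]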
Parfait

I don't know if this will fix the problem, but your code has several unpythonic qualities, and fixing them may at least give you more helpful errors.

First, it is never necessary to maintain sidecar integers like i and j. You could use enumerate in the worst case, but pandas already gives you an index that does this for you: if you look at the output of iterrows() after calling reset_index() on your dfs, the integer counters you want are built in. SOLUTION: unpack iterrows(), though you don't need the integers at all anyway.

Second, pandas can look things up for you! You don't have to write the inner iteration yourself. SOLUTION: use .loc

Third, you're using "chained indexing", which is never a good idea in pandas. SOLUTION: use .loc

# index dbFrame by tag1 so matching becomes a simple label lookup
dbFrame = dbFrame.reset_index().set_index('tag1')
for i, u in userFrame['tag'].iteritems():
    try:
        # .loc searches the 2,000 rows for us
        u2 = dbFrame.loc[u, 'col_of_interest']
        print('Found one!: {} = {}'.format(u, u2))
    except KeyError:
        # this user tag has no match in dbFrame
        pass

No goofy integers, no double iteration. The above code has 10 iterations rather than 20,000. Obviously you need to specify the column of interest.
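
For illustration, here is a minimal, self-contained sketch of the same approach using made-up frames; the sample tags are purely hypothetical and the 'name' column stands in for col_of_interest:

import pandas as pd

userFrame = pd.DataFrame({'tag': ['a', 'b', 'c']})
dbFrame = pd.DataFrame({'tag1': ['b', 'x', 'c', 'y'],
                        'name': ['B', 'X', 'C', 'Y']})

# index dbFrame by tag1 so each user tag becomes a direct label lookup
dbFrame = dbFrame.reset_index().set_index('tag1')

for i, u in userFrame['tag'].iteritems():
    try:
        u2 = dbFrame.loc[u, 'name']  # 'name' plays the role of col_of_interest
        print('Found one!: {} = {}'.format(u, u2))
    except KeyError:
        pass

# prints:
# Found one!: b = B
# Found one!: c = C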

William Welsh