
I am new to Python and I'm trying to produce a result similar to Excel's INDEX/MATCH function with Python & Pandas, though I'm struggling to get it working.

Basically, I have 2 separate DataFrames:

The first DataFrame ('market') has 7 columns, though I only need 3 of those columns for this exercise ('symbol', 'date', 'close'). This df has 13,948,340 rows.

The second DataFrame ('transactions') has 14 columns, though I only need 2 of those columns ('i_symbol', 'acceptance_date'). This df has 1,428,026 rows.

My logic is: If i_symbol is equal to symbol and acceptance_date is equal to date: print symbol, date & close. This should be easy.

I have achieved it with iterrows(), but because of the size of the dataset it only produces one result every 3 minutes, which means I would have to run the script for about 1,190 hours to get the final result.

Based on what I have read online, itertuples should be a faster approach, but I am currently getting an error:

ValueError: too many values to unpack (expected 2)

This is the code I have written (which currently produces the above ValueError):

for i_symbol, acceptance_date in transactions.itertuples(index=False):
  for symbol, date in market.itertuples(index=False):
    if i_symbol == symbol and acceptance_date == date:
        print(market.symbol + market.date + market.close)

2 questions:

  1. Is itertuples() the best/fastest approach? If so, how can I get the above working?
  2. Does anyone know a better way? Would indexing work? Should I use an external db (e.g. mysql) instead?

Thanks, Matt

mattblack
  • Using for-loops with `pandas` / `numpy` is a last-ditch approach that essentially removes all the advantages of these libraries. There may be a better approach, but I am having a hard time understanding what you are trying to accomplish. Your approach here is the brute-force approach, and it's no wonder it is taking forever. Please read [this](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and consider supplying a small, self-contained example of an input with an expected output. On the other hand, it does sound like a use-case for a relational db. – juanpa.arrivillaga Mar 31 '17 at 22:10
  • Simply merge: `pd.merge(market[['symbol', 'date', 'close']], transactions[['i_symbol', 'acceptance_date']], left_on=['symbol', 'date'], right_on=['i_symbol', 'acceptance_date'])`. No need for looping. – Parfait Apr 01 '17 at 01:42

2 Answers


Regarding question 1: DataFrame.itertuples() yields one namedtuple per row. You can either unpack these like standard tuples or access the fields by name:

for t in transactions.itertuples(index=False):
    for m in market.itertuples(index=False):
        if t.i_symbol == m.symbol and t.acceptance_date == m.date:
            # print the matching row's values instead of concatenating them
            print(m.symbol, m.date, m.close)

(I did not test this with DataFrames of your size, but I'm pretty sure it's still painfully slow.)
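
To make the unpacking error concrete, here is a minimal sketch with made-up data (the symbols, dates and column names beyond yours are hypothetical): itertuples() returns one namedtuple per row containing every column, so unpacking into exactly two names fails as soon as the DataFrame has more than two columns, while attribute access always works:

import pandas as pd

# Hypothetical two-row frame with three columns
df = pd.DataFrame({'i_symbol': ['AAA', 'BBB'],
                   'acceptance_date': ['2017-03-01', '2017-03-02'],
                   'extra': [1, 2]})

# This would raise "ValueError: too many values to unpack (expected 2)":
# for i_symbol, acceptance_date in df.itertuples(index=False): ...

# Accessing fields by name works regardless of the number of columns
for row in df.itertuples(index=False):
    print(row.i_symbol, row.acceptance_date)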

Regarding question 2: You can simply merge both data frames on symbol and date.

Rename the columns of your "transactions" DataFrame so that it also has columns named "symbol" and "date":

transactions = transactions[['i_symbol', 'acceptance_date']]
transactions.columns = ['symbol','date']

Then merge both DataFrames on symbol and date:

result = pd.merge(market, transactions, on=['symbol','date'])

The resulting DataFrame contains one row for each symbol/date combination that exists in both DataFrames. The operation takes only a few seconds on my machine with DataFrames of your size.
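
For reference, a runnable sketch of the rename-and-merge approach on tiny, made-up DataFrames (the symbols, dates and prices are hypothetical, just to show the shape of the result):

import pandas as pd

market = pd.DataFrame({
    'symbol': ['AAPL', 'AAPL', 'MSFT'],
    'date':   ['2017-03-01', '2017-03-02', '2017-03-01'],
    'close':  [143.7, 144.1, 65.9],
})
transactions = pd.DataFrame({
    'i_symbol':        ['AAPL', 'MSFT'],
    'acceptance_date': ['2017-03-02', '2017-03-01'],
})

# Keep only the key columns and rename them to match 'market'
transactions = transactions[['i_symbol', 'acceptance_date']]
transactions.columns = ['symbol', 'date']

# Inner join: keeps only the symbol/date pairs present in both frames
result = pd.merge(market, transactions, on=['symbol', 'date'])
print(result)
#   symbol        date  close
# 0   AAPL  2017-03-02  144.1
# 1   MSFT  2017-03-01   65.9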

acidtobi

@Parfait provided the best answer as a comment (reproduced below). Very clean, and it worked incredibly fast - thank you.

pd.merge(market[['symbol', 'date', 'close']],
         transactions[['i_symbol', 'acceptance_date']],
         left_on=['symbol', 'date'],
         right_on=['i_symbol', 'acceptance_date'])

No need for looping.
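
One small caveat, sketched under the assumption that the column names are as above: because left_on and right_on differ, the merged result carries both sets of key columns (symbol/date and i_symbol/acceptance_date). The duplicates can be dropped afterwards if only symbol, date and close are wanted:

result = pd.merge(market[['symbol', 'date', 'close']],
                  transactions[['i_symbol', 'acceptance_date']],
                  left_on=['symbol', 'date'],
                  right_on=['i_symbol', 'acceptance_date'])

# Drop the duplicated key columns that came from 'transactions'
result = result.drop(['i_symbol', 'acceptance_date'], axis=1)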

mattblack