I'm trying to take the data in two different Excel workbooks, each with a single sheet, and merge them together. The first sheet has about 282,000 rows of data; the second has about 13,000 rows, and I'm joining it to the first on a common column. It's a one-to-many join. The code I have works, but it takes about 2.5 hours to run, and I feel like there should be a way to make it more efficient. Here is the code I have:
import pandas as pd

# Read each single-sheet workbook into a DataFrame
df1 = pd.read_excel('file1.xlsx')
df2 = pd.read_excel('file2.xlsx')

# One-to-many join: each df1 row picks up its matching df2 row
final_file = pd.merge(df1, df2, left_on='Owner', right_on='Entity')
final_file.to_excel('file3.xlsx', index=False)
So how can I make this run faster? Should I be using something other than pandas?
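To see where the time actually goes, one option is to time the three steps separately. A minimal sketch, using the file and column names from the snippet above:

import time
import pandas as pd

t0 = time.perf_counter()
df1 = pd.read_excel('file1.xlsx')
df2 = pd.read_excel('file2.xlsx')
print(f'read:  {time.perf_counter() - t0:.1f}s')

t0 = time.perf_counter()
final_file = pd.merge(df1, df2, left_on='Owner', right_on='Entity')
print(f'merge: {time.perf_counter() - t0:.1f}s')

t0 = time.perf_counter()
final_file.to_excel('file3.xlsx', index=False)
print(f'write: {time.perf_counter() - t0:.1f}s')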
EDIT: I think the final_file.to_excel() call is what takes so long. Is there a different or better way to write out the merged data? Maybe writing it to a new sheet in the first workbook?
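If the output has to stay .xlsx, one thing worth trying is xlsxwriter's constant_memory mode, which streams rows to disk instead of holding the whole workbook in memory. A hedged sketch, assuming the xlsxwriter package is installed and a pandas version recent enough to support engine_kwargs (1.3+):

import pandas as pd

# final_file is the merged DataFrame from the code above
with pd.ExcelWriter('file3.xlsx', engine='xlsxwriter',
                    engine_kwargs={'options': {'constant_memory': True}}) as writer:
    final_file.to_excel(writer, index=False)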
df1

Owner  Prop  Decimal
AND15  1031  0.00264
AND15  1032  0.03461
AND16  1037  0.00046

df2

Entity  Address   Fax
AND15   Fake 123  555-555-5555
AND16   Fake 456  555-555-5544

final_file

Owner  Prop  Decimal  Entity  Address   Fax
AND15  1031  0.00264  AND15   Fake 123  555-555-5555
AND15  1032  0.03461  AND15   Fake 123  555-555-5555
AND16  1037  0.00046  AND16   Fake 456  555-555-5544
And so on for the rest of the data. It matches Owner to Entity and appends the df2 columns to the end of each matched row in df1.
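Since the join is strictly many-to-one, pandas can enforce that for you: passing validate='m:1' raises a MergeError if any Entity is duplicated in df2, which would otherwise silently multiply rows. A self-contained sketch using the sample data above (how='left' additionally keeps df1 rows with no match; drop it to reproduce the inner join in the original code):

import pandas as pd

# Rebuild the sample frames from the tables above
df1 = pd.DataFrame({'Owner': ['AND15', 'AND15', 'AND16'],
                    'Prop': [1031, 1032, 1037],
                    'Decimal': [0.00264, 0.03461, 0.00046]})
df2 = pd.DataFrame({'Entity': ['AND15', 'AND16'],
                    'Address': ['Fake 123', 'Fake 456'],
                    'Fax': ['555-555-5555', '555-555-5544']})

# validate='m:1' raises if an Entity repeats in df2
final_file = pd.merge(df1, df2, left_on='Owner', right_on='Entity',
                      how='left', validate='m:1')
print(final_file)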
EDIT 2: Writing the result to .xlsx is indeed the problem, and I suspect the PC is running out of RAM while doing it. final_file.to_csv finishes in under a minute. Lesson learned, I guess.
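For reference, the working end state per EDIT 2 is the same read and merge with only the final write swapped out:

import pandas as pd

df1 = pd.read_excel('file1.xlsx')
df2 = pd.read_excel('file2.xlsx')
final_file = pd.merge(df1, df2, left_on='Owner', right_on='Entity')
final_file.to_csv('file3.csv', index=False)  # under a minute vs. 2.5 hours for .xlsx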
Thanks,
Ethan