
I'm trying to take the data in two different Excel workbooks, each with only one sheet, and join or merge them together. The first has about 282,000 rows of data, and I'm merging a second sheet with about 13,000 rows to it via a common column. It's a one-to-many join. The code I have currently works, but it takes about 2.5 hours to run, and I feel like there should be a way to make it more efficient. Below is the code I have:

import pandas

df1 = pandas.read_excel('file1.xlsx')
df2 = pandas.read_excel('file2.xlsx')

final_file = pandas.merge(df1, df2, left_on='OWNER', right_on='ENTITY')
final_file.to_excel('file3.xlsx', index=False)

So how can I make this run faster? Should I be using something other than pandas?

EDIT: I think what takes so long is the final_file.to_excel call. Is there a different/better way to write the merged data? Maybe writing it to a new sheet in the first workbook?

df1
Owner  Prop    Decimal
AND15  1031    0.00264
AND15  1032    0.03461
AND16  1037    0.00046

df2
Entity  Address    Fax
AND15   Fake 123   555-555-5555
AND16   Fake 456   555-555-5544

final_file
Owner  Prop    Decimal   Entity  Address    Fax
AND15  1031    0.00264   AND15   Fake 123   555-555-5555
AND15  1032    0.03461   AND15   Fake 123   555-555-5555
AND16  1037    0.00046   AND16   Fake 456   555-555-5544

And so on for the rest of the data. It's matching Owner to Entity, then appending the columns from df2 onto the end of the matched rows in df1.

EDIT 2: It seems that trying to write the result to .xlsx is the issue; I guess I'm running out of RAM on the PC. Using final_file.to_csv instead takes less than a minute. Lesson learned, I guess.
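
For anyone else hitting this, here is a minimal sketch of the change that fixed it for me (file and column names are the same placeholders as in the code above): the merge is untouched, only the export switches from to_excel to to_csv.

import pandas

df1 = pandas.read_excel('file1.xlsx')
df2 = pandas.read_excel('file2.xlsx')

final_file = pandas.merge(df1, df2, left_on='OWNER', right_on='ENTITY')

# Writing CSV skips the heavy .xlsx serialization that was eating all the RAM;
# this finishes in under a minute instead of ~2.5 hours.
final_file.to_csv('file3.csv', index=False)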

Thanks,

Ethan

user3712099
  • What's the result of `len(df1), len(df2), len(final_file)`? What's the unique count of the joining columns on both DFs? Are you sure you're not doing a many-to-many join etc...? – Jon Clements Aug 09 '18 at 13:24
  • So the 2nd excel sheet (of ~13k length) is all unique on the joining column. I'm basically taking the data from df2, and finding where the "entity" column matches the "owner" column from df1 and then copying the rest of the data over. len(df1) is 273882, len(df2) is 13098, and len(final_file) is 273882. – user3712099 Aug 09 '18 at 13:41
  • It seems curious that your final one ends up the same length as df1, given a merge is an inner join by default. Do you have a low-spec system where, with that many rows and columns (and potentially large column values, plus the overlapping columns the merge adds), your system is thrashing its swap file? – Jon Clements Aug 09 '18 at 13:46
  • Yeah, that's possible. I'm supposed to be getting a new PC soon, so that idea will be tested soon. So the length isn't changing because I'm not adding new rows, I'm adding new data (columns) to the end of the current rows in df1. I'll try to add an example of what it's doing in the question. – user3712099 Aug 09 '18 at 13:48
  • `to_excel` can be quite a slow operation, particularly with large amounts of data. Could you try writing to a csv instead? – asongtoruin Aug 09 '18 at 14:09
  • Yeah, that's what I ended up doing. Thanks! – user3712099 Aug 09 '18 at 14:23

2 Answers


The below code should take less time to append and export.

1. Append df2 to df1 and then export the result to CSV.

Main_df = df1.append(df2)

Note: remove the header of whichever df you are going to append.

Jitesh Vacheta
  • I'll mark this one as the best answer so far and edit into my question that exporting to csv was much faster. The issue seems to be running out of RAM when trying to write to .xlsx. – user3712099 Aug 09 '18 at 14:07
  • This cannot be the answer, as append is not equivalent to merge; append will extend the number of rows in the original data frame... I would specify that exporting as csv was helpful in reducing processing time – RK1 Aug 09 '18 at 16:42
  • Good point. Unselected the answer and hopefully the edits in my original post will serve to let people know that was the solution to my problem. – user3712099 Aug 09 '18 at 20:11
  • Yeah they are, EDIT2 is particularly helpful! You can answer your own question too... – RK1 Aug 10 '18 at 07:10

It sounds as if the importing and exporting of the data is the bottleneck rather than the merge itself, so that's where I'd focus the optimisation.

Quick test of pandas merge speed using similar lengths:

import time
import pandas as pd
import numpy as np
df1_test = pd.DataFrame({"Col1": np.arange(273882), "Col2": np.arange(273882), "Col3": np.arange(273882)})
df2_test = pd.DataFrame({"Col1": np.arange(13098), "Col2": np.arange(13098), "Col3": np.arange(13098)})

Time merge of dataframes

startTime = time.time()
df3_test = pd.merge(df1_test, df2_test, left_on='Col1', right_on='Col1')
print('The script took {0} second !'.format(time.time() - startTime))

The script took 0.0390000343323 second !

You could try this across the import, merge, and write sections of your code and optimise each section accordingly.
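
For example, here is a rough sketch of how you could time each stage. File and column names are taken from the question, and the timed helper is just something made up for illustration, not part of any library.

import time
import pandas as pd

def timed(label, func):
    # Run func(), print how long it took, and return its result.
    start = time.time()
    result = func()
    print('{0} took {1:.1f} seconds'.format(label, time.time() - start))
    return result

df1 = timed('read file1', lambda: pd.read_excel('file1.xlsx'))
df2 = timed('read file2', lambda: pd.read_excel('file2.xlsx'))
merged = timed('merge', lambda: pd.merge(df1, df2, left_on='OWNER', right_on='ENTITY'))
timed('write csv', lambda: merged.to_csv('file3.csv', index=False))

Whichever stage dominates is the one worth optimising; going by the edits in the question, it will almost certainly be the write step.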

RK1