0

I am having performance issues with iterrows in on my dataframe as I start to scale up my data analysis.

Here is the current loop that I am using.

for ii, i in a.iterrows():
    for ij, j in a.iterrows():
        if ii != ij:
            if i['DOCNO'][-5:] == j['DOCNO'][4:9]:
                if i['RSLTN1'] > j['RSLTN1']:
                    dl.append(ij)
                else:
                    dl.append(ii)
            elif i['DOCNO'][-5:] == j['DOCNO'][-5:]:
                if i['RSLTN1'] > j['RSLTN1']:
                    dl.append(ij)
                else:
                    dl.append(ii)
c = a.drop(a.index[dl])

The point of the loop is to find 'DOCNO' values that are different in the dataframe but are known to be equivalent denoted by the 5 characters that are equivalent but spaced differently in the string. When found I want to drop the smaller number from the associated 'RSLTN1' column. Additionally, my data set may have multiple entries for a unique 'DOCNO' that I want to drop the lower number 'RSLTN1' result.

I was successful running this will small quantities of data (~1000 rows) but as I scale up 10x I am running into performance issues. Any suggestions?

Sample from dataset

In [107]:a[['DOCNO','RSLTN1']].sample(n=5)
Out[107]: 
           DOCNO   RSLTN1
6815  MP00064958  72386.0
218   MP0059189A  65492.0
8262  MP00066187  96497.0
2999  MP00061663  43677.0
4913  MP00063387  42465.0
  • We need to find a vectorized approach. Double loops are very inefficient. – Anton vBR Feb 24 '18 at 20:14
  • Can you give me a small example of two DOCNO that are equivalent and should be dropped. – Anton vBR Feb 24 '18 at 20:17
  • Index 218 is an example. There will be another entry in the 'DOCNO' that will be MP00059189. It is also possible that there will be MP0059189B, MP0059189C, etc. in the dataframe. – Doug Guyer Feb 24 '18 at 20:21

1 Answers1

2

How does this fit you needs?

import pandas as pd

s = '''\
DOCNO   RSLTN1
MP00059189  72386.0
MP0059189A  65492.0
MP00066187  96497.0
MP00061663  43677.0
MP00063387  42465.0'''

# Recreate dataframe
df = pd.read_csv(pd.compat.StringIO(s), sep='\s+')

# Create mask 
# We sort to make sure we keep only highest value
# Remove all non-digit according to: https://stackoverflow.com/questions/44117326/
m = (df.sort_values(by='RSLTN1',ascending=False)['DOCNO']
       .str.extract('(\d+)', expand=False)
       .astype(int).duplicated())

# Apply inverted `~` mask 
df = df.loc[~m]

Resulting df:

        DOCNO   RSLTN1
0  MP00059189  72386.0
2  MP00066187  96497.0
3  MP00061663  43677.0
4  MP00063387  42465.0

In this example the following row was removed:

MP0059189A  65492.0
Anton vBR
  • 18,287
  • 5
  • 40
  • 46