ID      outcome  Source_doc
23145   A        ARR
23145   A        CRE
23145   B        ARR
23145   C        CRE
23456   B        ARR
23456   B        CRE

ID 23145 from ARR has outcomes [A, B], while ID 23145 from CRE has outcomes [A, C]. Below you can see I would then place ID 23145 in the 'not_same_list'. My data set includes 445,000 rows, and the process I perform takes 21 seconds per 100 rows, so this will take over 7 hours!

What is the slowest part of this loop?

Am I performing the Pandas searches most efficiently?

Would iterrows() be faster?

EDIT: Good point about the expected output. I'm actually just expecting a list of IDs: if the ARR_list and the CRE_list are not identical, I want to flag that ID and put it into a list (not_same_list). So I'm looking for [145, 178, ..., 989, (any ID where outcomes don't match between its Source Docs)].
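For reference, the sample table above can be reconstructed as a small DataFrame (a minimal sketch; the frame name `uniq_IDs` and the column names are taken from the question's code and table):

```python
import pandas as pd

# Hypothetical reconstruction of the sample data shown above
uniq_IDs = pd.DataFrame({
    'ID':         [23145, 23145, 23145, 23145, 23456, 23456],
    'outcome':    ['A', 'A', 'B', 'C', 'B', 'B'],
    'Source_doc': ['ARR', 'CRE', 'ARR', 'CRE', 'ARR', 'CRE'],
})
```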

import time

not_same_list = []
total_search_start_time = time.time()
tick = 0
for IDs in uniq_IDs['ID'].unique():
    # Isolate rows by their ID and source doc
    sco_ARR = uniq_IDs['outcome'][(uniq_IDs['ID'] == IDs) & (uniq_IDs['Source_doc'] == 'ARR')]
    sco_CRE = uniq_IDs['outcome'][(uniq_IDs['ID'] == IDs) & (uniq_IDs['Source_doc'] == 'CRE')]
    # Remove duplicates
    ARR_list = set(sco_ARR.values.tolist())
    CRE_list = set(sco_CRE.values.tolist())

    # Check to see if outcomes match between source docs
    if ARR_list != CRE_list:
        not_same_list.append(IDs)

    # Progress report every 100 IDs
    if str(tick)[-2:] == '00':
        print('The last {} rows have taken {} seconds...'.format(tick, round(time.time() - total_search_start_time, 2)))
    tick += 1

print('The last {} rows have taken {} seconds...'.format(tick, round(time.time() - total_search_start_time, 2)))
print(not_same_list)

If anyone can make a better table for this question please do:

Liam Hanninen
  • I suspect the `drop_duplicates` method would help. –  Dec 28 '16 at 17:42
  • Could you also put down the expected output? – Zero Dec 28 '16 at 17:46
  • In addition to Jack Maney's suggestion, I think a general danger sign in pandas code is the use of a `for` loop. Composing this as vector / matrix operations usually will speed it up. – Paul Dec 28 '16 at 17:46
  • Actually, is the formatting off, here? It seems to me that the for loop you have here is just setting, and then not using, `ARR_list` and `CRE_list` over and over again. I agree with John Galt here, you should provide an expected output so it is more clear what you want this to do. – Paul Dec 28 '16 at 17:49

2 Answers


A pandas idiomatic way of rewriting your for loop would be:

(df.groupby(['ID', 'Source_doc'])['outcome'].apply(frozenset)
   .groupby(level=0).nunique()[lambda x: x == 2].index)

# Int64Index([23145], dtype='int64', name='ID')

The reason your for loop is slow is that you are dealing with unsorted data: you subset the data frame by ID and then by Source_doc, which scans the whole frame once per unique ID/Source_doc combination. groupby() avoids this problem because it sorts the data frame by the group variables once and then deals with it chunk by chunk.
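The chunk-by-chunk idea can be checked step by step on the question's sample data (a sketch; `frozenset` is used so the per-group sets are hashable and can be counted by `nunique()`):

```python
import pandas as pd

# Sample data from the question (hypothetical reconstruction)
df = pd.DataFrame({
    'ID':         [23145, 23145, 23145, 23145, 23456, 23456],
    'outcome':    ['A', 'A', 'B', 'C', 'B', 'B'],
    'Source_doc': ['ARR', 'CRE', 'ARR', 'CRE', 'ARR', 'CRE'],
})

# One pass over the sorted groups: build the outcome set per (ID, Source_doc),
# then count distinct sets per ID; an ID with 2 distinct sets has a mismatch
sets_per_doc = df.groupby(['ID', 'Source_doc'])['outcome'].apply(frozenset)
n_distinct = sets_per_doc.groupby(level=0).nunique()
not_same_list = n_distinct[n_distinct == 2].index.tolist()
```

Each intermediate is an ordinary Series, so you can print `sets_per_doc` and `n_distinct` to see exactly which IDs get flagged.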

To see more about this philosophy, check this answer.

Psidom
  • Thanks for the clarification. I'm going to try this and compare times. – Liam Hanninen Dec 28 '16 at 17:59
  • This actually completed already, so my head is spinning. I'm just verifying the results. It would be amazing to go from 7 hours to 5 minutes. Does this make sense? – Liam Hanninen Dec 28 '16 at 18:09
  • If you have a lot of `IDs`, this makes sense, the number of IDs determines how many times your for loop has to go through the whole data frame and subset it, which can be very time consuming. – Psidom Dec 28 '16 at 18:13
  • Great, well it seems to check out. Thanks! – Liam Hanninen Dec 28 '16 at 18:17

I'd try something like this

d1 = df.groupby(['ID', 'Source_doc']).outcome.apply(set).unstack()
print(d1)

Source_doc     ARR     CRE
ID                        
23145       {B, A}  {A, C}
23456          {B}     {B}

Then check if equal

d1.ARR == d1.CRE

ID
23145    False
23456     True
dtype: bool

You can filter d1's index to get the list of not-equal IDs:

d1.index[d1.ARR != d1.CRE]

Int64Index([23145], dtype='int64', name='ID')
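Putting the steps above together into one runnable snippet (a sketch using the question's sample data, rebuilt here for self-containment):

```python
import pandas as pd

# Sample data from the question (hypothetical reconstruction)
df = pd.DataFrame({
    'ID':         [23145, 23145, 23145, 23145, 23456, 23456],
    'outcome':    ['A', 'A', 'B', 'C', 'B', 'B'],
    'Source_doc': ['ARR', 'CRE', 'ARR', 'CRE', 'ARR', 'CRE'],
})

# Outcome set per (ID, Source_doc), pivoted so ARR and CRE become columns
d1 = df.groupby(['ID', 'Source_doc']).outcome.apply(set).unstack()

# IDs whose ARR and CRE outcome sets differ
not_same_list = d1.index[d1.ARR != d1.CRE].tolist()
```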
piRSquared