| ID    | outcome | Source_doc |
|-------|---------|------------|
| 23145 | A       | ARR        |
| 23145 | A       | CRE        |
| 23145 | B       | ARR        |
| 23145 | C       | CRE        |
| 23456 | B       | ARR        |
| 23456 | B       | CRE        |
ID 23145 from ARR has outcomes [A, B], while ID 23145 from CRE has outcomes [A, C]. As you can see in the code below, I would therefore place ID 23145 in `not_same_list`. My data set has 445,000 rows, and the process takes 21 seconds per 100 rows, so the full run will take over 7 hours!
What is the slowest part of this loop?
Am I performing the Pandas searches most efficiently?
Would iterrows() be faster?
EDIT: Good point about the expected output. I'm actually just expecting a list of IDs. If the ARR_list and the CRE_list are not identical, I want to flag that ID and append it to a list (`not_same_list`). So I'm looking for [145, 178, ..., 989, (any ID where outcomes don't match between its Source Docs)].
import time

not_same_list = []
total_search_start_time = time.time()
tick = 0
for ID in uniq_IDs['ID'].unique():
    # Isolate this ID's outcomes by source doc
    # (one combined mask instead of chained boolean indexing)
    sco_ARR = uniq_IDs.loc[(uniq_IDs['ID'] == ID) & (uniq_IDs['Source_doc'] == 'ARR'), 'outcome']
    sco_CRE = uniq_IDs.loc[(uniq_IDs['ID'] == ID) & (uniq_IDs['Source_doc'] == 'CRE'), 'outcome']
    # Remove duplicates
    ARR_list = set(sco_ARR)
    CRE_list = set(sco_CRE)
    # Flag the ID if outcomes differ between source docs
    if ARR_list != CRE_list:
        not_same_list.append(ID)
    tick += 1
    if tick % 100 == 0:
        print('The last {} rows have taken {} seconds...'.format(tick, round(time.time() - total_search_start_time, 2)))
print(not_same_list)
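For reference, here is a vectorized sketch of the same comparison that avoids the per-ID loop entirely by grouping once on (ID, Source_doc). The DataFrame below is a stand-in I built from the sample table above; column names match my data, but the approach itself is untested on the full 445,000 rows:

```python
import pandas as pd

# Minimal reproducible data matching the sample table above
uniq_IDs = pd.DataFrame({
    'ID': [23145, 23145, 23145, 23145, 23456, 23456],
    'outcome': ['A', 'A', 'B', 'C', 'B', 'B'],
    'Source_doc': ['ARR', 'CRE', 'ARR', 'CRE', 'ARR', 'CRE'],
})

# One pass over the data: collect each ID's outcomes per source doc as a set,
# then pivot so each ID has an 'ARR' column and a 'CRE' column of sets
sets = uniq_IDs.groupby(['ID', 'Source_doc'])['outcome'].apply(set).unstack()

# IDs whose ARR and CRE outcome sets differ
not_same_list = sets.index[sets['ARR'] != sets['CRE']].tolist()
print(not_same_list)  # [23145] — {A, B} vs {A, C} differ; 23456 matches
```

Because the groupby touches each row once rather than rescanning the whole frame for every unique ID, this should scale far better than the loop.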