I have a list of person IDs, and for each ID, I want to extract all available information from two different dataframes. In addition, the types of information also have IDs, and I only want specific information IDs for each person ID. Here's how I'm currently doing this:
import pandas as pd
import numpy as np
import scipy as sp
from sqlalchemy import types

new_table = []
for i in range(ranges):  # ranges is the number of person IDs
    slice = pd.concat([df1[sp.logical_and(df1.person.values == persons[i],
                                          df1['info_id'].isin(info_ids))],
                       df2[sp.logical_and(df2.person.values == persons[i],
                                          df2['info_id'].isin(info_ids))]],
                      ignore_index=True)
    if len(set(slice['info_id'])) < amount_of_info_needed:
        continue
    full_time_range = max(slice['age_days']) - min(slice['age_days'])
    if full_time_range <= 1460:
        new_table.append(slice)
    else:
        window_end = min(slice['age_days']) + 1460
        slice = slice[slice.age_days < window_end + 1]
        if len(set(slice['info_id'])) >= amount_of_info_needed:
            new_table.append(slice)

new_table = pd.concat(new_table, axis=0)
new_table = new_table.groupby(['person', 'info_id']).agg(np.mean).reset_index()
new_table.to_sql('person_info_within4yrs', engine, if_exists='append', index=False,
                 dtype={'person': types.NVARCHAR(32), 'value': types.NVARCHAR(4000)})
I read that pd.concat in a loop should be avoided because it has quadratic time complexity, so I tried converting the dataframes to NumPy arrays and slicing and concatenating those instead, but that was even slower than pd.concat. After profiling each line with %lprun, nearly all of the time is spent in the pd.concat/logical_and operation inside the loop. This version is also faster than using .loc on both dataframes and concatenating the two slices together. After the if-else blocks, I append each slice to a list and, at the end, turn the list into a dataframe.
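For illustration, here is a minimal sketch of the kind of restructuring I'm considering instead of the per-person boolean masks: concatenate the two frames once, filter once with vectorized isin, then iterate per person with groupby. The dataframes below are toy stand-ins (same columns as my real data, made-up values):

```python
import pandas as pd

# Toy stand-ins for df1/df2 -- same columns as the real data, made-up values
df1 = pd.DataFrame({'person': ['a', 'a', 'b'],
                    'info_id': ['0v', '4v', '0v'],
                    'value': [1.0, 2.0, 3.0],
                    'age_days': [10, 20, 30]})
df2 = pd.DataFrame({'person': ['a', 'b', 'c'],
                    'info_id': ['4v', '9v', '0v'],
                    'value': [4.0, 5.0, 6.0],
                    'age_days': [15, 25, 35]})
persons = ['a', 'b']
info_ids = ['0v', '4v']

# Concatenate once, filter once with vectorized isin, then group per person
combined = pd.concat([df1, df2], ignore_index=True)
mask = combined['person'].isin(persons) & combined['info_id'].isin(info_ids)
filtered = combined[mask]

# One slice per person, without rebuilding masks inside a loop
slices = {person: grp for person, grp in filtered.groupby('person')}
```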
Edit: Here is an example of what I'm doing. The goal is to slice from both dataframes by person_id and info_id, combine the slices, and append the combined slice to a list, which I will then turn back into a dataframe and export to a SQL table. The if-else blocks are relevant too, but from my profiling they take barely any time at all so I'm not going to describe them in detail.
df1.head()
person info_id value age_days
0 000012eae6ea403ca564e87b8d44d0bb 0 100.0 28801
1 000012eae6ea403ca564e87b8d44d0bb 0 100.0 28803
2 000012eae6ea403ca564e87b8d44d0bb 0 100.0 28804
3 000012eae6ea403ca564e87b8d44d0bb 0 100.0 28805
4 000012eae6ea403ca564e87b8d44d0bb 0 100.0 28806
df2.head()
person info_id value age_days
0 00000554787a3cb38131c3c38578cacf 4v 97.0 12726
1 00000554787a3cb38131c3c38578cacf 14v 180.3 12726
2 00000554787a3cb38131c3c38578cacf 9v 2.0 12726
3 00000554787a3cb38131c3c38578cacf 3v 20.0 12726
4 00000554787a3cb38131c3c38578cacf 0v 71.0 12726
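For completeness, the if-else blocks amount to a per-slice window check like the sketch below; within_four_years is just an illustrative name, not a function from my code:

```python
import pandas as pd

# Sketch of the windowing logic: trim each person's slice to the first
# 1460 days (four years) of observations, and keep it only if enough
# distinct info_ids remain inside that window.
def within_four_years(slice_, amount_of_info_needed):
    window_end = slice_['age_days'].min() + 1460
    windowed = slice_[slice_['age_days'] <= window_end]
    if windowed['info_id'].nunique() >= amount_of_info_needed:
        return windowed
    return None
```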