
I have a list of person IDs, and for each ID I want to extract all the available information from two different dataframes. The types of information have IDs as well, and I only want specific info IDs for each person. Here's how I'm currently doing this:

    import numpy as np
    import pandas as pd
    from sqlalchemy import types

    new_table = []
    for i in range(len(persons)):
        # Pull this person's rows, restricted to the wanted info IDs,
        # from both dataframes and stack them into one slice.
        slice_ = pd.concat([df1[np.logical_and(df1.person.values == persons[i],
                                               df1['info_id'].isin(info_ids))],
                            df2[np.logical_and(df2.person.values == persons[i],
                                               df2['info_id'].isin(info_ids))]],
                           ignore_index=True)

        # Skip people without enough distinct info IDs.
        if slice_['info_id'].nunique() < amount_of_info_needed:
            continue
        full_time_range = slice_['age_days'].max() - slice_['age_days'].min()
        if full_time_range <= 1460:
            new_table.append(slice_)
        else:
            # Trim to a 4-year (1460-day) window from the earliest record
            # and re-check the distinct info ID count.
            window_end = slice_['age_days'].min() + 1460
            slice_ = slice_[slice_.age_days < window_end + 1]
            if slice_['info_id'].nunique() >= amount_of_info_needed:
                new_table.append(slice_)

    new_table = pd.concat(new_table, axis=0)
    new_table = new_table.groupby(['person', 'info_id']).mean().reset_index()
    new_table.to_sql('person_info_within4yrs', engine, if_exists='append', index=False,
                     dtype={'person': types.NVARCHAR(32), 'value': types.NVARCHAR(4000)})

I have read that pd.concat should not be used in a loop because of quadratic copying, so I tried converting the dataframes to arrays and slicing and concatenating those instead, but that was even slower than pd.concat. After profiling each line with %lprun, nearly all of the time is consumed by the pd.concat/logical_and operation in the loop. This code is also faster than using .loc on both dataframes and concatenating the two slices together. After the if-else blocks, I append to a list, and at the end I turn the list into a dataframe.
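For reference, the slower `.loc`-based variant mentioned above looked roughly like this (a sketch, not the exact code; it assumes the same `persons` and `info_ids` as in the loop above):

    # Sketch of the .loc variant: boolean-mask each dataframe, then combine.
    for i in range(len(persons)):
        s1 = df1.loc[(df1['person'] == persons[i]) & (df1['info_id'].isin(info_ids))]
        s2 = df2.loc[(df2['person'] == persons[i]) & (df2['info_id'].isin(info_ids))]
        combined = pd.concat([s1, s2], ignore_index=True)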

Edit: Here is an example of what I'm doing. The goal is to slice both dataframes by person_id and info_id, combine the slices, and append each combined slice to a list, which I then turn back into a dataframe and export to a SQL table. The if-else blocks are relevant too, but my profiling shows they take barely any time, so I won't describe them in detail.

    df1.head()
                                 person info_id  value  age_days
    0  000012eae6ea403ca564e87b8d44d0bb       0  100.0     28801
    1  000012eae6ea403ca564e87b8d44d0bb       0  100.0     28803
    2  000012eae6ea403ca564e87b8d44d0bb       0  100.0     28804
    3  000012eae6ea403ca564e87b8d44d0bb       0  100.0     28805
    4  000012eae6ea403ca564e87b8d44d0bb       0  100.0     28806

    df2.head()
                                 person info_id  value  age_days
    0  00000554787a3cb38131c3c38578cacf      4v   97.0     12726
    1  00000554787a3cb38131c3c38578cacf     14v  180.3     12726
    2  00000554787a3cb38131c3c38578cacf      9v    2.0     12726
    3  00000554787a3cb38131c3c38578cacf      3v   20.0     12726
    4  00000554787a3cb38131c3c38578cacf      0v   71.0     12726
  • Please post a few rows of all dataframes. Also, why not `merge` subset cols by ID? – Parfait Feb 27 '18 at 15:05
  • Welcome to SO! Please provide a sample data. Refer to this post: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – YOLO Feb 27 '18 at 15:05
  • I tried iteratively merging and that was slower, but are you suggesting merging both dataframes at the beginning and only slicing from one dataframe? – cherrytomato967 Feb 27 '18 at 15:13

1 Answer


I took Parfait's advice and first concatenated both dataframes into one; a coworker then gave me a solution for iterating through the combined dataframe, which consisted of ~117M rows across ~246K person IDs. The idea is to build a dictionary where each key is a person ID and each value is the list of row indices for that person in the dataframe, then use .iloc to slice the dataframe by looking the indices up in the dictionary. It finished running in about an hour.

    # One-time combination of the two dataframes, per Parfait's suggestion
    # (the combined frame is referred to as df1 below).
    df1 = pd.concat([df1, df2], ignore_index=True)

    # Map each person ID to the tuple of row positions where it appears.
    # This assumes a default RangeIndex, so the old index labels produced by
    # reset_index() double as positions for .iloc.
    idx = df1['person'].reset_index().groupby('person')['index'].apply(tuple).to_dict()

    for person_id, row_idx in idx.items():
        mrn_slice = df1.iloc[list(row_idx)]
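From there, each per-person slice can go through the same distinct-info-ID and 1460-day-window checks as in the question. A minimal sketch of how the pieces fit together, assuming `info_ids` and `amount_of_info_needed` are defined as in the question:

    new_table = []
    for person_id, row_idx in idx.items():
        s = df1.iloc[list(row_idx)]
        s = s[s['info_id'].isin(info_ids)]  # keep only the wanted info IDs
        if s['info_id'].nunique() < amount_of_info_needed:
            continue
        # Trimming to the 4-year window is a no-op when the full range already
        # fits, so a single trim-then-recount covers both branches.
        s = s[s['age_days'] <= s['age_days'].min() + 1460]
        if s['info_id'].nunique() >= amount_of_info_needed:
            new_table.append(s)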