
I have a DataFrame of around 2.5 million records, and a dictionary of around 10,000 records. For every row of the DataFrame I need to compare one column, which holds a list of values, against the matching list in the dictionary, and take the first two values from the dictionary's list, excluding any values already present in the row's list.

Here is some dummy data.

# originally this dictionary has around 10,000 records
temp = {
  0: {
    'AT_1': ['AN_1_2', 'AN_1_0', 'AN_1_1', 'AN_1_3'],
    'AT_2': ['AN_2_1'],
    'AT_3': ['AN_3_0', 'AN_3_1']
  },
  1: {
    'AT_1': ['AN_1_1', 'AN_1_0'],
    'AT_2': ['AN_2_0', 'AN_2_1'],
    'AT_3': ['AN_3_1']
  }
}

# originally this dataframe has around 2.5 million records
Name  Group Points         Subpoints  Additional_Points
   x    0.0   AT_1          [AN_1_0]                NaN
   y    1.0   AT_2  [AN_2_0, AN_2_1]                NaN
   z    1.0   AT_1          [AN_1_1]                NaN
   w    0.0   AT_1               NaN                NaN
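
For reproducibility, here is a minimal sketch that rebuilds this dummy frame (my own reconstruction, not from the question; it assumes `Subpoints` holds actual Python lists, with `NaN` marking missing entries):

import numpy as np
import pandas as pd

# rebuild the dummy frame shown above; Subpoints holds real lists,
# and np.nan marks the missing entries
people = pd.DataFrame({
    'Name': ['x', 'y', 'z', 'w'],
    'Group': [0.0, 1.0, 1.0, 0.0],
    'Points': ['AT_1', 'AT_2', 'AT_1', 'AT_1'],
    'Subpoints': [['AN_1_0'], ['AN_2_0', 'AN_2_1'], ['AN_1_1'], np.nan],
    'Additional_Points': [np.nan] * 4,
})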

The output should look like this:

# match each row against the dictionary on 'Group' and 'Points', then take the first two values that are not already present in 'Subpoints'.
Name  Group Points         Subpoints Additional_Points
   x    0.0   AT_1          [AN_1_0]  [AN_1_2, AN_1_1]
   y    1.0   AT_2  [AN_2_0, AN_2_1]                []
   z    1.0   AT_1          [AN_1_1]          [AN_1_0]
   w    0.0   AT_1               NaN               NaN

This is the code I am currently using. It works correctly, but it is far too slow for 2.5 million records.

import numpy as np

people['Additional_Points'] = people.apply(
    lambda x: [i for i in temp[x['Group']][x['Points']]
               if i not in x['Subpoints']][:2]
              if x.notna()['Subpoints'] else np.nan,
    axis=1,
)

Kindly let me know if there is any way to make this whole operation faster.

Tony Montana
  • Have you tried the `.map()` method? Parse your column as an actual `list` type first instead of taking the string literal. Then map the column on the keys to create the `Additional_Points` column, which will give you all the values from the dict. But you seem to want to drop the values in that list that already exist in the `Subpoints` column, so use `set` operations like `.difference()`? – Joe Jan 22 '20 at 19:10
  • Honestly, I'm not sure you're going to get anything *efficient*. You're storing lists in a DataFrame and you're going to pay the price of a loop over the rows in one way or another. Even if you find a performant way to do the join with a merge, you'll pay the price to get back to the lists. I'd rethink how you're storing the information. – ALollz Jan 22 '20 at 19:42
  • That being said, in the end it's a row-wise set difference, with different sets for every comparison, and I'm stumped to figure out a way around looping when the subpoints can have more than a single element. You can probably squeeze out some time using a list comprehension instead of DataFrame.apply (sketched below; see https://stackoverflow.com/a/55557758/4333359). – ALollz Jan 22 '20 at 19:58
  • I think @ALollz made a really good point: _I'd rethink how you're storing the information._ Can you share more context for this? – AMC Jan 22 '20 at 20:00
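
To make the list-comprehension suggestion concrete, here is a sketch (my own, not from the thread) that swaps the row-wise `apply` for a plain loop over `zip`ped columns; the `Group`/`Points` lookup and the exclusion logic are unchanged, but converting `Subpoints` to a `set` makes the membership tests O(1). The helper name `first_two_extra` is hypothetical:

import numpy as np

def first_two_extra(group, points, subpoints):
    # keep NaN rows as NaN, mirroring the original apply
    if not isinstance(subpoints, list):
        return np.nan
    # float Group values (0.0, 1.0) still hit the int keys 0, 1,
    # because equal numbers hash equally in Python
    exclude = set(subpoints)
    return [i for i in temp[group][points] if i not in exclude][:2]

# zipping plain columns avoids building a Series for every row,
# which is where DataFrame.apply spends most of its time
people['Additional_Points'] = [
    first_two_extra(g, p, s)
    for g, p, s in zip(people['Group'], people['Points'], people['Subpoints'])
]

Any win here comes mostly from skipping the per-row Series construction; as the comments note, the loop itself is unavoidable as long as the lists live inside the frame.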

0 Answers