Optimize building a new array column from other dataframes

Question

I have the following 3 dataframes that I am trying to combine:

Leads

                   Id            Name              Title
0  00Q6F00000zEkMXUA0       V.B Swamy  Managing Director
1  00Q6F00000zEkMXUA1    Vandana Suri      Founder & CEO
2  00Q6F00000zEkMXUA2      Jane Smith            Advisor

Campaigns

                    Id                      Name    NumberOfLeads
0   7016F000001Oo2xQAC        Testing Campaign A                1
1   7016F000001bHoHQAU        Testing Campaign B                2

Campaign Members

             CampaignId              LeadId
0    7016F000001Oo2xQAC  00Q6F00000zEkMXUA0
1    7016F000001bHoHQAU  00Q6F00000zEkMXUA0
2    7016F000001bHoHQAU  00Q6F00000zEkMXUA1

The final output I am trying to get to is:

Leads with Campaigns

                   Id            Name              Title                   Campaigns
0  00Q6F00000zEkMXUA0       V.B Swamy  Managing Director      ['Testing Campaign A', 'Testing Campaign B']
1  00Q6F00000zEkMXUA1    Vandana Suri      Founder & CEO      ['Testing Campaign B']
2  00Q6F00000zEkMXUA2      Jane Smith            Advisor      []

The above output is generated by getting the list of Campaign Members (which are Leads) for each Campaign, and then adding that info as a new Campaigns column in the Leads dataframe.

I was able to implement this myself with the following logic, but am running into issues when the Leads and Campaign Members dataframes are very large. My machine runs out of memory half way through processing the data.

# List of all campaign names we process
campaign_keys = []

for index, row in campaigns.iterrows():
    cname = row['Name']
    cid = row['Id']

    # Get members of this campaign
    matching_members = campaign_members[campaign_members['CampaignId'] == cid]
    
    # Create dataframe of campaign member lead ids
    campaign_df = matching_members['LeadId'].to_frame()
    campaign_df = campaign_df.rename(columns={'LeadId': 'Id'})
    campaign_df[cname] = 1

    # Add to array
    campaign_keys.append(cname)

    # Merge the campaign members with the leads df
    leads_df = leads_df.merge(campaign_df, how='left', left_on='Id', right_on='Id')

    
# Get only the columns for the campaigns we loaded
cdf = leads_df[campaign_keys]

# Build campaigns column
lists_entry = cdf.eq(1).apply(lambda x: list(x.index[x]), axis=1)
leads_df['Campaigns'] = lists_entry

# Drop all List columns
leads_df.drop(campaign_keys, axis=1)

How can I optimize this code to handle larger payloads and more campaigns?

Any help would be greatly appreciated. Thanks!

@mapf The Campaign Members dataframe is used in the logic to determine which Leads belong to which Campaigns. The final output maps Leads -> a list of Campaigns (via the Campaign Members dataframe) — Hassan Syyid, Oct 18 '21 at 19:15
@mapf I don't think there's any typo – I've just updated my sample to include a few more cases here. — Hassan Syyid, Oct 18 '21 at 19:24

not_speshal · Accepted Answer · 2021-10-18T19:41:04.233

1

Try:

merge the campaigns and members frames on the "CampaignId"
groupby and create lists of campaigns for each "LeadId"
merge the leads DataFrame with the newly created frame

campaign_members = campaigns.merge(members, left_on="Id", right_on="CampaignId", how="right")
campaign_lists = campaign_members.groupby("LeadId")["Name"].agg(list)
output = leads.merge(campaign_lists.rename("Campaigns"), left_on="Id", right_index=True, how="left")

>>> output

                    Id          Name                    Title               Campaigns
0   00Q6F00000zEkMXUA0     V.B Swamy        Managing Director       ['Testing Campaign A', 'Testing Campaign B']
1   00Q6F00000zEkMXUA1  Vandana Suri            Founder & CEO       ['Testing Campaign B']
2   00Q6F00000zEkMXUA2    Jane Smith                  Advisor

edited Oct 18 '21 at 19:41

answered Oct 18 '21 at 19:09

not_speshal

22,093
2
15
30

This would just combine the three dataframes together. I'm trying to get that final Campaigns array column I listed in the question – Hassan Syyid Oct 18 '21 at 19:11
And how is the "Campaigns" column in the result not what you want? If it isn't, can you please expand your example? – not_speshal Oct 18 '21 at 19:12
I explained in the question, but basically the **Campaigns** column would part of the **Leads** Dataframe and would be an array of all the columns the Lead is part of. Does that clarify? – Hassan Syyid Oct 18 '21 at 19:17
I mean your output does contain only one element per list so it's a bit difficult to understand. Can you include a case where the list is larger? – not_speshal Oct 18 '21 at 19:18
I've updated my original example with some more cases. One lead which is in 2 campaigns, and another which is in none. Does that help? – Hassan Syyid Oct 18 '21 at 19:21
Testing locally now. What are your thoughts on performance of your solution vs what I had listed in my original question? Should yours be faster / use less memory? – Hassan Syyid Oct 18 '21 at 19:51
Mine will definitely be faster. Looping over DataFrames is never encouraged. See [this](https://stackoverflow.com/a/55557758/9857631) excellent answer. – not_speshal Oct 18 '21 at 20:12

Optimize building a new array column from other dataframes

1 Answers1