
I have two dataframes similar to this:

A = pd.DataFrame(data={"number": [123, 345]})
B = pd.DataFrame(data={"number": [123, 123, 345, 345], "subject_id": [222, 333, 444, 555]})

Meaning: Every number has at least two subject ids.

How would you go about merging these dataframes so that dataframe A gains a "subject_ids" column holding the joined list of IDs in a single cell?

"number": [123, 345], "subject_ids": [[222, 333], [444, 555]]

I've tried lots of methods like this:

A.merge(B, how='left', on='number')

But nothing seems to work. (I couldn't find an answer to this either)

The number is a key, and the keys are identical in both dataframes; the second dataframe stores the subjects for those numbers. I want dataframe A to hold, for each number, a list of all its subject IDs in a single row. One number can have many subjects.
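For reference, the plain left merge from the attempt above doesn't fail on valid input — it just produces one row per (number, subject_id) pair instead of one list per number (a minimal sketch on the toy data):

```python
import pandas as pd

A = pd.DataFrame({"number": [123, 345]})
B = pd.DataFrame({"number": [123, 123, 345, 345],
                  "subject_id": [222, 333, 444, 555]})

# a plain left merge repeats each number once per matching subject_id
merged = A.merge(B, how="left", on="number")
print(merged)
```

So the missing step is aggregating the matched IDs into one list per number before (or instead of) merging.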

Complaint dataframe where I want the column with list of all subject IDs associated with the number:

          number  total_complaint_count first_complaint_on last_complaint_on
0     0000000000                     77         2021-10-29        2021-12-05
77   00000000000                      1         2021-11-12        2021-11-12
78  000000000000                      1         2021-11-07        2021-11-07
79   00020056234                      1         2021-11-23        2021-11-23
80    0002266648                      1         2021-11-02        2021-11-02

Subject dataframe that contains the number to be associated with, subject and subject ID.

          number                                            subject  \
787   0000000000                                              Other   
4391  0000000000  Calls pretending to be government, businesses,...   
694   0000000000                     Warranties  & protection plans   
1106  0000000000                                              Other   
4682  0000000000                         Dropped call or no message   

                                subject_id  
787   38d1177e-51e8-4cec-aef8-0112f425091b  
4391  1964fb22-bd20-4d49-beaf-51322a5f5bad  
694   07819535-41b0-44f3-a497-ac2cee16dd1a  
1106  2f348025-3f9f-4861-b151-fbb8a1ac14a3  
4682  15d33ca0-6d90-42ba-9a1d-74e0dcf28539  

Info of both dataframes:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 230122 entries, 0 to 281716
Data columns (total 4 columns):
 #   Column                 Non-Null Count   Dtype 
---  ------                 --------------   ----- 
 0   number                 230122 non-null  object
 1   total_complaint_count  230122 non-null  int64 
 2   first_complaint_on     19 non-null      object
 3   last_complaint_on      19 non-null      object
dtypes: int64(1), object(3)
memory usage: 8.8+ MB
---------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 281720 entries, 787 to 9377
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   number      281720 non-null  object
 1   subject     281720 non-null  object
 2   subject_id  281720 non-null  object
dtypes: object(3)
memory usage: 8.6+ MB
  • Your sample code isn't runnable. –  Dec 09 '21 at 00:04
  • Sorry @user17242583, hope it's okay now. – Daniel Dec 09 '21 at 00:08
  • It throws `Shape of passed values is (2, 2), indices imply (1, 2)`... –  Dec 09 '21 at 00:13
  • I recommend modifying the dataframe such that each value is a scalar instead of a sequence – Paul H Dec 09 '21 at 04:19
  • Hello @PaulH, what advantages does it have and how do I do it? – Daniel Dec 09 '21 at 14:54
  • @Daniel I could tell you how to do that, but it'd be pointless without knowing how and why you have sequences as elements of dataframe in the first place. How did you arrive at this dataframe? Pandas really wasn't designed to efficiently work with sequences. – Paul H Dec 09 '21 at 16:40
  • This is just an (I guess bad) attempt to reproduce dataframes I made merging couple of csv files with over 280k records. Don't know if read_csv function makes value scalar or not. @PaulH – Daniel Dec 09 '21 at 16:57
  • Depends on your CSV, I suppose. You should be able to inspect the dataframe (via `df.info()`) after you read the CSV. Include sample rows and columns in the question. To really address your problem, context is needed. – Paul H Dec 09 '21 at 17:06

2 Answers


Pretty sure I have the answer for you this time.

dfa = pd.DataFrame(columns=['number'],
                   data=np.array([[123],
                                  [345]]))
dfb = pd.DataFrame(columns=['number', 'subject_id'],
                   data=np.array([[123, 222],
                                  [123, 333],
                                  [345, 444],
                                  [345, 555]]))

dfa['ids'] = ''  # create an empty 'ids' column in dfa

for x in dfa.itertuples():
    matched = []  # a plain Python list; avoid shadowing the built-in `list`
    for a in dfb.itertuples():
        # x[1] is the 'number' value from dfa, a[1] from dfb;
        # when they match, collect dfb's 'subject_id' value
        if x[1] == a[1]:
            matched.append(a[2])

    # store the list as a string at the index where the numbers match
    dfa.loc[x[0], 'ids'] = str(matched)

print(dfa)

I don't know how to quote the table output here, but the code above is copy-pasted aside from the imports.

I tried to keep the list format to no avail: lambda functions, setting the column with `.astype(object)`, including `dtype=object` in the DataFrame constructor, along with a bunch of other approaches.

If someone else knows how to keep the list as a list and add it to the dataframe using the code above, I would love to know as well.
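One way to keep actual lists rather than strings is to make the column object dtype and assign each cell with `DataFrame.at`, which accepts a list for a single cell — a sketch on the same toy data, not tested against the original 280k-row frames:

```python
import pandas as pd

dfa = pd.DataFrame({"number": [123, 345]})
dfb = pd.DataFrame({"number": [123, 123, 345, 345],
                    "subject_id": [222, 333, 444, 555]})

dfa["ids"] = None                       # None makes the column object dtype,
dfa["ids"] = dfa["ids"].astype(object)  # so a cell can hold a Python list

for row in dfa.itertuples():
    # select all subject_ids whose number matches this row's number
    matches = dfb.loc[dfb["number"] == row.number, "subject_id"].tolist()
    dfa.at[row.Index, "ids"] = matches  # .at assigns the list to one cell

print(dfa)
```

The boolean mask also replaces the inner loop, so this avoids the nested `itertuples` pass over `dfb` for every row of `dfa`.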

colk84
  • Thank you! This looks promising. I will try it once I can. When you have names to your indexes, can you pass the column name instead of an index? – Daniel Dec 09 '21 at 09:49
  • @Daniel yes you can, `dfa.set_index('number', inplace=True)`, number being the column name – colk84 Dec 10 '21 at 00:01

I found a solution in this post: How to implode (reverse of pandas explode) based on a column

I simply grouped by the number column, aggregated the subject IDs into one list per number, and merged the dataframes.

Here is the code if somebody needs it:

def create_subject_id_column(complaint_df, subject_df, subject_column="subject", number_column="number"):
    subject_df = subject_df.copy()
    # the subject text isn't needed, only the IDs
    subject_df.drop(subject_column, axis=1, inplace=True)
    # collapse all subject_ids sharing a number into one list per number
    subject_df = (subject_df.groupby(number_column)
      .agg({'subject_id': lambda x: x.tolist()})
      .reset_index())
    # attach the lists to the complaint dataframe
    combined_df = complaint_df.merge(subject_df, how="outer", on=number_column)
    return combined_df
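Applied to toy frames shaped like the question's example (the counts and subjects here are made up for illustration), the function nests the IDs as requested:

```python
import pandas as pd

def create_subject_id_column(complaint_df, subject_df,
                             subject_column="subject", number_column="number"):
    subject_df = subject_df.copy()
    subject_df.drop(subject_column, axis=1, inplace=True)
    # one list of subject_ids per number
    subject_df = (subject_df.groupby(number_column)
                  .agg({'subject_id': lambda x: x.tolist()})
                  .reset_index())
    return complaint_df.merge(subject_df, how="outer", on=number_column)

complaint_df = pd.DataFrame({"number": [123, 345],
                             "total_complaint_count": [2, 2]})
subject_df = pd.DataFrame({"number": [123, 123, 345, 345],
                           "subject": ["Other"] * 4,
                           "subject_id": [222, 333, 444, 555]})

result = create_subject_id_column(complaint_df, subject_df)
print(result)
```

Note the `how="outer"` also keeps numbers that appear in only one of the two frames; with `how="left"` the output would be restricted to the complaint dataframe's numbers.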
Daniel