I have two dataframes similar to this:
A = pd.DataFrame(data={"number": [123, 345], "subject_ids": []})
B = pd.DataFrame(data={"number": [123, 123, 345, 345], "subject_id": [222, 333, 444, 555]})
Meaning: Every number has at least two subject ids.
How would you go about merging these dataframes together, so there would be column "subject_ids" in the A dataframe containing joined list of ids in one cell?
"number": [123, 345], "subject_ids": [[222, 333], [444, 555]]
I've tried lots of methods like this:
A.merge(B, how='left', on='number')
But nothing seems to work. (I couldn't find an answer to this either)
The number is a key and those keys are identical, and the second df stores subjects to those numbers. I want the A dataframe to contain a reference to those subject IDs in a list assigned to one row with that given number. One number can have many subjects.
Complaint dataframe where I want the column with list of all subject IDs associated with the number:
number total_complaint_count first_complaint_on last_complaint_on
0 0000000000 77 2021-10-29 2021-12-05
77 00000000000 1 2021-11-12 2021-11-12
78 000000000000 1 2021-11-07 2021-11-07
79 00020056234 1 2021-11-23 2021-11-23
80 0002266648 1 2021-11-02 2021-11-02
Subject dataframe that contains the number to be associated with, subject and subject ID.
number subject \
787 0000000000 Other
4391 0000000000 Calls pretending to be government, businesses,...
694 0000000000 Warranties & protection plans
1106 0000000000 Other
4682 0000000000 Dropped call or no message
subject_id
787 38d1177e-51e8-4cec-aef8-0112f425091b
4391 1964fb22-bd20-4d49-beaf-51322a5f5bad
694 07819535-41b0-44f3-a497-ac2cee16dd1a
1106 2f348025-3f9f-4861-b151-fbb8a1ac14a3
4682 15d33ca0-6d90-42ba-9a1d-74e0dcf28539
Info of both dataframes:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 230122 entries, 0 to 281716
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 number 230122 non-null object
1 total_complaint_count 230122 non-null int64
2 first_complaint_on 19 non-null object
3 last_complaint_on 19 non-null object
dtypes: int64(1), object(3)
memory usage: 8.8+ MB
---------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 281720 entries, 787 to 9377
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 number 281720 non-null object
1 subject 281720 non-null object
2 subject_id 281720 non-null object
dtypes: object(3)
memory usage: 8.6+ MB