1

I have a large group of data with various names and sources, in a large dataframe.

Reproducible data by Anshul Jain

First_Name  Last_Name   Source
      Matt      Jones       XX
     James      Smith       YY
     Smith     Weston       AA
    Weston  Supermare       CC
      Matt      Jones       YY
    Weston  Supermare       FF

# copy in with:
df = pd.read_clipboard(sep='\\s+')

The data looks as follows:

+------------+-----------+--------+
| First Name | Last Name | Source |
+------------+-----------+--------+
| Matt       | Jones     | XX     |
| James      | Smith     | YY     |
| Smith      | Weston    | AA     |
| Weston     | Supermare | CC     |
| Matt       | Jones     | YY     |
| Weston     | Supermare | FF     |
+------------+-----------+--------+

I need it to look like this:

+------------+-----------+--------+
| First Name | Last Name | Source |
+------------+-----------+--------+
| Matt       | Jones     | XX, YY |
| James      | Smith     | YY     |
| Smith      | Weston    | AA     |
| Weston     | Supermare | CC, FF |
+------------+-----------+--------+

I can get the deduplication process to work using:

Conn_df = Conn_df.drop_duplicates(subset=['First Name', 'Last Name'])

However, before I deduplicate, I need to record all the sources for the same data on the same row.

Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
Clauric
  • 1,826
  • 8
  • 29
  • 48
  • 2
    That looks nice, but is nearly impossible to get back into a dataframe without manually retyping everything. Please [provide a reproducible copy of the DataFrame with `df.to_clipboard(sep=',')`](https://stackoverflow.com/questions/52413246/how-to-provide-a-copy-of-your-dataframe-with-to-clipboard) – Trenton McKinney May 12 '20 at 19:25

2 Answers2

3

You can use DataFrame.groupby to group the dataframe by the columns First Name and Last Name and then apply the agg function join on the Source column.

Use:

result = Conn_df.groupby(["First Name", "Last Name"])["Source"].agg(', '.join).reset_index()
print(result)

This prints:

  First Name  Last Name  Source
0      James      Smith      YY
1       Matt      Jones  XX, YY
2      Smith     Weston      AA
3     Weston  Supermare  CC, FF
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
2

Do this:

print(df.groupby(by=['FirstName', 'LastName']).agg(lambda x: ', '.join(x)))
FirstName LastName
James     Smith          YY
Matt      Jones      XX, YY
Smith     Weston         AA
Weston    Supermare  CC, FF
NYC Coder
  • 7,424
  • 2
  • 11
  • 24
  • Hi, Thanks for the suggested solution. The "Source" field gets populated by "None" for every row, using the above. – Clauric May 13 '20 at 08:37