1

I need to cleanup some data in a Pandas DataFrame and struggling with this.

Sample data:

Date       | ID     | Name             | Address
-----------------------------------------------------------------------------------------------
1-4-1987   | 124578 | T.Hilpert        | 518 Hessel Plaza Lake Lonzo, AZ 11863
23-6-1990  | 947383 | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury
12-5-1960  | 746732 | Earline Schulist | 57367 Alfredo Vista East Bertaburgh
9-9-2010   | 947383 | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury, WV 16241-5205
27-12-2017 | 124578 | Theresia Hilpert | 518 Hessel Plaza Lake Lonzo

What I want to do is this. Group by ID, get the name from the most recent date AND get the longest address string. Use these for all occurrences of the ID (in two new columns: Name_new and Address_New). Pleas find desired sample below:

Date       | ID     | Name             | Address                                                | Name_New         | Address_New
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
27-12-2017 | 124578 | Theresia Hilpert | 518 Hessel Plaza Lake Lonzo                            | Theresia Hilpert | 518 Hessel Plaza Lake Lonzo, AZ 11863
1-4-1987   | 124578 | T. Hilpert       | 518 Hessel Plaza Lake Lonzo, AZ 11863                  | Theresia Hilpert | 518 Hessel Plaza Lake Lonzo, AZ 11863
23-6-1990  | 947383 | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury                | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury, WV 16241-5205
9-9-2010   | 947383 | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury, WV 16241-5205 | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury, WV 16241-5205
12-5-1960  | 746732 | Earline Schulist | 57367 Alfredo Vista East Bertaburgh                    | Earline Schulist | 57367 Alfredo Vista East Bertaburgh

I've tried this but cant get it combined to get the desired result.

def f1(s):
    return max(s, key=len)

df_new = df['New_Address'] = df.groupby('ID').agg({'Address': f1})


df_new = df[df.groupby('ID').Date.transform('max') == df['Date']]

Your help is particularly appreciated.

John Doe
  • 9,843
  • 13
  • 42
  • 73

1 Answers1

1

Use transform for return Series with same size like original DataFrame, then create index by Name column and get value by maximal Date by idxmax:

df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
df['Address_New'] = df.groupby('ID')['Address'].transform(lambda s: max(s, key=len))
df['Name_New'] = df.set_index('Name').groupby('ID')['Date'].transform('idxmax').values
print (df)
        Date      ID              Name  \
0 1987-04-01  124578         T.Hilpert   
1 1990-06-23  947383   Birdie Reynolds   
2 1960-05-12  746732  Earline Schulist   
3 2010-09-09  947383   Birdie Reynolds   
4 2017-12-27  124578  Theresia Hilpert   

                                             Address  \
0              518 Hessel Plaza Lake Lonzo, AZ 11863   
1            964 Weissnat Green Suite 568 Rennerbury   
2                57367 Alfredo Vista East Bertaburgh   
3  964 Weissnat Green Suite 568 Rennerbury, WV 16...   
4                        518 Hessel Plaza Lake Lonzo   

                                         Address_New          Name_New  
0              518 Hessel Plaza Lake Lonzo, AZ 11863  Theresia Hilpert  
1  964 Weissnat Green Suite 568 Rennerbury, WV 16...   Birdie Reynolds  
2                57367 Alfredo Vista East Bertaburgh  Earline Schulist  
3  964 Weissnat Green Suite 568 Rennerbury, WV 16...   Birdie Reynolds  
4              518 Hessel Plaza Lake Lonzo, AZ 11863  Theresia Hilpert  
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • I managed to change the addresses to the new `New_Address` column. I'd like to do the same for the Name (not creating a new DataFrame. I used it just for testing my code. I tried `df['New_Name'] = df.groupby('ID')['Name'].Date.transform('max') == df['Date']`. But then I've an error : `AttributeError: 'SeriesGroupBy' object has no attribute 'Date'` – John Doe Dec 18 '18 at 08:46
  • @JohnDoe - I think you are close, need `df['New_Name'] = df.groupby('ID')['Name'].transform(lambda s: max(s, key=len))` – jezrael Dec 18 '18 at 08:49
  • I want to get the name from the most recent date. For the address it's just the longest string. – John Doe Dec 18 '18 at 08:53
  • @JohnDoe - Not sure if understand - `df.groupby('ID').Date.transform('max')` return latest dates, `df.groupby('ID').Name.transform('max')` working too, it return `returns the one that is at the bottom of the alphabetic list` (see [this](https://stackoverflow.com/a/50662475)) So need maximal length? Can you explain more? – jezrael Dec 18 '18 at 08:58
  • This is my code: `def f1(s): return max(s, key=len) df['New_Name'] = df.groupby('ID')['Name'].transform(lambda s: max(s, key=len)) df['New_Address'] = df.groupby('ID')['Address'].transform(f1)` For address it's just fine. For `Name` I need the most recent name (The name from the most recent date) Now it returns the longest name and not the most recent (please see example `27-12-2017 | 124578 | Theresia Hilpert` and `1-4-1987 | 124578 | Theresia Hilpert | 518 Hessel Plaza Lake Lonzo, AZ 11863` – John Doe Dec 18 '18 at 09:10
  • 1
    Many thanks. It works excellently!! And also thanks for the links for the explanation, – John Doe Dec 18 '18 at 09:26