
Here is the example data set:

id         firstname    lastname      email               update date
A1         wendy         smith         ws@mail.com        2018-01-02
A1         wendy         smith         smith@mail.com     2019-02-03 
A2         harry         lynn          lynn@mail.com      2016-04-03
A2         harry                       harylynn@mail.com  2019-03-12
A3         tinna         dickey        tinna@mail.com     2016-04-03
A3         tinna         dickey        tinna@mail.com     2013-06-12
A4         Tom           Lee           Tom@mail.com       2012-06-12
A5         Ella                        Ella@mail.com      2019-07-12
A6         Ben           Lang          Ben@mail.com       2019-03-12

I have sorted the data set by id and update date. I want to merge the rows with the same id: if one row has an empty value, fill it from the other row with the same id; if the values conflict, use the latest one. For rows with no duplicate id, leave the empty cells as they are.

The output should be:

id         firstname    lastname      email               update date
A1         wendy         smith         smith@mail.com     2019-02-03 
A2         harry         lynn          harylynn@mail.com  2019-03-12
A3         tinna         dickey        tinna@mail.com     2019-03-12
A4         Tom           Lee           Tom@mail.com       2012-06-12
A5         Ella                        Ella@mail.com      2019-07-12
A6         Ben           Lang          Ben@mail.com       2019-03-12

My attempt was to use ffill() to merge the rows with empty values and keep the last duplicate, but the result also affects cells that should stay empty (for example, the lastname for A5 should remain empty).

df = df.ffill().drop_duplicates('id', keep='last')
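
For reference, a minimal sketch that rebuilds the sample data and reproduces the problem, assuming the blank cells are NaN (column and frame names follow the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':        ['A1', 'A1', 'A2', 'A2', 'A3', 'A3', 'A4', 'A5', 'A6'],
    'firstname': ['wendy', 'wendy', 'harry', 'harry', 'tinna', 'tinna', 'Tom', 'Ella', 'Ben'],
    'lastname':  ['smith', 'smith', 'lynn', np.nan, 'dickey', 'dickey', 'Lee', np.nan, 'Lang'],
    'email':     ['ws@mail.com', 'smith@mail.com', 'lynn@mail.com', 'harylynn@mail.com',
                  'tinna@mail.com', 'tinna@mail.com', 'Tom@mail.com', 'Ella@mail.com', 'Ben@mail.com'],
    'update date': ['2018-01-02', '2019-02-03', '2016-04-03', '2019-03-12',
                    '2016-04-03', '2013-06-12', '2012-06-12', '2019-07-12', '2019-03-12'],
})

# A plain df.ffill() ignores the id boundaries: A5's missing lastname gets filled
# with 'Lee' from the A4 row above it, which is exactly the unwanted behaviour.
print(df.ffill().drop_duplicates('id', keep='last'))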
– user2748930

4 Answers


Use GroupBy.ffill to forward fill only within the same group, then use drop_duplicates:

df['lastname'] = df.groupby('id')['lastname'].ffill()
df = df.drop_duplicates('id', keep='last')

Or in one line (but less readable in my opinion), using assign:

df.assign(lastname=df.groupby('id')['lastname'].ffill()).drop_duplicates('id', keep='last')

Output

   id firstname lastname              email update date
1  A1     wendy    smith     smith@mail.com  2019-02-03
3  A2     harry     lynn  harylynn@mail.com  2019-03-12
5  A3     tinna   dickey     tinna@mail.com  2013-06-12
6  A4       Tom      Lee       Tom@mail.com  2012-06-12
7  A5      Ella      NaN      Ella@mail.com  2019-07-12
8  A6       Ben     Lang       Ben@mail.com  2019-03-12
– Erfan
  • if other columns contain empty values that need to be grouped, can I add more lines like `df['email'] = df.groupby('id')['email'].ffill()` ... before drop duplicates? Thanks – user2748930 Oct 03 '19 at 14:32
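
If several columns can contain blanks (as asked in the comment above), one way is to forward fill them all in a single call by selecting a list of columns from the groupby. A sketch, assuming `df` is the frame from the question:

# forward fill every value column within its own id group, then keep the last row per id
cols = ['firstname', 'lastname', 'email']
df[cols] = df.groupby('id')[cols].ffill()
df = df.drop_duplicates('id', keep='last')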

Use GroupBy.last after replacing the empty values with NaN.

Ex.

import numpy as np

df = df.replace('', np.nan, regex=True)
df1 = df.groupby('id', as_index=False, sort=False).last()
print(df1)

   id firstname lastname              email  updatedate
0  A1     wendy    smith     smith@mail.com  2019-02-03
1  A2     harry     lynn  harylynn@mail.com  2019-03-12
2  A3     tinna   dickey     tinna@mail.com  2013-06-12
3  A4       Tom      Lee       Tom@mail.com  2012-06-12
4  A5      Ella      NaN      Ella@mail.com  2019-07-12
5  A6       Ben     Lang       Ben@mail.com  2019-03-12
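
GroupBy.last picks the last non-null value per column in row order within each group, so it only reflects the "latest" row if the frame is ordered by update date inside each id. If that is not guaranteed, a sketch that sorts first (reusing the question's column name 'update date'):

import numpy as np

df = df.replace('', np.nan, regex=True)
df1 = (
    df.sort_values(['id', 'update date'])   # make the newest row the last one in each group
      .groupby('id', as_index=False, sort=False)
      .last()
)
print(df1)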
– bharatk

Try this:

df.groupby('id').ffill().drop_duplicates('id', keep='last')

output:

   id firstname lastname              email update date
1  A1     wendy    smith     smith@mail.com  2019-02-03
3  A2     harry     lynn  harylynn@mail.com  2019-03-12
5  A3     tinna   dickey     tinna@mail.com  2013-06-12
6  A4       Tom      Lee       Tom@mail.com  2012-06-12
7  A5      Ella      NaN      Ella@mail.com  2019-07-12
8  A6       Ben     Lang       Ben@mail.com  2019-03-12
– Divya Dass
  • because of `ffill()`, `id` won't be set as an index – Divya Dass Oct 03 '19 at 12:30
  • @Divya Dass When I tried your code, I got an error `KeyError: Index(['id'], dtype='object')`. Do you know what might cause the error? thanks – user2748930 Oct 03 '19 at 14:27
  • @user2748930 Not sure why you are getting this error. Erfan also stated to have got this error earlier. This error means that `id` has become an index in your case. In my case it had become a column after I had used ffill(). Therefore, I used drop_duplicates() after this. In your case `df.groupby('id').ffill().reset_index().drop_duplicates('id', keep='last')` might help. This will make `id` a column if it has become index in your code. – Divya Dass Oct 03 '19 at 15:20
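
One version-agnostic way around the `KeyError` discussed above (a sketch, assuming `df` is the frame from the question): `groupby('id').ffill()` keeps the original row index, so `id` can simply be re-attached before dropping duplicates:

filled = df.groupby('id').ffill()   # may or may not include the 'id' column, depending on the pandas version
filled['id'] = df['id']             # re-attach id; aligns on the original row index
out = filled.drop_duplicates('id', keep='last')
print(out)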

Use a combination of groupby, apply, and iloc:

df.groupby('id', as_index=False).apply(lambda x: x.ffill().iloc[-1])

   id firstname lastname              email  update date
0  A1     wendy    smith     smith@mail.com  2019-02-03
1  A2     harry     lynn  harylynn@mail.com  2019-03-12
2  A3     tinna   dickey     tinna@mail.com  2013-06-12
3  A4       Tom      Lee       Tom@mail.com  2012-06-12
4  A5      Ella      NaN      Ella@mail.com  2019-07-12
5  A6       Ben     Lang       Ben@mail.com  2019-03-12
  • groupby groups the dataframe by unique ids
  • ffill forward fills each NaN value within the group from the previous non-NaN row
  • iloc[-1] gets you the row with the latest data
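
iloc[-1] only returns the latest data when each id group is already ordered by date. A sketch (reusing the question's column names and assuming pandas is imported as pd and `df` is the sample frame) that parses the dates and sorts before grouping:

# parse the dates so "latest" is decided by real timestamps rather than string order
df['update date'] = pd.to_datetime(df['update date'])

merged = (
    df.sort_values(['id', 'update date'])
      .groupby('id', as_index=False)
      .apply(lambda g: g.ffill().iloc[-1])
      .reset_index(drop=True)
)
print(merged)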
– adrianp