Pandas: group by one column, take the max of another, and keep one non-duplicated row per group

Asked Dec 21 '20 at 09:24

Active Dec 21 '20 at 09:24

Viewed 5 times

I have a DataFrame like below

link_id     uploaded_timestamp       file_path
ScUwGwuS1k  2020-07-21 00:00:00 UTC  /home/user/Docs/file.pdf
ScUwGwuS1k  2020-12-16 00:00:00 UTC  /home/user/Downloads/file.pdf

In this data set the same link_id often appears several times, with different timestamps and file_paths. My input dataset has more than 13k rows.

I would now like to keep only one row per link_id: the one with the latest uploaded_timestamp.

What I already came up with is as follows:

import pandas as pd

new_df = pd.read_csv('input/input_csf_file.csv')

# Group by link_id and take the maximum of every remaining column
idx = new_df.groupby(['link_id'])
idx = idx.max()

However, when I spot-checked a few ids, uploaded_timestamp was indeed the latest value, but file_path did not come from the same row, as below:

link_id     uploaded_timestamp       file_path
ScUwGwuS1k  2020-12-16 00:00:00 UTC  /home/user/Docs/file.pdf
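
A minimal sketch of why this can happen: `groupby(...).max()` computes the maximum of each column independently, so the returned timestamp and file_path need not belong to the same row. The frame below uses hypothetical values (the id `A` and the `/tmp/...` paths are made up for illustration, not taken from the real CSV), and the timestamps are left as plain strings.

```python
import pandas as pd

# Hypothetical data: the newer upload has the alphabetically smaller path
sample = pd.DataFrame({
    "link_id": ["A", "A"],
    "uploaded_timestamp": ["2020-07-21 00:00:00 UTC", "2020-12-16 00:00:00 UTC"],
    "file_path": ["/tmp/b.pdf", "/tmp/a.pdf"],
})

# max() is applied column by column within each group
per_column_max = sample.groupby("link_id").max()
print(per_column_max)
```

Here the result pairs the latest timestamp (from the second row) with `/tmp/b.pdf` (from the first row), because each column's maximum is picked on its own.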

Could you please help me with that?

bugZ


Use `df = new_df.loc[new_df.groupby('link_id')['uploaded_timestamp'].idxmax()]` – jezrael Dec 21 '20 at 09:26
Your solution gives me a ValueError exception – bugZ Dec 21 '20 at 09:32
Is `uploaded_timestamp` converted to datetimes? – jezrael Dec 21 '20 at 09:33

I haven't done it previously. After conversion, it works perfectly! Thank you very much! – bugZ Dec 21 '20 at 09:47
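
Putting the comments above together, a sketch of the approach that worked: convert `uploaded_timestamp` to datetimes first, then select whole rows with `idxmax`. The frame is built inline instead of being read from the CSV; the third row (`hypothetical_id`) is invented for illustration, and passing `utc=True` to `pd.to_datetime` is an assumption about how the `UTC` suffix should be handled.

```python
import pandas as pd

new_df = pd.DataFrame({
    "link_id": ["ScUwGwuS1k", "ScUwGwuS1k", "hypothetical_id"],
    "uploaded_timestamp": [
        "2020-07-21 00:00:00 UTC",
        "2020-12-16 00:00:00 UTC",
        "2020-01-01 00:00:00 UTC",  # made-up row
    ],
    "file_path": [
        "/home/user/Docs/file.pdf",
        "/home/user/Downloads/file.pdf",
        "/home/user/other.pdf",  # made-up path
    ],
})

# Convert the string column to timezone-aware datetimes
new_df["uploaded_timestamp"] = pd.to_datetime(new_df["uploaded_timestamp"], utc=True)

# idxmax returns the index label of the latest row per link_id;
# .loc then pulls those complete rows, so file_path stays consistent
df = new_df.loc[new_df.groupby("link_id")["uploaded_timestamp"].idxmax()]
print(df)
```

Because `.loc` selects entire rows by label, every column in the result comes from the same original row, unlike the per-column `max()`.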
