I have a DataFrame like the one below:

link_id     uploaded_timestamp       file_path
ScUwGwuS1k  2020-07-21 00:00:00 UTC  /home/user/Docs/file.pdf
ScUwGwuS1k  2020-12-16 00:00:00 UTC  /home/user/Downloads/file.pdf

In this dataset there are many cases where the link_id is the same, but the uploaded_timestamp and file_path differ. The input dataset has about 13k+ rows.

I would now like to keep only one row for each link_id: the one with the latest uploaded_timestamp.

What I have come up with so far is as follows:

new_df = pd.read_csv('input/input_csf_file.csv')

idx = new_df.groupby(['link_id'])
idx = idx.max()

However, when I spot-checked some ids, it looks like the latest uploaded_timestamp was taken, but the file_path does not belong to that row, as shown below:

link_id     uploaded_timestamp       file_path
ScUwGwuS1k  2020-12-16 00:00:00 UTC  /home/user/Docs/file.pdf

Could you please help me with that?
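
For reference, groupby(...).max() computes the maximum of each column independently, so the timestamp and the file_path in the result can come from different rows. Below is a minimal sketch of one possible way to pick whole rows instead, using idxmax. It rebuilds the two sample rows by hand rather than reading the CSV, and it assumes every timestamp has exactly the "YYYY-MM-DD HH:MM:SS UTC" shape shown above; the names df and latest are only illustrative.

```python
import pandas as pd

# Sample rows copied from the table above; the real data would come from the CSV.
df = pd.DataFrame({
    "link_id": ["ScUwGwuS1k", "ScUwGwuS1k"],
    "uploaded_timestamp": ["2020-07-21 00:00:00 UTC", "2020-12-16 00:00:00 UTC"],
    "file_path": ["/home/user/Docs/file.pdf", "/home/user/Downloads/file.pdf"],
})

# Assumption: all values follow this exact format, with a literal "UTC" suffix.
df["uploaded_timestamp"] = pd.to_datetime(
    df["uploaded_timestamp"], format="%Y-%m-%d %H:%M:%S UTC", utc=True
)

# idxmax gives, per link_id, the index label of the row with the latest timestamp;
# .loc then selects those complete rows, so file_path stays with its own timestamp.
latest = df.loc[df.groupby("link_id")["uploaded_timestamp"].idxmax()]
print(latest)
```

If two rows of the same link_id share the same latest timestamp, idxmax keeps the first one it encounters, which may or may not be what is wanted.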

bugZ