I have a DataFrame like the one below:
link_id     uploaded_timestamp       file_path
ScUwGwuS1k  2020-07-21 00:00:00 UTC  /home/user/Docs/file.pdf
ScUwGwuS1k  2020-12-16 00:00:00 UTC  /home/user/Downloads/file.pdf
In this dataset there are many cases where several rows share the same link_id but have different uploaded_timestamp and file_path values. The input has 13k+ rows.
I would like to keep only one row for each link_id: the row with the latest uploaded_timestamp.
What I already came up with is as follows:
import pandas as pd
new_df = pd.read_csv('input/input_csf_file.csv')
idx = new_df.groupby(['link_id'])
idx = idx.max()
However, when I spot-checked a few ids, the uploaded_timestamp was indeed the latest one, but the file_path did not come from the same row as that timestamp, like below:
link_id     uploaded_timestamp       file_path
ScUwGwuS1k  2020-12-16 00:00:00 UTC  /home/user/Docs/file.pdf
Could you please help me with that?
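For reference, groupby(...).max() computes the maximum of each column independently, so the timestamp and the file_path in a result row can come from different source rows. Below is a minimal sketch of one possible way to keep whole rows instead. It assumes the timestamps parse with pd.to_datetime; the second link_id and its path are made up for illustration, and the names df and latest are not from the original code.

```python
import pandas as pd

# The two sample rows from the question, plus one hypothetical
# extra link_id ("AbCdEfGh12") added only for illustration.
df = pd.DataFrame({
    "link_id": ["ScUwGwuS1k", "ScUwGwuS1k", "AbCdEfGh12"],
    "uploaded_timestamp": [
        "2020-07-21 00:00:00 UTC",
        "2020-12-16 00:00:00 UTC",
        "2020-03-01 00:00:00 UTC",
    ],
    "file_path": [
        "/home/user/Docs/file.pdf",
        "/home/user/Downloads/file.pdf",
        "/home/user/Other/file.pdf",
    ],
})

# Parse the timestamps so they compare chronologically, not as strings.
df["uploaded_timestamp"] = pd.to_datetime(df["uploaded_timestamp"], utc=True)

# idxmax gives, per link_id, the index label of the row holding the
# latest timestamp; .loc then selects those complete rows, so each
# file_path stays paired with its own timestamp.
latest = df.loc[df.groupby("link_id")["uploaded_timestamp"].idxmax()]

print(latest)
```

An alternative with the same intent would be to sort by uploaded_timestamp and then call drop_duplicates('link_id', keep='last'); which one is preferable probably depends on whether the original row order matters.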