Hi, I am new to Python, and a friend recommended that I ask for help on Stack Overflow, so I decided to give it a shot. I'm currently using Python 3.x.
I have a data set of over 100k rows in a CSV file with no column headers, and I have loaded it into a pandas DataFrame.
Because the documents are confidential, I can't display the real data here, but this is an example of the data, with the columns defined as below:
("id", "name", "number", "time", "text_id", "text", "text")
1 | apple | 12 | 123 | 2 | abc | abc
1 | apple | 12 | 222 | 2 | abc | abc
2 | orange | 32 | 123 | 2 | abc | abc
2 | orange | 11 | 123 | 2 | abc | abc
3 | apple | 12 | 333 | 2 | abc | abc
3 | apple | 12 | 443 | 2 | abc | abc
3 | apple | 12 | 553 | 2 | abc | abc
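To make the example reproducible, here is a minimal sketch of loading the sample rows above into a DataFrame. It assumes the real file is comma-separated (the pipes above are only for display), and it renames the second "text" column to the hypothetical name "text_2" so that every column label is unique.

```python
import io

import pandas as pd

# Column names from the question; "text_2" is a made-up name for the
# second "text" column so that the labels are unique.
columns = ["id", "name", "number", "time", "text_id", "text", "text_2"]

# Stand-in for the confidential CSV file (assumed to be comma-separated).
sample_csv = """\
1,apple,12,123,2,abc,abc
1,apple,12,222,2,abc,abc
2,orange,32,123,2,abc,abc
2,orange,11,123,2,abc,abc
3,apple,12,333,2,abc,abc
3,apple,12,443,2,abc,abc
3,apple,12,553,2,abc,abc
"""

# header=None tells pandas the file has no header row;
# names=... supplies the column labels instead.
df = pd.read_csv(io.StringIO(sample_csv), header=None, names=columns)
print(df)
```

With the real file, `io.StringIO(sample_csv)` would be replaced by the path to the CSV.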
As you can see from the name column, I have 2 duplicate clusters of "apple", each with a different ID.
So my question is: how do I drop the entire cluster (all of its rows) that has the higher mean value based on "time"?
Example: if (cluster with ID: 1).mean(time) < (cluster with ID: 3).mean(time), then drop all the rows in the cluster with ID: 3.
Desired output:
1 | apple | 12 | 123 | 2 | abc | abc
1 | apple | 12 | 222 | 2 | abc | abc
2 | orange | 32 | 123 | 2 | abc | abc
2 | orange | 11 | 123 | 2 | abc | abc
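The rule I am after could be sketched roughly like this in pandas. This is only one possible approach, not necessarily the best one for 100k rows. It assumes a cluster is identified by the pair (name, id), that the second "text" column is renamed to the hypothetical "text_2", and that when two clusters tie on mean time the first one found is kept.

```python
import pandas as pd

# Sample data from the question; "text_2" is a made-up name for the
# second "text" column so that the labels are unique.
columns = ["id", "name", "number", "time", "text_id", "text", "text_2"]
rows = [
    (1, "apple", 12, 123, 2, "abc", "abc"),
    (1, "apple", 12, 222, 2, "abc", "abc"),
    (2, "orange", 32, 123, 2, "abc", "abc"),
    (2, "orange", 11, 123, 2, "abc", "abc"),
    (3, "apple", 12, 333, 2, "abc", "abc"),
    (3, "apple", 12, 443, 2, "abc", "abc"),
    (3, "apple", 12, 553, 2, "abc", "abc"),
]
df = pd.DataFrame(rows, columns=columns)

# Mean "time" of every (name, id) cluster, one row per cluster.
cluster_means = df.groupby(["name", "id"], as_index=False)["time"].mean()

# For each name, pick the cluster whose mean "time" is lowest.
lowest = cluster_means.loc[cluster_means.groupby("name")["time"].idxmin()]

# Keep only rows whose (name, id) pair belongs to a winning cluster.
# An inner merge on both keys avoids relying on ids being unique across names.
result = df.merge(lowest[["name", "id"]], on=["name", "id"], how="inner")
print(result)
```

On the sample data this keeps the clusters with ID 1 and ID 2 and drops the cluster with ID 3, which matches the desired output above.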
I would appreciate any help I can get, as I'm running out of time. Thanks in advance!