Python 3.x how to keep a group of duplicate with smaller mean value?

Question

Hi as I am new to python, a friend recommended me to seek help on stackoverflow, so I decided to give it a shot. I'm currently using python version 3.x.

I have over 100k of data set in a csv file with no column header, I have loaded the data into pandas DataFrame. Due to the fact that the documents are confidential I cant display the data here but this is an example of the data and column that can be define as below

("id", "name", "number", "time", "text_id", "text", "text")

1 | apple | 12 | 123 | 2 | abc | abc

1 | apple | 12 | 222 | 2 | abc | abc

2 | orange | 32 | 123 | 2 | abc | abc

2 | orange | 11 | 123 | 2 | abc | abc

3 | apple | 12 | 333 | 2 | abc | abc

3 | apple | 12 | 443 | 2 | abc | abc

3 | apple | 12 | 553 | 2 | abc | abc

As you can see from the name column, I have 2 duplicates clusters of "apple" but with different ID.

so my question is: how do I drop the entire cluster (rows) that has a higher mean value base on "time"?

Example: if (cluster with ID: 1).mean(time) < (cluster with ID: 3).mean(time) then drop all the rows in cluster with ID: 3

Desired output:

1 | apple | 12 | 123 | 2 | abc | abc

1 | apple | 12 | 222 | 2 | abc | abc

2 | orange | 32 | 123 | 2 | abc | abc

2 | orange | 11 | 123 | 2 | abc | abc

I need a lot of help and any that I can get, I'm running out of time, thanks in advance!

SSC · Answer 1 · 2017-11-04T09:15:03.983

You can use groupby and apply to get the rows that you want to remove first. Then you can use take to obtain the final result.

import pandas as pd

## define the rows with higher than mean value
def my_func(df):
    return df[df['time'] > df['time'].mean()]

## get rows to removed
df1 = df.groupby(by='name', group_keys=False).apply(my_func)

## take only the row we want
index_to_keep = set(range(df.shape[0])) - set(df1.index)
df2 = df.take(list(index_to_keep))

Example:

## df
id    name  number  time  text_id text text1
0   1   apple      12   123        2  abc   abc
1   1   apple      12   222        2  abc   abc
2   2  orange      32   123        2  abc   abc
3   2  orange      11   123        2  abc   abc
4   3   apple      12   333        2  abc   abc
5   3   apple      12   444        2  abc   abc
6   3   apple      12   553        2  abc   abc

df1 = df.groupby(by='name', group_keys=False).apply(my_func)

## df1
id   name  number  time  text_id text text1
5   3  apple      12   444        2  abc   abc
6   3  apple      12   553        2  abc   abc

index_to_keep = set(range(df.shape[0])) - set(df1.index)
df2 = df.take(list(index_to_keep))

#index_to_keep
{0, 1, 2, 3, 4}

# df2
id    name  number  time  text_id text text1
0   1   apple      12   123        2  abc   abc
1   1   apple      12   222        2  abc   abc
2   2  orange      32   123        2  abc   abc
3   2  orange      11   123        2  abc   abc
4   3   apple      12   333        2  abc   abc

P.S I took the usage of take from this answer.

Hi @SCC, thank you for the reply but erm however what I'm looking for is that the index_to_keep is {0, 1, 2, 3} 4 have to be dropped as well as it belongs to the cluster with ID:3 Is there a way to calculate by the cluster mean value (based on time)? so if the cluster with a higher mean value = drop Example: (cluster with ID:1).mean(time) < (cluster with ID: 3).mean(time) = drop all rows with cluster ID: 3 — , Nov 04 '17 at 12:06
You can adjust the condition in the `my_func()` in my example by change `>` to `>=` or other conditions depends on your requirement. — SSC, Nov 05 '17 at 00:40

Joseph K. · Accepted Answer · 2017-11-04T20:25:36.760

What you need are these things:

Try the following:

import pandas as pd

df = pd.read_csv('filename.csv', header=None)
df.columns = ['id', 'name', 'number', 'time', 'text_id', 'text', 'text']

print(df)

for eachname in df.name.unique():
    eachname_df = df.loc[df['name'] == eachname]
    grouped_df = eachname_df.groupby(['id', 'name'])
    avg_name = grouped_df['time'].mean()

    for a, b in grouped_df:
        if b['time'].mean() != avg_name.min():
            indextodrop = b.index.get_values()
            for eachindex in indextodrop:
                df = df.drop([eachindex])

print(df)


Result:
   id    name  number  time  text_id text text
0   1   apple      12   123        2  abc  abc
1   1   apple      12   222        2  abc  abc
2   2  orange      32   123        2  abc  abc
3   2  orange      11   123        2  abc  abc
4   3   apple      12   333        2  abc  abc
5   3   apple      12   443        2  abc  abc
6   3   apple      12   553        2  abc  abc

   id    name  number  time  text_id text text
0   1   apple      12   123        2  abc  abc
1   1   apple      12   222        2  abc  abc
2   2  orange      32   123        2  abc  abc
3   2  orange      11   123        2  abc  abc

Python 3.x how to keep a group of duplicate with smaller mean value?

2 Answers2