1

Hi as I am new to python, a friend recommended me to seek help on stackoverflow, so I decided to give it a shot. I'm currently using python version 3.x.

I have over 100k of data set in a csv file with no column header, I have loaded the data into pandas DataFrame. Due to the fact that the documents are confidential I cant display the data here but this is an example of the data and column that can be define as below

("id", "name", "number", "time", "text_id", "text", "text")

1 | apple | 12 | 123 | 2 | abc | abc

1 | apple | 12 | 222 | 2 | abc | abc

2 | orange | 32 | 123 | 2 | abc | abc

2 | orange | 11 | 123 | 2 | abc | abc

3 | apple | 12 | 333 | 2 | abc | abc

3 | apple | 12 | 443 | 2 | abc | abc

3 | apple | 12 | 553 | 2 | abc | abc

As you can see from the name column, I have 2 duplicates clusters of "apple" but with different ID.

so my question is: how do I drop the entire cluster (rows) that has a higher mean value base on "time"?

Example: if (cluster with ID: 1).mean(time) < (cluster with ID: 3).mean(time) then drop all the rows in cluster with ID: 3

Desired output:

1 | apple | 12 | 123 | 2 | abc | abc

1 | apple | 12 | 222 | 2 | abc | abc

2 | orange | 32 | 123 | 2 | abc | abc

2 | orange | 11 | 123 | 2 | abc | abc

I need a lot of help and any that I can get, I'm running out of time, thanks in advance!

2 Answers2

1

You can use groupby and apply to get the rows that you want to remove first. Then you can use take to obtain the final result.

import pandas as pd

## define the rows with higher than mean value
def my_func(df):
    return df[df['time'] > df['time'].mean()]

## get rows to removed
df1 = df.groupby(by='name', group_keys=False).apply(my_func)

## take only the row we want
index_to_keep = set(range(df.shape[0])) - set(df1.index)
df2 = df.take(list(index_to_keep))

Example:

## df
id    name  number  time  text_id text text1
0   1   apple      12   123        2  abc   abc
1   1   apple      12   222        2  abc   abc
2   2  orange      32   123        2  abc   abc
3   2  orange      11   123        2  abc   abc
4   3   apple      12   333        2  abc   abc
5   3   apple      12   444        2  abc   abc
6   3   apple      12   553        2  abc   abc

df1 = df.groupby(by='name', group_keys=False).apply(my_func)

## df1
id   name  number  time  text_id text text1
5   3  apple      12   444        2  abc   abc
6   3  apple      12   553        2  abc   abc

index_to_keep = set(range(df.shape[0])) - set(df1.index)
df2 = df.take(list(index_to_keep))

#index_to_keep
{0, 1, 2, 3, 4}

# df2
id    name  number  time  text_id text text1
0   1   apple      12   123        2  abc   abc
1   1   apple      12   222        2  abc   abc
2   2  orange      32   123        2  abc   abc
3   2  orange      11   123        2  abc   abc
4   3   apple      12   333        2  abc   abc

P.S I took the usage of take from this answer.

SSC
  • 1,311
  • 5
  • 18
  • 29
  • Hi @SCC, thank you for the reply but erm however what I'm looking for is that the index_to_keep is {0, 1, 2, 3} 4 have to be dropped as well as it belongs to the cluster with ID:3 Is there a way to calculate by the cluster mean value (based on time)? so if the cluster with a higher mean value = drop Example: (cluster with ID:1).mean(time) < (cluster with ID: 3).mean(time) = drop all rows with cluster ID: 3 –  Nov 04 '17 at 12:06
  • You can adjust the condition in the `my_func()` in my example by change `>` to `>=` or other conditions depends on your requirement. – SSC Nov 05 '17 at 00:40
0

What you need are these things:

Try the following:

import pandas as pd

df = pd.read_csv('filename.csv', header=None)
df.columns = ['id', 'name', 'number', 'time', 'text_id', 'text', 'text']

print(df)

for eachname in df.name.unique():
    eachname_df = df.loc[df['name'] == eachname]
    grouped_df = eachname_df.groupby(['id', 'name'])
    avg_name = grouped_df['time'].mean()

    for a, b in grouped_df:
        if b['time'].mean() != avg_name.min():
            indextodrop = b.index.get_values()
            for eachindex in indextodrop:
                df = df.drop([eachindex])

print(df)


Result:
   id    name  number  time  text_id text text
0   1   apple      12   123        2  abc  abc
1   1   apple      12   222        2  abc  abc
2   2  orange      32   123        2  abc  abc
3   2  orange      11   123        2  abc  abc
4   3   apple      12   333        2  abc  abc
5   3   apple      12   443        2  abc  abc
6   3   apple      12   553        2  abc  abc

   id    name  number  time  text_id text text
0   1   apple      12   123        2  abc  abc
1   1   apple      12   222        2  abc  abc
2   2  orange      32   123        2  abc  abc
3   2  orange      11   123        2  abc  abc
Joseph K.
  • 1,055
  • 3
  • 23
  • 46