3

I have a CSV file that contains rows with some similar id's. I found a nice approach doing this with dataframe and I found the code doing this from [this] 1 post.

Sample CSV file:

    id               messages
0   11  I am not driving home
1   11      Please pick me up
2   11     I don't have money
3  103   The car already park
4  103     No need for ticket
5  104       I will buy a car
6  104       I will buy a car

Desired output is:

Sample CSV file:

id   messages        
011   I am not driving home Please pick me up I don't have money     
103   The car already park No need for ticket         
104   I will buy a car              

Now the code that I have so far is:

aggregation_functions = {'message':'sum'}
df_new = df.groupby(df['id']).aggregate(aggregation_functions)

Now what I am getting with this code is:

id   messages        
011   I am not driving homePlease pick me upI don't have money      
103   The car already parkNo need for ticket         
104   I will buy a car 

I just want to have the space between words (eg. "homePlease" > "home Please") and avoid redundancy such as having two times of I will buy a car.

I already checked the post 2 but I couldn't find my answer.

Also do i need to use .reindex(columns=df.columns) after the aggregate(aggregation_functions)

Like:

df_new = df.groupby(df['id']).aggregate(aggregation_functions).reindex(columns=df.columns)
micstr
  • 5,080
  • 8
  • 48
  • 76
Bilgin
  • 499
  • 1
  • 10
  • 25

3 Answers3

3

To remove redundancy, I suggest GroupBy.unique followed by str.join:

df.groupby('id')['messages'].unique().str.join(' ')

Alternatively, using GroupBy.agg with set + ' '.join:

df.groupby('id')['messages'].agg(lambda x: ' '.join(set(x)))

Both print,

# id
# 11     I don't have money I am not driving home Pleas...
# 103              No need for ticket The car already park
# 104                                     I will buy a car
# Name: messages, dtype: object

To return a DataFrame, call reset_index at the end... for example,

df.groupby('id')['messages'].unique().str.join(' ').reset_index()

#     id                                           messages
# 0   11  I am not driving home Please pick me up I don'...
# 1  103            The car already park No need for ticket
# 2  104                                   I will buy a car
cs95
  • 379,657
  • 97
  • 704
  • 746
  • Thanks for your comment. I need to have the header `id` and `message`, because I will use them in my dataframe. But based on your code and output, I am not able to have the header. Can you help. Thanks – Bilgin May 31 '19 at 02:25
  • @Bilgin It's as simple as `result = df.groupby('id')['messages'].unique().str.join(' ').reset_index()` – cs95 May 31 '19 at 02:27
  • thanks. Sorry, i am new in python. how about if I use `df.groupby('id')['messages'].unique().str.join(' ')`? is it the same? – Bilgin May 31 '19 at 02:29
  • @Bilgin That returns a Series with ID as the index. If you call `.reset_index()` on that result, it converts the index into a column, and the overall result is a DataFrame. Take a look at the code and let me know if there are other issues. – cs95 May 31 '19 at 02:35
  • I understand now. Thanks. Now it is returning the header. One last question is that how can I get rid of that index number on the left `0,1,2,..`. I am doing sum results generated and saving them into CSV file and those numbers appearing in the CSV file with no header and it is confusing. Thanks for your time – Bilgin May 31 '19 at 02:44
  • @Bilgin Add a parameter `index=False` when you call `df.to_csv`. – cs95 May 31 '19 at 03:02
  • Worked perfectly. Thanks – Bilgin May 31 '19 at 03:05
3

So it will be drop_duplicates first , the agg join

df.drop_duplicates().groupby('id',as_index=False).messages.agg(' '.join)
BENY
  • 317,841
  • 20
  • 164
  • 234
2

you can better use apply with join:

>>> df
    id               messages
0   11  I am not driving home
1   11      Please pick me up
2   11     I don't have money
3  103   The car already park
4  103     No need for ticket
5  104       I will buy a car
6  104       I will buy a car

>>> df.groupby('id')['messages'].apply(lambda x: ' '.join(x))
id
11     I am not driving home Please pick me up I don'...
103              The car already park No need for ticket
104                    I will buy a car I will buy a car
Name: messages, dtype: object
Hackaholic
  • 19,069
  • 5
  • 54
  • 72