Join strings by space using pandas aggregate function

Question

I have a CSV file that contains rows with some similar id's. I found a nice approach doing this with dataframe and I found the code doing this from [this] 1 post.

Sample CSV file:

    id               messages
0   11  I am not driving home
1   11      Please pick me up
2   11     I don't have money
3  103   The car already park
4  103     No need for ticket
5  104       I will buy a car
6  104       I will buy a car

Desired output is:

Sample CSV file:

id   messages        
011   I am not driving home Please pick me up I don't have money     
103   The car already park No need for ticket         
104   I will buy a car

Now the code that I have so far is:

aggregation_functions = {'message':'sum'}
df_new = df.groupby(df['id']).aggregate(aggregation_functions)

Now what I am getting with this code is:

id   messages        
011   I am not driving homePlease pick me upI don't have money      
103   The car already parkNo need for ticket         
104   I will buy a car

I just want to have the space between words (eg. "homePlease" > "home Please") and avoid redundancy such as having two times of I will buy a car.

I already checked the post 2 but I couldn't find my answer.

Also do i need to use .reindex(columns=df.columns) after the aggregate(aggregation_functions)

Like:

df_new = df.groupby(df['id']).aggregate(aggregation_functions).reindex(columns=df.columns)

@WeNYoBen Thank you. It worked. I also wondering if `reindex` is necessary to use? thanks — Bilgin, May 31 '19 at 01:10
Reindex has nothing to do here, that only changes the index/columns and/or its ordering. You want to aggregate the contents, not modify the structure. — cs95, May 31 '19 at 01:15

cs95 · Accepted Answer · 2019-05-31T03:07:52.487

3

To remove redundancy, I suggest GroupBy.unique followed by str.join:

df.groupby('id')['messages'].unique().str.join(' ')

Alternatively, using GroupBy.agg with set + ' '.join:

df.groupby('id')['messages'].agg(lambda x: ' '.join(set(x)))

Both print,

# id
# 11     I don't have money I am not driving home Pleas...
# 103              No need for ticket The car already park
# 104                                     I will buy a car
# Name: messages, dtype: object

To return a DataFrame, call reset_index at the end... for example,

df.groupby('id')['messages'].unique().str.join(' ').reset_index()

#     id                                           messages
# 0   11  I am not driving home Please pick me up I don'...
# 1  103            The car already park No need for ticket
# 2  104                                   I will buy a car

edited May 31 '19 at 03:07

answered May 31 '19 at 01:11

cs95

379,657
97
704
746

Thanks for your comment. I need to have the header `id` and `message`, because I will use them in my dataframe. But based on your code and output, I am not able to have the header. Can you help. Thanks – Bilgin May 31 '19 at 02:25
@Bilgin It's as simple as `result = df.groupby('id')['messages'].unique().str.join(' ').reset_index()` – cs95 May 31 '19 at 02:27
thanks. Sorry, i am new in python. how about if I use `df.groupby('id')['messages'].unique().str.join(' ')`? is it the same? – Bilgin May 31 '19 at 02:29
@Bilgin That returns a Series with ID as the index. If you call `.reset_index()` on that result, it converts the index into a column, and the overall result is a DataFrame. Take a look at the code and let me know if there are other issues. – cs95 May 31 '19 at 02:35
I understand now. Thanks. Now it is returning the header. One last question is that how can I get rid of that index number on the left `0,1,2,..`. I am doing sum results generated and saving them into CSV file and those numbers appearing in the CSV file with no header and it is confusing. Thanks for your time – Bilgin May 31 '19 at 02:44
@Bilgin Add a parameter `index=False` when you call `df.to_csv`. – cs95 May 31 '19 at 03:02
Worked perfectly. Thanks – Bilgin May 31 '19 at 03:05

score 3 · Answer 2 · answered May 31 '19 at 01:14

3

So it will be drop_duplicates first , the agg join

df.drop_duplicates().groupby('id',as_index=False).messages.agg(' '.join)

answered May 31 '19 at 01:14

BENY

317,841
20
164
234

score 2 · Answer 3 · answered May 31 '19 at 01:10

you can better use apply with join:

>>> df
    id               messages
0   11  I am not driving home
1   11      Please pick me up
2   11     I don't have money
3  103   The car already park
4  103     No need for ticket
5  104       I will buy a car
6  104       I will buy a car

>>> df.groupby('id')['messages'].apply(lambda x: ' '.join(x))
id
11     I am not driving home Please pick me up I don'...
103              The car already park No need for ticket
104                    I will buy a car I will buy a car
Name: messages, dtype: object

Join strings by space using pandas aggregate function

3 Answers3