
I am gathering insights from a platform that automates messages, and I'd like to group the messages by type and count how often each type occurs. Some of these messages are syntactically similar, others are not.

For example, my pandas dataframe currently looks something like this:

 message                                   | count
-------------------------------------------|-------
"Happy Birthday!"                          | 50
"Good luck on your first day of school!"   | 44
"Sent comms on 04042020"                   | 3
"Sent comms on 05031996"                   | 1
...
"Sent comms on 06052021"                   | 1
"Sent comms on 11042020"                   | 1
"Sent comms on 07202014"                   | 1

What I would like is to condense some rows together, so that all the rows which I know are just variations of the same comms metadata are counted as a single item, i.e.:

 message                                   | count
-------------------------------------------|-------
"Sent comms on XXXXXXXX"                   | 94
"Happy Birthday!"                          | 50
"Good luck on your first day of school!"   | 44

Is there any functionality in Pandas to do this? I'm slightly familiar with the various functions for aggregation, but I'm not sure how to aggregate in a dataframe based on substrings/conditionals.

Thanks for any pointers.

Ethan Fox
    You would have to in some way resolve those strings to a matching value, _then_ aggregate. To answer that question you'd likely need a more detailed definition of "similarity" – Henry Ecker Jul 20 '21 at 13:50
  • Assuming there's some consistency with the messages (you mention it's an automated platform), to do the exact transformation you have above I'd use python's string translate method (https://docs.python.org/3/library/stdtypes.html#str.translate), mapping all digits to 'X'. To apply this on the dataframe use df["message"].apply(function). You can then do a group by / aggregate as you mention. – daveydave Jul 20 '21 at 13:55
  • Here's an example: https://stackoverflow.com/a/12856384/15371702 more concise than I thought - didn't know about 'digits'! – daveydave Jul 20 '21 at 13:59
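
Below is a minimal sketch of the approach daveydave outlines in the comments: map every digit to 'X' with str.translate, then group and sum the counts. The sample dataframe and column names here are assumptions based on the question.

import pandas as pd
from string import digits

# Assumed sample data mirroring the question's dataframe.
df = pd.DataFrame({
    "message": [
        "Happy Birthday!",
        "Good luck on your first day of school!",
        "Sent comms on 04042020",
        "Sent comms on 05031996",
        "Sent comms on 06052021",
    ],
    "count": [50, 44, 3, 1, 1],
})

# Translation table that maps every digit to 'X', as suggested in the comments.
mask_digits = str.maketrans(dict.fromkeys(digits, "X"))

# Normalise the messages, then aggregate the counts of the now-identical rows.
df["message"] = df["message"].apply(lambda m: m.translate(mask_digits))
result = df.groupby("message", as_index=False)["count"].sum()
print(result.sort_values("count", ascending=False))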

1 Answer


If you are only interested in the beginning of the string, you can use something like the below, which groups on the first 13 characters:

df.groupby(df['message'].str[:13])['count'].sum()

This results in:

message
Good luck on      44
Happy Birthda     50
Sent comms on      7

Then you will need to put in some work to map the group keys back to proper names, but for that you need to be more specific about what counts as 'similarity' (e.g. how many characters at the beginning must match, etc.).
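
As a rough sketch of that last step (assuming the 13-character prefix is an acceptable definition of similarity for your data), a named aggregation can keep one full message per group as its label while summing the counts:

import pandas as pd

# Same assumed sample data as in the sketch after the comments above.
df = pd.DataFrame({
    "message": [
        "Happy Birthday!",
        "Good luck on your first day of school!",
        "Sent comms on 04042020",
        "Sent comms on 05031996",
        "Sent comms on 06052021",
    ],
    "count": [50, 44, 3, 1, 1],
})

# Group on the first 13 characters, but keep one full message per group
# as a readable label.
key = df["message"].str[:13].rename("prefix")
result = df.groupby(key).agg(
    message=("message", "first"),  # representative label for each group
    count=("count", "sum"),        # total occurrences across the group
)
print(result.sort_values("count", ascending=False))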

IoaTzimas