Combine duplicate rows in Pandas

Question

I got a dataframe where some rows contains almost duplicate values. I'd like to combine these rows as much as possible to reduce the row numbers. Let's say I got following dataframe:

One	Two	Three
A	B	C
B	B	B
C	A	B

In this example I'd like the output to be:

One	Two	Three
ABC	AB	CB

The real dataframe got thousands of rows with eight columns.

The csv from a dataframe-sample:

Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8
A,A,A,A,A,A,A,A
A,A,A,A,A,A,A,B
A,A,A,A,A,A,A,C
A,A,A,A,A,A,B,A
A,A,A,A,A,A,B,B
A,A,A,A,A,A,B,C
A,A,A,A,A,A,C,A
A,A,A,A,A,A,C,B
A,A,A,A,A,A,C,C
C,C,C,C,C,C,A,A
C,C,C,C,C,C,A,B
C,C,C,C,C,C,A,C
C,C,C,C,C,C,B,A
C,C,C,C,C,C,B,B
C,C,C,C,C,C,B,C
C,C,C,C,C,C,C,A
C,C,C,C,C,C,C,B
C,C,C,C,C,C,C,C

To easier show how desired outcome woud look like:

Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8
AC,AC,AC,AC,AC,AC,ABC,ABC

I've tried some code but I end up in real long code snippets which I doubt could be the best and most natural solution. Any suggestions?

Does duplicated values appear only in one column or they may be in multiple ones? — mattiatantardini, Nov 04 '22 at 19:58
They could be in all different columns. It's a lot of different combinations of events which could either be A, B or C. So I'd like to combine these combinations to get the same combinations as the initial dataframe but reduce the number of rows by combining them. — OldSport, Nov 04 '22 at 20:30
What makes the first two rows "joinable" but not the third? You could look into [cluster analysis](https://en.wikipedia.org/wiki/Cluster_analysis) and define precisely your metrics and merging criteria. Or you could also look into [locality sensitive hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing). Basically, you need to tell us more about your merging decision making. — Pierre D, Nov 04 '22 at 20:54
Oh, but now you've adjusted the question such that all examples show _all_ the rows collapsed into one. Is that desired? If not, let me ask again: in which cases do you merge rows and when do you not? A precise definition is essential to understand the task. In addition think about the case of "chaining" (elongated clusters) where rows 1 and 2 would be merged, as well as 2 and 3, etc., but not 1 and 3. Then do you not merge some of them? Make several clusters? — Pierre D, Nov 05 '22 at 16:04

Sarah Bennett · Answer 1 · 2022-11-07T16:29:56.973

2

You can use something like this:

df = df.groupby(['Two'])['One','Three'].apply(''.join).reset_index()

If you can provide a small bit of code that creates the first df it'd be easier to try out solutions.

Also this other post may help: pandas - Merge nearly duplicate rows based on column value

EDIT:

Does this get you the output you're looking for?

joined_df = df.apply(''.join, axis=0)

variation of this: Concatenate all columns in a pandas dataframe

edited Nov 07 '22 at 16:29

answered Nov 04 '22 at 19:58

Sarah Bennett

21
2

why not copy the table and use pd.read_clipboard() – Chris Nov 04 '22 at 19:58
1

I've edited the first post to contain more samples from the dataframe to illustrate better. – OldSport Nov 04 '22 at 20:28
I did not know that was a thing, @Chris, very helpful! – Sarah Bennett Nov 07 '22 at 16:23

score 1 · Answer 2 · answered Nov 04 '22 at 20:41

1

If your data are all characters you can end up with this solution and collapse everything to one single row:

import pandas as pd

data = pd.read_csv("path/to/data")
collapsed = data.astype(str).sum().applymap(lambda x: ''.join(set(x)))

Check this answer on how to get unique characters in a string.

answered Nov 04 '22 at 20:41

mattiatantardini

525
1
5
23

It's a good try but doesn't seem to be working for me. Each cell can hold maximum 3 unique characters. – OldSport Nov 04 '22 at 23:50
Why cells can hold maximum 3 unique characters? It is a requirement from your use case? If you have more than 3 unique characters in your whole data, how to choose then the merging strategy? – mattiatantardini Nov 08 '22 at 13:25

Combine duplicate rows in Pandas

2 Answers2