
Before I start, I am not sure which terminology I can use, so please don't be harsh about my misuse of terms like unique and duplicate.

A pandas DataFrame has three columns: A, B, and C. Two rows are considered the same if they have the same value in any one of the columns A, B, or C. If we have this table:

row num   A   B   C
      1  A1  B1  C1
      2  A1  B2  C2
      3  A2  B2  C3
      4  A3  B3  C3

Rows 1 and 2 are the same because column A matches, rows 2 and 3 because of B, and rows 3 and 4 because of C. Since 1 is a duplicate of 2, which is a duplicate of 3, which is a duplicate of 4, I expect the number of unique records here to be 1.
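
For reference, the table as a DataFrame (a minimal sketch; the name `df` and dropping the `row num` column are illustrative choices):

import pandas as pd

# the four sample rows from the table above
df = pd.DataFrame({'A': ['A1', 'A1', 'A2', 'A3'],
                   'B': ['B1', 'B2', 'B2', 'B3'],
                   'C': ['C1', 'C2', 'C3', 'C3']})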

How would I write Python pandas code to calculate that?

Sale

2 Answers


This looks like a graph problem (similar to this question: Merge lists that share common elements).

Here is my take on the problem:

Initial dataframe (I will use index, not row num):

   row num   A   B   C
0        1  A1  B1  C1
1        2  A1  B2  C2
2        3  A2  B2  C3
3        4  A3  B3  C3
4        5  A4  B4  C4
5        6  A5  B5  C5
6        7  A6  B4  C6

First, for each column we want to investigate, we collect the groups of row indices that share a duplicated value:

groups = []
for c in ["A", "B", "C"]:
    # mark every row whose value in column c appears more than once
    d = df.duplicated(subset=c, keep=False)
    # group the duplicated rows by their value in c and keep the index sets
    groups.extend([set(g.index) for _, g in df.groupby(df.loc[d, c])])

This creates groups that need to be merged on common elements:

[{0, 1}, {1, 2}, {4, 6}, {2, 3}]

Here we see that groups {0, 1}, {1, 2}, and {2, 3} share common elements, while group {4, 6} is separate. To merge them:

out = []
while groups:
    g1 = groups.pop()
    while True:
        for g2 in groups:
            if g1 & g2:           # g1 and g2 overlap -> merge them
                groups.remove(g2)
                g1 |= g2
                break
        else:                     # no remaining group overlaps g1
            out.append(g1)
            break

This produces out:

[{0, 1, 2, 3}, {4, 6}]

So there are 2 merged groups (= 2 "unique" records so far). But index 5 is in none of them:

print(df.index.difference(set.union(*out)))

Prints:

Index([5], dtype='int64')

So the final number of unique records is len(out) plus the length of that computed index.
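
Putting it together (a short sketch; the name n_unique is illustrative):

# merged groups, plus rows that belong to no group at all
singletons = df.index.difference(set.union(*out))
n_unique = len(out) + len(singletons)  # 2 + 1 = 3 for this dataframe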

Andrej Kesely
  • this seems to be the answer, as it gave the correct results for my sample dataset; it just takes ages to execute on a bigger dataset. Thanks! I just need to wait – Sale Aug 28 '23 at 18:33
  • @Sale If speed is a problem, the link in my answer points to an accepted solution that uses the `networkx` library (it should be faster than my computation of groups). You can give it a shot; a sketch of that idea follows these comments. – Andrej Kesely Aug 28 '23 at 18:53
  • @Sale I added a link on how to further speed up the `connected_components` method from the `networkx` library: https://zpz.github.io/blog/connected-components/ – Andrej Kesely Aug 28 '23 at 19:14
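
A minimal sketch of the networkx idea mentioned in the comments above (an illustration, not the exact linked solution): chain the indices inside each of the `groups` sets from the answer as edges, then count connected components instead of merging the sets by hand:

import networkx as nx

G = nx.Graph()
for g in groups:
    nodes = sorted(g)
    G.add_nodes_from(nodes)
    G.add_edges_from(zip(nodes, nodes[1:]))  # chain the indices of each group

# rows that appear in no group are unique on their own
singletons = df.index.difference(sorted(G.nodes))
n_unique = nx.number_connected_components(G) + len(singletons)  # 3 here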

Update: with your new example, using networkx. The idea is to treat each cell value as a graph node, link the values inside each row, and count the connected components of the resulting graph:

from numpy.lib.stride_tricks import sliding_window_view as swv
import networkx as nx
import pandas as pd

data = {'A': ['A1', 'A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A1', 'A7', 'A8'],
        'B': ['B1', 'B2', 'B2', 'B3', 'B4', 'B4', 'B4', 'B4', 'B4', 'B4'],
        'C': ['C1', 'C2', 'C3', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9']}
dft = pd.DataFrame(data)

cols = ['A', 'B', 'C']
# Build the edges (A, B) and (B, C) inside each row: sliding along axis=1
# keeps the windows within a row, so no spurious edge links the last cell
# of one row to the first cell of the next. Rows that share a value in the
# same column then land in the same connected component (values must not
# collide across columns; here they are prefixed A/B/C).
edges = pd.DataFrame(swv(dft[cols].values, 2, axis=1).reshape(-1, 2),
                     columns=['source', 'target'])
G = nx.from_pandas_edgelist(edges, create_using=nx.DiGraph)
uniq = nx.number_weakly_connected_components(G)

Output:

>>> uniq
1

>>> edges
   source target
0      A1     B1
1      B1     C1
2      A1     B2
3      B2     C2
4      A2     B2
5      B2     C3
6      A3     B3
7      B3     C3
8      A4     B4
9      B4     C4
10     A5     B4
11     B4     C5
12     A6     B4
13     B4     C6
14     A1     B4
15     B4     C7
16     A7     B4
17     B4     C8
18     A8     B4
19     B4     C9
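
Edge direction is irrelevant for this grouping, so an undirected graph works just as well (a variation on the snippet above, not part of the original code):

G = nx.from_pandas_edgelist(edges)  # nx.Graph by default
uniq = nx.number_connected_components(G)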

Old answer

You can use boolean masks:

m = pd.concat([df[col].duplicated() for col in ['A', 'B', 'C']], axis=1)

uniq = sum(~m.any(axis=1))  # number of unique rows

out = df[~m.any(axis=1)]  # you can also extract unique rows

Output:

>>> uniq
1

>>> m
       A      B      C
0  False  False  False  # never duplicated
1   True  False  False  # duplicated on A
2  False   True  False  # duplicated on B
3  False  False   True  # duplicated on C

>>> out
   row num   A   B   C
0        1  A1  B1  C1
Corralien
  • Your solution is very fast, but I don't believe it answers the issue. Consider the following code: `data = {'A': ['A1', 'A1', 'A2', 'A3'], 'B': ['B1', 'B2', 'B2', 'B3'], 'C': ['C1', 'C2', 'C3', 'C3'],}` `dft = pd.DataFrame(data)` `dft.loc[4] = ['A4', 'B4', 'C4']` `dft.loc[5] = ['A5', 'B4', 'C5']` `dft.loc[6] = ['A6', 'B4', 'C6']` `dft.loc[7] = ['A1', 'B4', 'C7']` `dft.loc[8] = ['A7', 'B4', 'C8']` `dft.loc[9] = ['A8', 'B4', 'C9']` It should again show only one row, but instead it shows 2 rows. – Sale Aug 28 '23 at 18:28
  • @Sale I don't understand. With your example, I have 2 rows (index 0 and 4), so `uniq` is 2. Check my update. – Corralien Aug 29 '23 at 06:11
  • Thanks for the update, that is the key: there is a connection between the two rows you marked unique. Row 7 and row 0 both have A1, so it connects the two groups into one. So the result would be 1 "unique" record. – Sale Aug 29 '23 at 08:16
  • Ok. I got it. Sorry for my misunderstanding. I updated my answer. – Corralien Aug 29 '23 at 12:30