
Many thanks for reading.

I have a pandas data frame of roughly 200,000 rows and 46 columns. 23 of these columns end in "_1" and the other 23 end in "_2". For example:

forename_1   surname_1   area_1   forename_2   surname_2   area_2
george       neil        g        jim          bob         k
jim          bob         k        george       neil        g
pete         keith       k        dan          joe         q
dan          joe         q        pete         keith       k
ben          steve       w        richard      ed          p
charlie      david       s        graham       josh        l

I have successfully removed exact duplicates using drop_duplicates, but now want to remove rows that are duplicates except that the two groups (_1 and _2) have been swapped.

That is, for one row, I want to compare the combined values in forename_1, surname_1 and area_1 with the combined values in forename_2, surname_2 and area_2 for all other rows.

The kind of test I am looking to use would be something like:

If "forename_1 + surname_1 + area_1 + forename_2 + surname_2 + area_2" = "forename_2 + surname_2 + area_2 + forename_1 + surname_1 + area_1", then de-duplicate

I would want to keep only the first row out of each set of duplicates (e.g. keep='first').
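The kind of key that would make such a test possible could be sketched like this (a sketch only; the two-row sample frame and the sorted-tuple key are my own illustration, not tested on the real 200,000-row data):

```python
import pandas as pd

# Two-row sample in the question's shape; row 2 is row 1 with the
# _1 and _2 groups swapped.
df = pd.DataFrame({
    'forename_1': ['george', 'jim'], 'surname_1': ['neil', 'bob'], 'area_1': ['g', 'k'],
    'forename_2': ['jim', 'george'], 'surname_2': ['bob', 'neil'], 'area_2': ['k', 'g'],
})

# Build an order-independent key: sort the two (forename, surname, area)
# triplets within each row so a row and its inverted twin share the same key.
group_1 = list(zip(df['forename_1'], df['surname_1'], df['area_1']))
group_2 = list(zip(df['forename_2'], df['surname_2'], df['area_2']))
key = pd.Series([tuple(sorted(pair)) for pair in zip(group_1, group_2)],
                index=df.index)

# Drop the later row of each inverted pair, keeping the first.
deduped = df[~key.duplicated(keep='first')]
```

Here only the first of the two sample rows survives, since both rows produce the same sorted pair of triplets.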

To help explain, there are two cases above where a duplicate would need to be removed:

forename_1   surname_1   area_1   forename_2   surname_2   area_2
george       neil        g        jim          bob         k
jim          bob         k        george       neil        g


forename_1   surname_1   area_1   forename_2   surname_2   area_2
pete         keith       k        dan          joe         q
dan          joe         q        pete         keith       k

In each case, row 2's "forename_2 + surname_2 + area_2 + forename_1 + surname_1 + area_1" equals row 1's "forename_1 + surname_1 + area_1 + forename_2 + surname_2 + area_2" (e.g. george + neil + g + jim + bob + k), etc.

In each case, the second row of the two would be removed, meaning my expected output would be:

forename_1   surname_1   area_1   forename_2   surname_2   area_2
george       neil        g        jim          bob         k
pete         keith       k        dan          joe         q
ben          steve       w        richard      ed          p
charlie      david       s        graham       josh        l

I have seen an answer that deals with this in R, but is there also a way that this can be done in Python?

Compare group of two columns and return index matches R

Many thanks.

Joseph0210

3 Answers


Use:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.sort(df.values, axis=1), index=df.index).drop_duplicates()
print (df1)
         0      1       2        3      4     5
0      bob      g  george      jim      k  neil
2      dan    joe       k    keith   pete     q
4      ben     ed       p  richard  steve     w
5  charlie  david  graham     josh      l     s

df2 = df.loc[df1.index]
print (df2)
  forename_1 surname_1 area_1 forename_2 surname_2 area_2
0     george      neil      g        jim       bob      k
2       pete     keith      k        dan       joe      q
4        ben     steve      w    richard        ed      p
5    charlie     david      s     graham      josh      l

With the extra row from the comments (charlie, david, s, jim, bob, k) appended, the sorted values per row are:

print (pd.DataFrame(np.sort(df.values, axis=1), index=df.index))
         0        1       2        3      4     5
0      bob        g  george      jim      k  neil
1      bob        g  george      jim      k  neil
2      dan      joe       k    keith   pete     q
3      dan      joe       k    keith   pete     q
4      ben       ed       p  richard  steve     w
5  charlie    david  graham     josh      l     s
6      bob  charlie   david      jim      k     s

df1 = pd.DataFrame(np.sort(df.values, axis=1), index=df.index).drop_duplicates()
print (df1)
         0        1       2        3      4     5
0      bob        g  george      jim      k  neil
2      dan      joe       k    keith   pete     q
4      ben       ed       p  richard  steve     w
5  charlie    david  graham     josh      l     s
6      bob  charlie   david      jim      k     s

df2 = df.loc[df1.index]
print (df2)
  forename_1 surname_1 area_1 forename_2 surname_2 area_2
0     george      neil      g        jim       bob      k
2       pete     keith      k        dan       joe      q
4        ben     steve      w    richard        ed      p
5    charlie     david      s     graham      josh      l
6    charlie     david      s        jim       bob      k
jezrael
  • Thanks for the reply jezrael. I want to keep the pairs of names on each row together. I think if the two groups (_1 and _2) are stacked and unstacked then this results in some of the pairs being lost. – Joseph0210 Sep 04 '17 at 12:34
  • Yes, dupes will be lost; they are replaced by Nones. Would it not be better to create a 3-column dataframe in the output? (idea only) – jezrael Sep 04 '17 at 12:36
  • In your example above you added a fifth row with row values "charlie, david, s, jim, bob, k". I would not want to lose this row, as there does not exist a second row in the dataset with the column values of jim, bob, k, charlie, david, s. – Joseph0210 Sep 04 '17 at 12:41
  • I added another solution: it sorts all values per row and removes duplicates, then selects by loc only the rows with the same index as the deduplicated df. – jezrael Sep 04 '17 at 12:43
  • I tested it with the 5th row and it does not remove it. – jezrael Sep 04 '17 at 12:45
  • Thanks a lot for your help, I will test this on a sample of my data and check the results – Joseph0210 Sep 04 '17 at 13:01

I think there is a problem with using np.sort(df.values, axis=1). While it sorts each row independently (good), it does not respect which column the values come from (bad). In other words, these two hypothetical rows

forename_1   surname_1   area_1   forename_2   surname_2   area_2
    george        neil        g          jim         bob        k
    george        jim         k         neil         bob        g

would get sorted identically:

In [377]: np.sort(np.array([['george', 'neil', 'g', 'jim', 'bob', 'k'],
   .....:                   ['george', 'jim', 'k', 'neil', 'bob', 'g']]), axis=1)
Out[377]:
array([['bob', 'g', 'george', 'jim', 'k', 'neil'],
       ['bob', 'g', 'george', 'jim', 'k', 'neil']],
      dtype='<U6')

even though their (forename, surname, area) triplets are different.

To handle this possibility, we could instead use jezrael's original stack/unstack approach, with a df.sort_values sandwiched in the middle:

import numpy as np
import pandas as pd
df = pd.DataFrame(
    {'area_1': ['g', 'k', 'k', 'k', 'q', 'w', 's'],
     'area_2': ['k', 'g', 'g', 'q', 'k', 'p', 'l'],
     'forename_1': ['george', 'george', 'jim', 'pete', 'dan', 'ben', 'charlie'],
     'forename_2': ['jim', 'neil', 'george', 'dan', 'pete', 'richard', 'graham'],
     'surname_1': ['neil', 'jim', 'bob', 'keith', 'joe', 'steve', 'david'],
     'surname_2': ['bob', 'bob', 'neil', 'joe', 'keith', 'ed', 'josh']})

def using_stack_sort_unstack(df):
    df = df.copy()
    df.columns = df.columns.str.split('_', expand=True)
    df2 = df.stack()
    df2 = df2.sort_values(by=['forename', 'surname', 'area'])
    colnum = (df2.groupby(level=0).cumcount()+1).astype(str)
    df2.index = pd.MultiIndex.from_arrays([df2.index.get_level_values(0), colnum])
    df2 = df2.unstack().drop_duplicates()
    df2.columns = df2.columns.map('_'.join)
    return df2

print(using_stack_sort_unstack(df))

yields

  area_1 area_2 forename_1 forename_2 surname_1 surname_2
0      g      k     george        jim      neil       bob
1      k      g     george       neil       jim       bob
3      q      k        dan       pete       joe     keith
5      w      p        ben    richard     steve        ed
6      s      l    charlie     graham     david      josh

The purpose of the stack/sort/unstack operations:

    df2 = df.stack()
    df2 = df2.sort_values(by=['forename', 'surname', 'area'])
    colnum = (df2.groupby(level=0).cumcount()+1).astype(str)
    df2.index = pd.MultiIndex.from_arrays([df2.index.get_level_values(0), colnum])
    df2 = df2.unstack().drop_duplicates()

is to sort the ('forename', 'surname', 'area') triplets in each row individually. The sorting helps drop_duplicates identify (and drop) rows which we want to consider identical.
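As a quick sanity check (plain Python, with the values taken from the two hypothetical rows above), sorting whole triplets keeps the rows distinct while a flat sort of all six values does not:

```python
# Triplets from the two hypothetical rows above
row_a = [('george', 'neil', 'g'), ('jim', 'bob', 'k')]
row_b = [('george', 'jim', 'k'), ('neil', 'bob', 'g')]

# Sorting whole (forename, surname, area) triplets distinguishes the rows...
assert sorted(row_a) != sorted(row_b)

# ...but a flat sort of all six values (what np.sort(..., axis=1) does)
# makes them look identical.
assert sorted(sum(row_a, ())) == sorted(sum(row_b, ()))
```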


This shows the difference between using_stack_sort_unstack and using_npsort. Notice that using_npsort(df) returns 4 rows while using_stack_sort_unstack(df) returns 5 rows:

def using_npsort(df):
    df1 = pd.DataFrame(np.sort(df.values, axis=1), index=df.index).drop_duplicates()
    df2 = df.loc[df1.index]
    return df2
print(using_npsort(df))

#   area_1 area_2 forename_1 forename_2 surname_1 surname_2
# 0      g      k     george        jim      neil       bob
# 3      k      q       pete        dan     keith       joe
# 5      w      p        ben    richard     steve        ed
# 6      s      l    charlie     graham     david      josh
unutbu
  • Thanks very much for your answer, I did wonder if the sorting in the previous answer would get rid of some pairs which weren't duplicates. I will test this updated answer soon and let you know if it is successful, thanks! – Joseph0210 Sep 04 '17 at 17:03
  • Will this method work when there are also columns with integer values? It seems I may have to convert these to strings before doing the sorting? Many thanks again for the help – Joseph0210 Sep 05 '17 at 07:56
  • The method above should work fine with numeric column values, AFAICS. But if you can post an example which demonstrates the problem, we'll take a look at it. – unutbu Sep 05 '17 at 15:32

I had this problem too. I have a data frame where rows may have the two columns Sample1 and Sample2 swapped. My example may be a bit more robust than necessary, but I wanted to be sure.

import pandas as pd

data = pd.DataFrame({
    'Sample1': ['AT1', 'AT1', 'AT1', 'AT1', 'AT2', 'AT2', 'AT2', 'AT2', 'AT3', 'AT3',
                'AT3', 'AT3', 'AT4', 'AT4', 'AT4', 'AT4', 'AT5', 'AT5', 'AT5', 'AT5'],
    'Sample2': ['AT2', 'AT3', 'AT4', 'AT5', 'AT1', 'AT3', 'AT4', 'AT5', 'AT1', 'AT2',
                'AT4', 'AT5', 'AT1', 'AT2', 'AT3', 'AT5', 'AT1', 'AT2', 'AT3', 'AT4'],
    's1v': [53, 69, 44, 54, 27, 60, 11, 98, 16, 48, 50, 68, 89, 9, 20, 0, 14, 42, 1, 85],
    's2v': [27, 16, 89, 14, 53, 48, 9, 42, 69, 60, 20, 1, 44, 11, 50, 85, 54, 98, 68, 0],
    'v': [74, 0, 36, 87, 74, 87, 64, 64, 0, 87, 2, 54, 36, 64, 2, 51, 87, 64, 54, 51]
})

# Add duplicate marker column
data['duplicate'] = False

# Scan table for rows whose Sample1/Sample2 pair is an inversion of an earlier row
for index, row in data.iterrows():
    s1 = row['Sample1']
    s2 = row['Sample2']
    # Re-read the row from the frame: `row` is a copy and does not reflect
    # the 'duplicate' flags set on earlier iterations
    data_row = data.loc[index]
    if data_row['duplicate'] == False:
        swapped = (data['Sample1'] == s2) & (data['Sample2'] == s1)
        if swapped.any():
            # Mark the swapped twin(s) as duplicates
            data.loc[swapped, 'duplicate'] = True

# Subset data
data = data.loc[data['duplicate'] == False]

I couldn't use row from iterrows() because it doesn't contain the updated values in the dataframe, which is why I have data_row...
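For comparison, the same keep-first behaviour can be had without the explicit loop (my own sketch, not part of the original answer): build a per-row key that is identical for a row and its Sample1/Sample2-swapped twin, then use duplicated(keep='first').

```python
import pandas as pd

# Small sample in the same shape as the answer's data frame; rows 1 and 3
# are rows 0 and 2 with Sample1/Sample2 swapped.
data = pd.DataFrame({
    'Sample1': ['AT1', 'AT2', 'AT1', 'AT3'],
    'Sample2': ['AT2', 'AT1', 'AT3', 'AT1'],
    'v':       [74,    74,    0,     0],
})

# Order-independent key per row: the sorted (Sample1, Sample2) pair.
key = data.apply(lambda r: tuple(sorted((r['Sample1'], r['Sample2']))), axis=1)

# Keep the first occurrence of each swapped pair.
deduped = data[~key.duplicated(keep='first')]
```

This keeps rows 0 and 2 and drops their swapped twins, matching the loop's result on this sample.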

ryang