
Is there a more efficient way to get all rows that are related in some way to any other row in the same df? (Equality is used in this example; the actual function is a bit more complex.)

import pandas as pd
from pydataset import data
df = data('iris')
df = df[df.index<10]
#adding ID col for ease of ref
df['ID'] = df.index
df

    Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species  ID
1           5.1          3.5           1.4          0.2  setosa   1
2           4.9          3.0           1.4          0.2  setosa   2
3           4.7          3.2           1.3          0.2  setosa   3
4           4.6          3.1           1.5          0.2  setosa   4
5           5.0          3.6           1.4          0.2  setosa   5
6           5.4          3.9           1.7          0.4  setosa   6
7           4.6          3.4           1.4          0.3  setosa   7

All I can come up with is a df.copy() and then loop:

df_copy = df.copy()
df_want = pd.DataFrame(columns=['ID','Sepal.Length','Sepal.Width','ExID', 'ExSepal.Length', 'ExSepal.Width'])

for row in range(0, df.shape[0]):
    for row2 in range(0, df_copy.shape[0]):
        if (df.iloc[row]['ID'] != df_copy.iloc[row2]['ID'] and
            df.iloc[row]['Sepal.Length'] == df_copy.iloc[row2]['Sepal.Length']):
                new_row = pd.DataFrame([{'ID': df.iloc[row]['ID'],
                                         'Sepal.Length': df.iloc[row]['Sepal.Length'],
                                         'Sepal.Width': df.iloc[row]['Sepal.Width'],
                                         'ExID': df_copy.iloc[row2]['ID'],
                                         'ExSepal.Length': df_copy.iloc[row2]['Sepal.Length'],
                                         'ExSepal.Width': df_copy.iloc[row2]['Sepal.Width']}])
                # DataFrame.append was removed in pandas 2.0; concat instead
                df_want = pd.concat([df_want, new_row], ignore_index=True)
df_want

   ID  Sepal.Length  Sepal.Width  ExID  ExSepal.Length  ExSepal.Width
0  4.0           4.6          3.1   7.0             4.6            3.4
1  5.0           5.0          3.6   8.0             5.0            3.4
2  7.0           4.6          3.4   4.0             4.6            3.1
3  8.0           5.0          3.4   5.0             5.0            3.6

So row ID 4 matches row 7, 5 matches 8, and so on.

I searched quite a bit; the closest I found was "Select rows from a DataFrame based on values in a column in pandas", but I am struggling to apply isin() in this case.

Grizzly2501
  • Let me get this straight: you want to compare rows of the same df where a column value, namely Sepal.Length, is equal, right? Because if that's the case, use group-by and pick the groups whose length is more than one – iamklaus Oct 02 '18 at 09:51
  • Apologies, I should clarify a bit further: the output listed above is exactly what I need. The `ExID` column has to be there. A join/merge does not work here because the function being applied is not simply an equality check – Grizzly2501 Oct 02 '18 at 10:34
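A join/merge keys on equality, but the relation here can be an arbitrary pairwise function. One alternative that accommodates that is a full cross join of the frame with itself, filtered by the relation afterwards — a sketch, not from the thread (`how='cross'` requires pandas >= 1.2, and the equality test below merely stands in for the real function):

```python
import pandas as pd

# Minimal stand-in for the duplicate rows of the question's frame
df = pd.DataFrame({
    'ID': [4, 5, 7, 8],
    'Sepal.Length': [4.6, 5.0, 4.6, 5.0],
    'Sepal.Width': [3.1, 3.6, 3.4, 3.4],
})

# Cartesian product of the frame with itself; right-hand columns get "_Ex"
pairs = df.merge(df, how='cross', suffixes=('', '_Ex'))

# Drop self-pairs, then apply the pairwise relation -- equality here,
# but this could be any vectorised condition on the paired columns
pairs = pairs[pairs['ID'] != pairs['ID_Ex']]
related = pairs[pairs['Sepal.Length'] == pairs['Sepal.Length_Ex']]
print(related[['ID', 'Sepal.Length', 'ID_Ex', 'Sepal.Length_Ex']])
```

The cross join is O(n²) in memory, so this only replaces the double loop for frames small enough to materialise the product.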

3 Answers


Try the following 2 approaches. The first, as suggested by @Sarthak Negi, uses group-by:

df.groupby('Sepal.Length', as_index=True).apply(lambda x: x if len(x)>1 else None)

The second approach is simply dropping the non-duplicate values:

ndf = df.drop(df.drop_duplicates(subset='Sepal.Length', keep=False).index)
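As a quick check of the second approach on the question's data (the slice is rebuilt by hand here, in case pydataset is unavailable), the filter keeps exactly rows 4, 5, 7 and 8:

```python
import pandas as pd

# First nine rows of iris, rebuilt by hand (pydataset not assumed)
df = pd.DataFrame({
    'Sepal.Length': [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4],
    'Sepal.Width':  [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9],
}, index=range(1, 10))
df['ID'] = df.index

# drop_duplicates(keep=False) keeps only the *unique* Sepal.Length rows;
# dropping those from df leaves only the duplicated ones
ndf = df.drop(df.drop_duplicates(subset='Sepal.Length', keep=False).index)
print(ndf['ID'].tolist())   # -> [4, 5, 7, 8]
```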

Edit: add ExId

This was a bit complicated and it might not look pretty. The approach here is to create a d1 dataframe that contains the first duplicates and a d2 that contains the last duplicates, then assign the d1 IDs to d2 and vice versa.

# keep first duplicates 
d1=ndf.drop_duplicates(subset='Sepal.Length').reset_index(drop=True)

# Keep last duplicates
d2=ndf.drop_duplicates(subset='Sepal.Length', keep='last').reset_index(drop=True)

d1['ExId'] = d2.ID
d2['ExId'] = d1.ID

# concatenate (DataFrame.append was removed in pandas 2.0)
pd.concat([d1, d2]).reset_index(drop=True)

Output

    Sepal.Length    Sepal.Width Petal.Length    Petal.Width Species ID ExId
0   4.6             3.1         1.5             0.2         setosa  4  7
1   5.0             3.6         1.4             0.2         setosa  5  8
2   4.6             3.4         1.4             0.3         setosa  7  4
3   5.0             3.4         1.5             0.2         setosa  8  5
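Note that the first/last split above assumes each key value occurs exactly twice. If a value may repeat more than twice, a self-merge on the key column is one way to get every (ID, ExId) pair in a single step — a sketch, not part of the original answer:

```python
import pandas as pd

# Stand-in for ndf: only the duplicated rows of the question's frame
ndf = pd.DataFrame({
    'Sepal.Length': [4.6, 5.0, 4.6, 5.0],
    'Sepal.Width':  [3.1, 3.6, 3.4, 3.4],
    'ID': [4, 5, 7, 8],
})

# Self-merge on the key column, then drop the self-matches
out = ndf.merge(ndf[['Sepal.Length', 'ID']].rename(columns={'ID': 'ExId'}),
                on='Sepal.Length')
out = out[out['ID'] != out['ExId']].reset_index(drop=True)
print(out)
```

With groups larger than two, each ID appears once per partner, so `ExId` naturally covers all matches rather than just one.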
gyx-hh
  • That output is partially correct. I need to know which row the match comes from, as per my output (col name ExID) – Grizzly2501 Oct 02 '18 at 10:23
  • Ah sorry, I missed that. Quick question regarding this: can there be more than 2 IDs that match? – gyx-hh Oct 02 '18 at 10:32
  • @Grizzly2501 Just updated it.. Let me know if this works for you – gyx-hh Oct 02 '18 at 11:04
  • I thought you were expanding upon the `lambda` function. Sorry, I should perhaps have bolded more portions of my title. This won't work because I am not interested in removing duplicates or in checking only for equality (that was just to illustrate my problem); I am checking whether one thing is related in some way to another via a function. Hence the `lambda` solution may prove useful if I can obtain `ExID` – Grizzly2501 Oct 02 '18 at 11:22
  • Ok, and by related you mean 2 rows share a value (equal) in one of the columns? This would also mean your `ExID` column will be a list of all IDs that are similar? In your example, ID 5 shares `Sepal.Length` with ID 8 and shares `Petal.Length` with IDs 7 and 8, so your `ExID` for ID 4 will be [7,8]? – gyx-hh Oct 02 '18 at 11:53

A solution may be to build a boolean filter on the duplicates in each column and combine them into a total filter.

Your example only combined the first 2 columns, so this code does the same; change `df.columns[:2]` to select a different set of columns.

Note that to combine them, all the filters need the same name, which is why I rename them.

import pandas as pd
from pydataset import data
df = data('iris')
df = df[df.index<10]
#adding ID col for ease of ref
df['ID'] = df.index

total_filter = None

for c in df.columns[:2]:
    print('checking column', c)
    # flag every row whose value in this column appears more than once
    col_filter = df[c].duplicated(keep=False).rename('dupe')
    if total_filter is None:
        total_filter = col_filter
    else:
        total_filter = total_filter | col_filter

print(df[total_filter])

Result

checking column Sepal.Length
checking column Sepal.Width
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species  ID
4           4.6          3.1           1.5          0.2  setosa   4
5           5.0          3.6           1.4          0.2  setosa   5
7           4.6          3.4           1.4          0.3  setosa   7
8           5.0          3.4           1.5          0.2  setosa   8
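For what it's worth, the same mask can also be built in a single expression over the selected columns — a sketch, equivalent to the loop above:

```python
import pandas as pd

# First nine rows of iris, rebuilt by hand
df = pd.DataFrame({
    'Sepal.Length': [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4],
    'Sepal.Width':  [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9],
}, index=range(1, 10))
df['ID'] = df.index

# Per-column duplicate flags, OR-ed row-wise across the selected columns
mask = df[df.columns[:2]].apply(lambda s: s.duplicated(keep=False)).any(axis=1)
print(df[mask]['ID'].tolist())   # -> [4, 5, 7, 8]
```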
576i

Another approach. The results are not in the format you mentioned; they are grouped.

data = pd.read_csv('iris.data.txt', sep=',', header=None)
# the raw file has 5 columns; ID is added afterwards from the index
data.columns = ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']
data['ID'] = data.index

# I guess you don't want these
data.drop(['Petal.Width', 'Petal.Length', 'Species'], axis=1, inplace=True)

def check(data):
    if len(data) > 1:
        index_list = list(data.index.values)
        index_list.append(index_list[0])
        data['ExSepal.Length'] = data['Sepal.Length']
        data['ExSepal.Width'] = data['Sepal.Width']
        data['ExId'] = [int(index_list[i]) for i in range(1,len(index_list))]
        return data

data.groupby('Sepal.Length').apply(check)

Output

                 Sepal.Length  Sepal.Width  ID  ExSepal.Length  ExSepal.Width  \
Sepal.Length                                                                    
4.4          8            4.4          2.9   8             4.4            2.9   
             38           4.4          3.0  38             4.4            3.0   
             42           4.4          3.2  42             4.4            3.2   
4.6          3            4.6          3.1   3             4.6            3.1   
             6            4.6          3.4   6             4.6            3.4   
             22           4.6          3.6  22             4.6            3.6   
             47           4.6          3.2  47             4.6            3.2   
4.7          2            4.7          3.2   2             4.7            3.2   
             29           4.7          3.2  29             4.7            3.2   
4.8          11           4.8          3.4  11             4.8            3.4   

                 ExId  
Sepal.Length           
4.4          8     38  
             38    42  
             42     8  
4.6          3      6  
             6     22  
             22    47  
             47     3  
4.7          2     29  
             29     2  
4.8          11    12  
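The `ExId` rotation inside `check` pairs each group member with the next one, wrapping the last back to the first. A minimal standalone illustration of that index-shift trick, using the 4.4 group's row indices from the output above:

```python
# Row indices of one duplicate group (the 4.4 group above)
index_list = [8, 38, 42]

# Append the first index so each row can point at the "next" one,
# with the last row wrapping back to the first
index_list.append(index_list[0])          # [8, 38, 42, 8]
ex_ids = [index_list[i] for i in range(1, len(index_list))]
print(ex_ids)   # -> [38, 42, 8]
```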
iamklaus