2

I want to remove data from one dataframe based on data in another dataframe. I found a way to do this (see below), but I was wondering if there was a more efficient way of doing that. Here's the code that I want to improve:

# -*- coding: utf-8 -*-

import pandas as pd

#df1 is the dataframe where I want to remove data from
d1 = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.], 'three' : [5.,6.,7.,8.] }
df1 = pd.DataFrame(d1)
df1.columns = ['one', 'two', 'three'] #Keeping the order of the columns as defined
print 'df1\n', df1
#print df1

#I want to remove all the rows from df1 that are also in df2
d2 = {'one' : [2., 4.], 'two' : [3., 1], 'three' : [6.,8.] }
df2 = pd.DataFrame(d2)
df2.columns = ['one', 'two', 'three'] #Keeping the order of the columns as defined
print 'df2\n', df2


#df3 is the output I want to get: it should have the same data as df1, but without the data that is in df2
df3 = df1

#Create some keys to help identify rows to be dropped from df1
df1['key'] = df1['one'].astype(str)+'-'+df1['two'].astype(str)+'-'+df1['three'].astype(str)
print 'df1 with key\n', df1
df2['key'] = df2['one'].astype(str)+'-'+df2['two'].astype(str)+'-'+df2['three'].astype(str)
print 'df2 with key\n', df2

#List of rows to remove from df1
rowsToDrop = []

#Building the list of rows to drop
for i in df1.index:
    if df1['key'].irow(i) in df2.ix[:,'key'].values:
        rowsToDrop.append(i)

#Dropping rows from df1 that are also in df2
for j in reversed(rowsToDrop):
    df3 = df3.drop(df3.index[j])   

df3.drop(['key'], axis=1, inplace=True)        

#Voilà!
print 'df3\n', df3

Thank you for your help.

Tony Mignot
  • 23
  • 1
  • 3
  • when you say `df3 = df1`, `df3` will reflect any changes that you make to `df1` and *vice versa*. you should say `df3 = df1.copy()` instead. – Paul H Nov 15 '14 at 18:09
  • Also, this isn't really a join operation; it's a selection. I think you should edit the title to reflect that. – Paul H Nov 15 '14 at 18:14
  • What I'm trying to do is what this website, http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins, calls "left excluding join". – Tony Mignot Nov 15 '14 at 23:15
  • But you only want the columns from one of the dataframes, right? Joins are used to align rows and columns from different tables. All you're doing is selecting the data based on elements that are incidentally stored in a different dataframe. There's a subtle difference. – Paul H Nov 15 '14 at 23:23

2 Answers2

1

This will work using the data frame df1 and the dict d2

df3 = df1[~df1.isin(d2)].dropna()

You can pass a df to isin() but I don't think it will you give you the results you're looking for because i believe it looks at the indexes too.

http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.isin.html

Bob Haffner
  • 8,235
  • 1
  • 36
  • 43
1

You're looking more for a syntax for selecting rows, not joining dataframes.'

A true left join would look like this:

import numpy as np
import pandas as pd

d1 = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.], 'three' : [5.,6.,7.,8.] }
df1 = pd.DataFrame(d1)
df1['key'] = df1['one'].astype(str)+'-'+df1['two'].astype(str)+'-'+df1['three'].astype(str)
df1.set_index('key', inplace=True)


d2 = {'one' : [2., 4.], 'two' : [3., 1], 'three' : [6.,8.] }
df2 = pd.DataFrame(d2)
df2['key'] = df2['one'].astype(str)+'-'+df2['two'].astype(str)+'-'+df2['three'].astype(str)
df2.set_index('key', inplace=True)

df1.join(df2, how='left', lsuffix='_df1', rsuffix='_df2')


             one_df1  three_df1  two_df1  one_df2  three_df2  two_df2
key                                                                  
1.0-4.0-5.0        1          5        4      NaN        NaN      NaN
2.0-3.0-6.0        2          6        3        2          6        3
3.0-2.0-7.0        3          7        2      NaN        NaN      NaN
4.0-1.0-8.0        4          8        1        4          8        1

Doing a right join:

df1.join(df2, how='right', lsuffix='_df1', rsuffix='_df2')

Produces this:

             one_df1  three_df1  two_df1  one_df2  three_df2  two_df2
key                                                                  
2.0-3.0-6.0        2          6        3        2          6        3
4.0-1.0-8.0        4          8        1        4          8        1
Paul H
  • 65,268
  • 20
  • 159
  • 136
  • Thanks Paul. Yes I have looked at DataFrame.join() but it doesn't give what I want in this case, since the rows from df2 are still in the output. – Tony Mignot Nov 15 '14 at 23:17
  • @TonyMignot I know -- my point is that since you don't actually want a join operation, you should edit the title of the question to better reflect what you actually want. Also, you can use `.dropna()` to remove those rows or use a `right` join. – Paul H Nov 15 '14 at 23:20