Keep values of a dataframe that are contained in another dataframe

Question

I have two dataframes whose cells contain comma-separated lists, and I want to keep the elements of the first dataframe that are also contained in the corresponding row of the second dataframe. Is this possible, or must I try some other data structure?

example of input:

df1:

elem1
a,c,v,b,n
b
c,x,a

df2:

elem2
j,k,a,i,v
o,b
g,f,w

expected output:

elem
a,v
b
NaN

The kernel dies, maybe because I have a lot of data... – mnmbs Nov 16 '15 at 17:04

Nader Hisham · Accepted Answer · 2015-11-16T17:25:39.917

1

First of all, you can create a regular expression from the letters you want to match:

In [77]:
chars = df2.elem2.str.replace(',' , '|')
chars
Out[77]:
0    j|k|a|i|v
1          o|b
2        g|f|w
Name: elem2, dtype: object

Then concatenate both into one dataframe so that a custom function can be applied to each row later:

In [24]:
to_compare = pd.concat([df1 , chars] , axis = 1)
to_compare
Out[24]:
       elem1    elem2
0   a,c,v,b,n   j|k|a|i|v
1   b           o|b
2   c,x,a       g|f|w

Finally, use your regular expression to match the data from elem1:

In [76]:
import re  # needed for re.findall
to_compare.apply( lambda x : ','.join(re.findall(x['elem2'] , x['elem1'])) , axis = 1)
Out[76]:
0    a,v
1      b
2       
dtype: object
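
The pattern above works for single letters. As a hedged variation (not part of the original answer), the sketch below escapes each token with `re.escape` and only matches whole comma-separated tokens, which matters if the elements can be longer than one character or contain regex metacharacters. The helper name `keep_common` is hypothetical, and tokens are assumed to be non-empty.

```python
import re

import pandas as pd

df1 = pd.DataFrame({"elem1": ["a,c,v,b,n", "b", "c,x,a"]})
df2 = pd.DataFrame({"elem2": ["j,k,a,i,v", "o,b", "g,f,w"]})


def keep_common(elem1, elem2):
    """Return the tokens of elem1 that also appear in elem2, comma-joined.

    Hypothetical helper; assumes both arguments are comma-separated strings
    with non-empty tokens.
    """
    # Escape each token so characters such as '.' or '+' are matched literally.
    alternatives = "|".join(re.escape(tok) for tok in elem2.split(","))
    # Require a comma or a string boundary on both sides, so a token like 'a'
    # cannot match inside a longer token like 'ab'.
    pattern = r"(?<![^,])(?:" + alternatives + r")(?![^,])"
    return ",".join(re.findall(pattern, elem1))


# Pair the rows positionally; both frames are assumed to have the same length.
result = pd.Series(
    [keep_common(a, b) for a, b in zip(df1["elem1"], df2["elem2"])],
    index=df1.index,
)
print(result)
```

Rows without a common element still come out as empty strings here, as in the answer's own output.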

If you want to convert the empty strings in the final result to NaN, I'll leave you to figure that out on your own :-)
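
One possible way to do that conversion, sketched here with a stand-in Series rather than the real result, is `Series.replace` with `numpy.nan`. The variable name `result` is an assumption, since the answer does not assign the output of `apply` to anything.

```python
import numpy as np
import pandas as pd

# Stand-in for the Series returned by the apply() call above.
result = pd.Series(["a,v", "b", ""])

# Replace empty strings with NaN; all other values are left unchanged.
result = result.replace("", np.nan)
print(result)
```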

edited Nov 16 '15 at 17:25

answered Nov 16 '15 at 17:19

Thank you but i get an error at this line `to_compare.apply( lambda x : ','.join(re.findall(x['elem2'] , x['elem1'])) , axis = 1)` at re. why? – mnmbs Nov 16 '15 at 19:47
Have you imported the `re` module (`import re`)? – Nader Hisham Nov 16 '15 at 19:53
Is there any way to make it run with dataframes of different size? – mnmbs Nov 16 '15 at 20:21

score 1 · Answer 2 · edited May 23 '17 at 11:45

First, the columns are converted to lists with str.split.

If the indexes are the same in both dataframes, you can simply add a column from one dataframe to the other.

Then apply the intersection of the sets built from the two list columns, and convert the result back to a list. You have to use axis=1 so that apply calls the function on each row.

print(df)
#       elem1
#0  a,c,v,b,n
#1          b
#2      c,x,a
print(df1)
#       elem2
#0  j,k,a,i,v
#1        o,b
#2      g,f,w

#convert to lists
df['elem1list'] = df['elem1'].str.split(',')
df1['elem2list'] = df1['elem2'].str.split(',')

#add column from df1 (rows are aligned on the index)
df['elem2list'] = df1['elem2list']
print(df)
#       elem1        elem1list        elem2list
#0  a,c,v,b,n  [a, c, v, b, n]  [j, k, a, i, v]
#1          b              [b]           [o, b]
#2      c,x,a        [c, x, a]        [g, f, w]

#intersect the two lists row by row (the order of a set is not guaranteed)
df['elem'] = df.apply(lambda x:  list(set(x['elem2list']).intersection(x['elem1list'])), axis=1)
print(df)
#       elem1        elem1list        elem2list    elem
#0  a,c,v,b,n  [a, c, v, b, n]  [j, k, a, i, v]  [a, v]
#1          b              [b]           [o, b]     [b]
#2      c,x,a        [c, x, a]        [g, f, w]      []
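
A minimal end-to-end sketch of the same idea, written as an assumption-laden variation rather than the answer's own code: it keeps the token order of the first column, returns NaN instead of an empty list (matching the expected output in the question), and tolerates frames of different length by treating unmatched rows as NaN. The helper name `common_elements` and the extra fourth row `q` are hypothetical, added only to illustrate the length mismatch.

```python
import numpy as np
import pandas as pd

# The fourth row of df1 is made up to show a length mismatch.
df1 = pd.DataFrame({"elem1": ["a,c,v,b,n", "b", "c,x,a", "q"]})
df2 = pd.DataFrame({"elem2": ["j,k,a,i,v", "o,b", "g,f,w"]})


def common_elements(left, right):
    """Comma-joined tokens of `left` that also occur in `right`, else NaN."""
    # A row missing from either frame shows up as NaN after alignment.
    if pd.isna(left) or pd.isna(right):
        return np.nan
    allowed = set(right.split(","))
    # Iterate over `left` so the original order of its tokens is preserved.
    kept = [tok for tok in left.split(",") if tok in allowed]
    return ",".join(kept) if kept else np.nan


# concat aligns on the index, so the shorter column is padded with NaN.
both = pd.concat([df1["elem1"], df2["elem2"]], axis=1)
elem = pd.Series(
    [common_elements(a, b) for a, b in zip(both["elem1"], both["elem2"])],
    index=both.index,
    name="elem",
)
print(elem)
```

Using a plain list comprehension over the two columns avoids the per-row overhead of `apply` with `axis=1`, which may help on large inputs, though that is not measured here.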

`set(x['elem2list']).intersection(x['elem1list'])` is even better at the last step – Nader Hisham Nov 16 '15 at 20:10
This actually works for dataframes of different sizes. Nice catch! – mnmbs Nov 16 '15 at 21:09
