3

I have 2 dataframe that contain lists and i want to keep the elements of the first dataframe that are contained in the second dataframe. Is it possible or i must try some other data structures?

example of input:

df1:

elem1
a,c,v,b,n
b
c,x,a

df2:

elem2
j,k,a,i,v
o,b
g,f,w

expected output:

elem
a,v
b
NaN
Nader Hisham
  • 5,214
  • 4
  • 19
  • 35
mnmbs
  • 353
  • 3
  • 13

2 Answers2

1

so first of all you can create a regular expression of letters you want to match

In [77]:
chars = df2.elem2.str.replace(',' , '|')
chars
Out[77]:
0    j|k|a|i|v
1          o|b
2        g|f|w
Name: elem2, dtype: object

the concatenate both into a data frame in order to perform a custom function later

In [24]:
to_compare = pd.concat([df1 , chars] , axis = 1)
to_compare
Out[24]:
       elem1    elem2
0   a,c,v,b,n   j|k|a|i|v
1   b           o|b
2   c,x,a       g|f|w

finally use your regular expression to match the date from elem1

In [76]:
to_compare.apply( lambda x : ','.join(re.findall(x['elem2'] , x['elem1'])) , axis = 1)
Out[76]:
0    a,v
1      b
2       
dtype: object

if you want to convert empty string from the final result to NAN , I'll leave you to figure it out on your own :-)

Nader Hisham
  • 5,214
  • 4
  • 19
  • 35
1

First columns are converted to lists by function str.split.

If indexes are same in both dataframes, you can easily add column from one df to another.

You can apply difference of sets converted from lists of columns and then convert to list. You have to use axis=1, because apply function to each row.

print df
#       elem1
#0  a,c,v,b,n
#1          b
#2      c,x,a
print df1
#       elem2
#0  j,k,a,i,v
#1        o,b
#2      g,f,w

#convert to lists
df['elem1list'] = df['elem1'].str.split(',')
df1['elem2list'] = df1['elem2'].str.split(',')

#add column from df1
df['elem2list']  = df1['elem2list'] 
print df
#       elem1        elem1list        elem2list
#0  a,c,v,b,n  [a, c, v, b, n]  [j, k, a, i, v]
#1          b              [b]           [o, b]
#2      c,x,a        [c, x, a]        [g, f, w]

df['elem'] = df.apply(lambda x:  list(set(x['elem2list']).intersection(x['elem1list'])), axis=1)
print df
#       elem1        elem1list        elem2list    elem
#0  a,c,v,b,n  [a, c, v, b, n]  [j, k, a, i, v]  [a, v]
#1          b              [b]           [o, b]     [b]
#2      c,x,a        [c, x, a]        [g, f, w]      []
Community
  • 1
  • 1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252