I would like to combine two columns of a pandas DataFrame to create new columns:

df:

 id  v_1   v_2                         v_3
 35  'dfa' [u'cszc', u'bdv', u'yhs']   [u'cszc', u'bdv']  
 78  'dfa' [u'scaw', u'ygf', u'ompt']  [u'ompt', u'bdv']
 99  'dfa' [u'svca', u'yve', u'wwca']  [u'thbsd', u'tbs']

I need:

 id  v_1   v_2                         v_3                  new_v_4            new_v_5
 35  'dfa' [u'cszc', u'bdv', u'yhs']   [u'cszc', u'bdv']    [u'cszc', u'bdv']  2/3
 78  'dfa' [u'scaw', u'ygf', u'ompt']  [u'ompt', u'bdv']    [u'ompt']          1/3
 99  'dfa' [u'svca', u'yve', u'wwca']  [u'thbsd', u'tbs']   []                 0

"new_v_4" should hold the intersection of columns "v_2" and "v_3". "new_v_5" is the size of that intersection divided by the size of "v_2". Both "v_2" and "v_3" have dtype object. I would prefer "new_v_4" to be a list of strings. I tried to use "join", but I do not know how to join two object columns within one DataFrame.

Raw input:

import pandas as pd

df = pd.DataFrame([[35, 'dfa', [u'cszc', u'bdv', u'yhs'],  [u'cszc', u'bdv']],
                   [78, 'dfa', [u'scaw', u'ygf', u'ompt'], [u'ompt', u'bdv']],
                   [99, 'dfa', [u'svca', u'yve', u'wwca'], [u'thbsd', u'tbs']]],
                  columns=['id', 'v_1', 'v_2', 'v_3'])
user3448011

1 Answer

Get the intersection row by row with list(set(a) & set(b)):

df['new_v_4'] = [list(set(a) & set(b)) for a, b in zip(df.v_2, df.v_3)]

The ratio (intersection size over the size of v_2) can be calculated with a similar list comprehension:

df['new_v_5'] = [len(a)/len(b) for a, b in zip(df.new_v_4, df.v_2)] 
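The division above raises ZeroDivisionError if a row's v_2 list is empty, and on Python 2 (which the u'' prefixes hint at) `/` on two ints truncates to 0. A minimal sketch of a guarded version, using plain lists in place of the DataFrame columns; `overlap_ratio` is a hypothetical helper name, and treating an empty v_2 as zero overlap is an assumption:

```python
def overlap_ratio(common, reference):
    """Return len(common) / len(reference), or 0.0 when reference is empty."""
    if len(reference) == 0:
        # Assumption: an empty reference list counts as zero overlap.
        return 0.0
    # float() forces true division on Python 2 as well as Python 3.
    return len(common) / float(len(reference))


# Plain lists standing in for df.v_2 and df.new_v_4; the last row is empty on purpose.
rows_v2 = [['cszc', 'bdv', 'yhs'], ['scaw', 'ygf', 'ompt'], []]
rows_v4 = [['cszc', 'bdv'], ['ompt'], []]

ratios = [overlap_ratio(a, b) for a, b in zip(rows_v4, rows_v2)]
```

On the DataFrame this would read `df['new_v_5'] = [overlap_ratio(a, b) for a, b in zip(df.new_v_4, df.v_2)]`; the values for the three sample rows are unchanged.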

Result

   id  v_1                v_2           v_3      new_v_4   new_v_5
0  35  dfa   [cszc, bdv, yhs]   [cszc, bdv]  [cszc, bdv]  0.666667
1  78  dfa  [scaw, ygf, ompt]   [ompt, bdv]       [ompt]  0.333333
2  99  dfa  [svca, yve, wwca]  [thbsd, tbs]           []  0.000000
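One caveat: a set has no defined order, so list(set(a) & set(b)) may return the common words in a different order than they appear in v_2. If the order matters, something along these lines might help. It is only a sketch, and `ordered_intersection` is a hypothetical helper name, not a pandas function:

```python
def ordered_intersection(left, right):
    """Items of `left` that also occur in `right`, in `left`'s order, without duplicates."""
    right_set = set(right)  # set membership tests are O(1) on average
    seen = set()
    result = []
    for item in left:
        if item in right_set and item not in seen:
            seen.add(item)
            result.append(item)
    return result


# Plain lists standing in for one row of v_2 and v_3.
common = ordered_intersection(['cszc', 'bdv', 'yhs'], ['bdv', 'cszc'])
```

Applied to the DataFrame: `df['new_v_4'] = [ordered_intersection(a, b) for a, b in zip(df.v_2, df.v_3)]`.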
Emma
  • Thanks. I tried your code, but the new column "new_v_4" seems to collect all intersecting letters (including "u", which should not be there) rather than intersecting at word level. – user3448011 Jan 06 '22 at 00:03
  • Sounds like you have stringified lists in `v_2` or `v_3`. Could you check the type of the data in `v_2` and `v_3`? – Emma Jan 06 '22 at 01:10
  • Your solution works, at least for the given input. OP's data must not be like they say it is. –  Jan 06 '22 at 01:26
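If the cells really are strings such as "[u'cszc', u'bdv']" rather than lists, as the comments above suggest, set() splits them into single characters, which would explain the stray "u". A minimal sketch of converting such cells back to lists with the standard library's ast.literal_eval; `parse_list_cell` is a hypothetical helper name, and whether the real data looks like this is an assumption:

```python
import ast


def parse_list_cell(value):
    """Turn a stringified list back into a list; leave other values untouched."""
    if isinstance(value, str):
        # literal_eval only parses Python literals; it does not run arbitrary code.
        return ast.literal_eval(value)
    return value


raw_v2 = "[u'cszc', u'bdv', u'yhs']"
raw_v3 = "[u'cszc', u'bdv']"

# On strings, set() yields characters, so the intersection is letter-level.
char_overlap = set(raw_v2) & set(raw_v3)

# After parsing, the intersection is word-level again.
word_overlap = set(parse_list_cell(raw_v2)) & set(parse_list_cell(raw_v3))
```

To check the actual type, `type(df.v_2.iloc[0])` shows whether a cell is a str or a list; if it is a str, `df['v_2'] = df['v_2'].apply(parse_list_cell)` (and the same for v_3) would convert the column before computing the intersection.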