0

I have two data frames, one named 'foo' and one named 'bar'. My dataframe 'foo' has some unique columns and my dataframe 'bar' also has some unique columns. However, they both share one column, the column 'google'. I am trying to see if there is a way to keep all columns in data frame 1, 'foo', and add one additional column, the column 'CLRS', which will be 1 if the content in the column 'google' in that row of 'foo' appears somewhere in the column 'google' in 'bar'.

More specifically, let's assume that the dataframes I have are structured like this: 'foo' contains columns: 'foo_1','foo_2', ..., 'google' and bar contains columns: 'bar_1', 'bar_2, ..., 'google'. I want to join/merge 'foo' and 'bar' in a way such that 'foo' has an additional column, 'CLRS', such that 'CLRS' has a 1 if the contents of 'google' in that row of 'foo' appear at some point in the 'google' column of 'bar'. I have tried the following code:

      '''
         # foo examples
         foo['foo1'] = ['dijkstra','TSP',...]
         foo['foo2'] = ['Oculus','VR', ...]
         .
         .
         .
         foo['google'] = ['search','ads', 'A/B Testing', 'UI' ...]

         # bar examples
         bar['bar1'] = ['dijkstra','TSP',...]
         bar['bar2'] = ['search','ads', ...]
         .
         .
         .
         # 'A/B Testing' appears in the column somewhere but 'ads' does 
         # not
         bar['google'] = ['search','google_search', 'TDD', 'UI', 
         ...,'A/B Testing', ...]

         # my code
         foo_merged = 
                    foo.join(bar, how = 'left')

         # my result 
         foo_merged['foo1'] = ['dijkstra','TSP',...]
         foo_merged['foo2'] = ['search','ads', ...]
         .
         .
         .
         foo_merged['google'] = ['search','ads', ...]
         foo_merged['CLRS']   = ['search','google_search', 'TDD', 'UI', 
         ...]

         # What I want as an output for foo_merged is:
         foo_merged['foo1'] = ['dijkstra','TSP',...]
         foo_merged['foo2'] = ['search','ads', ...]
         .
         .
         .
         foo_merged['google'] = ['search','ads', 'A/B Testing', 'UI' 
         ...]
         foo_merged['CLRS']   = [1,0,1,1,...]
      '''

Unfortunately, after running the previous join code, foo_merged contains all columns of foo and one additional column which contains always the contents of the column 'google' from 'bar'. My desired result instead would be a df such that the additional column 'CLRS' contains 1 if the content of 'google' in that row of 'foo' appears somewhere as a content of the column 'google' in 'bar' and 0 otherwise.

1 Answers1

0

I believe you are looking for merge with indicator = True.
Indicator will flag each row that is in or not in both DataFrames

df = pd.merge(foo, bar, how='left', on = 'google', indicator = True)
df['CLRS'] = (df['_merge'] == 'both').astype(int)    
#or df['CLRS'] = np.where(df['_merge'] == 'both', 1, 0)
Terry
  • 2,761
  • 2
  • 14
  • 28