Match string values in two different DataFrames and create a new column with match indicator in Pandas

Question

I have two dataframes (df1, df2) and I would like to create a new column in df1 that indicates if there is a match in the code columns between each dataframe. The code column in df2 is made up of strings separated by a comma.

df1

Date        Code
2016-01-01  LANH08
2016-01-01  LAOH07
2016-01-01  LAPH09
2016-01-01  LAQH06
2016-01-01  LARH03

df2

Date        Code
2016-01-01  LANH08, LAOH07, LXA0EW, LAGRL1
2016-01-01  LAUH02, LAVH00, LAVH01, LAYH00
2016-01-01  LANH08
2016-01-01  AAH00, ABH00, XAH03
2016-01-01  ARH04, BA0BW, BMH01, DPH00

My Goal

df1

Date        Code    Match
2016-01-01  LANH08  Y
2016-01-01  LAOH07  Y
2016-01-01  LAPH09  N
2016-01-01  LAQH06  N
2016-01-01  LARH03  N

Charlie Haley · Answer 1 · 2016-02-25T21:06:51.920

0

#Split df2['Code'] into an array
df2.Code = df2.Code.str.split(', ')

#Recreate df2 reshaped
df2 = pd.concat([pd.DataFrame(dict(list(zip(df2.columns,df2.ix[i]))),\
      index=range(len(list(zip(df2.columns,df2.ix[i]))[1]))) for i in range(len(df2.index))])


#default df2['match'] to 'Y'
df2['Match'] = 'Y'

#Create new dataframe by left merging df1 with df2
df3 = df1.merge(df2, left_on = ['Date','Code'], right_on = ['Date','Code'], how = 'left')

#Fill NaN values in Match column with 'N' (because they weren't in df2)
df3['Match'] = df3['Match'].fillna('N')

edited Feb 25 '16 at 21:06

answered Feb 25 '16 at 18:52

Charlie Haley

4,152
4
22
36

Ok I get an error - `ValueError: If using all scalar values, you must pass an index`. Looks to be referencing line - `df2 = pd.concat([pd.DataFrame(dict(zip(df2.columns,df2.ix[i]))) for i in range(len(df2))])` in the traceback. Note that I created df1 and df2 from two existing dataframes. Perhaps an problem with the data? – shavar Feb 25 '16 at 19:30
I just changed it, test that out. – Charlie Haley Feb 25 '16 at 19:37
I think getting closer. I'm getting a TypeError - 'TypeError: 'zip' object is not subscriptable'. – shavar Feb 25 '16 at 19:49
Did you type the change in instead of copy/pasting it? It sounds like you misaligned your parentheses and subscripts. – Charlie Haley Feb 25 '16 at 19:53
Initially copy/paste but have since retyped and ran on one line and still get the error. I'm on Python3 if that makes a difference. – shavar Feb 25 '16 at 21:03
Yes it does make a difference. In python2, zip returns a list, and in python3, zip returns a generator. So I called `list()` on the `zip` functions. – Charlie Haley Feb 25 '16 at 21:07
Ok now I'm getting `ValueError: Shape of passed values is (2, 4), indices imply (2, 2)`. So I changed df2 to 2x2 dataframe and it works on that. I tried to figure this out on my own but after a bit of reading on dict, list and zip I am still unable to work through it. I do understand what your code is doing but I don't understand how indices imply 2,2. It would seem that the shape would change on each iteration depending on the number of items in the Code field. I appreciate your help to this point. – shavar Feb 26 '16 at 14:22

score 0 · Accepted Answer · edited May 23 '17 at 12:07

Split pandas dataframe string entry to separate rows

Final Solution:

data1 = {'Date':['2016-01-01',
                 '2016-01-01',
                 '2016-01-01',
                 '2016-01-01',
                 '2016-01-01'],
         'Code':['LANH08',
                 'LAOH07',
                 'LAPH09',
                 'LAQH06',
                 'LARH03']}
df1 = DataFrame(data1)

data2 = {'Date':['2016-01-01',
                 '2016-01-01',
                 '2016-01-01',
                 '2016-01-01',
                 '2016-01-01'],
         'Code':['LANH08, LAOH07, LXA0EW, LAGRL1',
                 'LAUH02, LAVH00, LAVH01, LAYH00',
                 'LANH08',
                 'AAH00, ABH00, XAH03',
                 'LAUH02, LAVH00']}
df2 = DataFrame(data2)

df2 = DataFrame(df2.Code.str.split(', ').tolist(), index=df2.Date).stack().drop_duplicates()
df2 = df2.reset_index()[[0, 'Date']] # Code variable is currently labeled 0
df2.columns = ['Code', 'Date'] # Renaming Code

# default df2['match'] to 'Y'
df2['Match'] = 'Y'

# Create new dataframe by left merging df1 with df2
df3 = df1.merge(df2, left_on = ['Code', 'Date'], right_on = ['Code', 'Date'], how = 'left')

# Fill NaN values in Match column with 'N' (because they weren't in df2)
df3['Match'] = df3['Match'].fillna('N')

df3

    Code    Date        Match
0   LANH08  2016-01-01  Y
1   LAOH07  2016-01-01  Y
2   LAPH09  2016-01-01  N
3   LAQH06  2016-01-01  N
4   LARH03  2016-01-01  N

Match string values in two different DataFrames and create a new column with match indicator in Pandas

2 Answers2

Split pandas dataframe string entry to separate rows