I have a problem with pd.merge: some rows fail to match because the values in the key columns of the two datasets are made up of different Unicode code points, even though the strings look identical. Here is one example:
I have two datasets, data1 and data2, that have two columns in common, 'state' and 'county', and I merge on those two columns. I checked the data type of the values in 'state' and 'county' in both datasets, and they are all of class 'str'.
By using
data_merge = pd.merge(data1, data2, on=['county', 'state'], how='right')
I expected data1 row 308 to match data2 row 20691, but it does not, because 'county' in data1 row 308 and 'county' in data2 row 20691 consist of different Unicode code points:
I looked at the code points of these two values (unicode1 is from 'county' in data1 and unicode2 is from 'county' in data2), and they are indeed different:
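A comparison like the one described above could be reproduced roughly as follows. This is only a sketch: the two county strings are hypothetical stand-ins (one uses a precomposed character, the other a base letter plus a combining mark), not the actual values from data1 and data2, and `code_points` is a helper name made up for illustration.

```python
import unicodedata

def code_points(s):
    """Return (character, code point, Unicode name) for each character in s."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in s]

# Hypothetical values standing in for the two 'county' cells.
county1 = "Do\u00f1a Ana"    # precomposed "n with tilde" (one code point)
county2 = "Don\u0303a Ana"   # plain "n" followed by a combining tilde (two code points)

# The strings render the same but compare unequal, so pd.merge sees two different keys.
print(county1 == county2)
print(code_points(county1))
print(code_points(county2))
```

Printing the code points side by side makes it visible where the two strings diverge, which is usually enough to tell whether the cause is a normalization difference, a non-breaking space, or something else.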
How can I merge these two datasets despite this issue? Is there a way to tell pd.merge to ignore the Unicode differences? Thank you!
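As far as I know, pd.merge compares key values exactly and has no option to ignore such differences, so one possible approach is to normalize the key columns in both frames before merging. The sketch below assumes the two values differ only in Unicode normalization form (for example precomposed versus combining characters, or a non-breaking space versus a regular space). The frames, the values, and the `normalize_keys` helper are hypothetical; only `unicodedata.normalize` and `pd.merge` are real library calls.

```python
import unicodedata
import pandas as pd

def normalize_keys(df, cols, form="NFKC"):
    """Return a copy of df with the given string columns Unicode-normalized and stripped."""
    out = df.copy()
    for col in cols:
        # Leave non-string values (e.g. NaN) untouched.
        out[col] = out[col].map(
            lambda s: unicodedata.normalize(form, s).strip() if isinstance(s, str) else s
        )
    return out

# Hypothetical stand-ins for data1 and data2.
data1 = pd.DataFrame({"state": ["NM"], "county": ["Do\u00f1a Ana"], "x": [1]})
data2 = pd.DataFrame({"state": ["NM"], "county": ["Don\u0303a Ana"], "y": [2]})

keys = ["county", "state"]

# Without normalization the right merge finds no partner row in data1.
raw_merge = pd.merge(data1, data2, on=keys, how="right")

# With both sides normalized to the same form, the keys compare equal.
data_merge = pd.merge(
    normalize_keys(data1, keys),
    normalize_keys(data2, keys),
    on=keys,
    how="right",
)
print(raw_merge)
print(data_merge)
```

NFKC is chosen here because it also folds compatibility characters such as the non-breaking space; if only composed versus decomposed forms are involved, NFC would be a more conservative choice. pandas also provides `Series.str.normalize(form)`, which does the same normalization per column.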