I have two data frames that share a column, 'citation'. I am trying to check whether the values of citation in one data frame also appear in the other data frame. The problem is that the values are formatted differently. In one data frame they appear as:
0154/0924
0022/0320
whereas in the other data frame they appear as:
154/ 924
22/ 320
The differences are: 1) in the second data frame, the number before the slash has no leading zeros, and 2) the leading zeros of the number after the slash are replaced with spaces, ' ', in the second data frame.
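To make the two differences concrete, here is a minimal sketch that maps both formats onto one canonical key. The function name `normalize_citation` is hypothetical, and it assumes every value is exactly two integers separated by a single slash, which is only inferred from the four samples above.

```python
def normalize_citation(citation: str) -> str:
    """Map '0154/0924' and '154/ 924' to the same key, '154/924'.

    Assumption: the value is two integers separated by one '/'.
    """
    before, after = citation.split("/")
    # int() accepts surrounding whitespace and drops leading zeros,
    # which covers both differences described above.
    return f"{int(before)}/{int(after)}"


print(normalize_citation("0154/0924"))
print(normalize_citation("154/ 924"))
```

Both calls should print the same key, so values from either data frame become directly comparable once they pass through the function. A value that does not fit the assumed shape raises a `ValueError` rather than being silently changed.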
I am trying to write a function and apply it to the column, as shown in the code below, but I am having trouble with the regex and could not find documentation on this exact problem.
import re

def Clean_citation(citation):
    # Search for an opening bracket in the citation followed by
    # any characters repeated any number of times
    match = re.search(r'\(.*', citation)
    if match:
        # Extract the position of the beginning of the pattern
        pos = match.start()
        # Return the cleaned citation
        return citation[:pos]
    else:
        # If no clean-up is needed, return the same citation
        return citation

df['citation'] = df['citation'].apply(Clean_citation)
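Since the function above only strips text from an opening bracket onward, it leaves the padding difference untouched. One possible regex-based alternative is sketched below. The names `citation_key`, `padded`, and `spaced` are hypothetical, the third value in `spaced` is made up for illustration, and the pattern assumes the "number/number" shape seen in the samples.

```python
import re

# Optional leading zeros, digits, a slash, then optional spaces or
# zeros used as padding, then digits. Assumed from the sample values.
_CITATION_RE = re.compile(r"^\s*0*(\d+)\s*/[\s0]*(\d+)\s*$")


def citation_key(citation):
    """Return a padding-free 'number/number' key for either format."""
    match = _CITATION_RE.match(citation)
    if match is None:
        # Leave values that do not fit the assumed shape unchanged.
        return citation
    return f"{match.group(1)}/{match.group(2)}"


# Stand-ins for the two citation columns.
padded = ["0154/0924", "0022/0320"]
spaced = ["154/ 924", "22/ 320", "7/  12"]

padded_keys = {citation_key(c) for c in padded}
in_both = [citation_key(c) in padded_keys for c in spaced]
print(in_both)
```

With pandas, the same function could be passed to `.apply` on each citation column, and the resulting columns compared with `Series.isin`, assuming both columns hold strings.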