1

I have two data frames containing a common variable, 'citation'. I am trying to check if values of citation in one data frame are also values in the other data frame. The problem is that the variables are of different format. In one data frame the variables appear as:

0154/0924
0022/0320

whereas in the other data frame they appear as:

154/ 924
22/ 320

the differences being: 1) no zeros before the first non-zero integer of the number before the hyphen and 2) zeros that appear after the hyphen but before the first non-zero integer after the hyphen are replaced with spaces, ' ', in the second data frame.

I am trying to use a function and apply it, as shown in the code below, but I am having trouble with regex and I could not find documentation on this exact problem.

def Clean_citation(citation): 
    # Search for opening bracket in the name followed by 
    # any characters repeated any number of times 
    if re.search('\(.*', citation): 

        # Extract the position of beginning of pattern 
        pos = re.search('\(.*', citation).start() 

        # return the cleaned name 
        return citation[:pos] 

    else: 
        # if clean up needed return the same name 
        return citation 


df['citation'] = df['citation'].apply(Clean_citation)
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
Graham Streich
  • 874
  • 3
  • 15
  • 31

4 Answers4

1

Aside: Maybe something relevant- 01 invalid token

My solution:

def convert_str(strn):
    new_strn = [s.lstrip("0") for s in strn.split('/')] #to strip only leading 0's
    return ('/ ').join(new_strn)

So,

convert_str('0154/0924') #would return
'154/ 924'

Which is in the same format as 'citation' in the other data frame. Could make use of pandas apply function to 'apply' convert_str function on 'citation' column of first dataframe.

hkr
  • 270
  • 1
  • 11
0

Solution

You can use x.str.findall('(\d+)') where x is either the pandas.Dataframe column or a pandas.Series object. You can run this on both columns and extract the true numbers, with each row as a list of two numbers or none (if no number is present.

You could then concatenate the numbers into a single string:

num_pair_1 = df1.Values.str.findall('(\d+)')
num_pair_2 = df2.Values.str.findall('(\d+)')
a = num_pair_1.str.join('/') # for first data column
b = num_pair_2.str.join('/') # for second data column

And now finally compare a and b as they should not have any of those additional zeros or spaces.

# for a series s with the values
s.str.strip().str.findall('(\d+)')
# for a column 'Values' in a dataframe df
df.Values.str.findall('(\d+)')

Output

0            []
1    [154, 924]
2     [22, 320]
dtype: object

Data

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd

ss = """
154/ 924
22/ 3
"""
s = pd.Series(StringIO(ss))
df = pd.DataFrame(s.str.strip(), columns=['Values'])

Output

    Values
0   
1   154/ 924
2   22/ 320
CypherX
  • 7,019
  • 3
  • 25
  • 37
0

Convert the str to a list by str.split('/') and map to int:

  • int will remove the leading zeros
  • If the values in the list are different, df1['citation'] == df2['citation'] will compare as False by row
  • Requires no regular expressions or list comprehensions

Dataframe setup:

df1 = pd.DataFrame({'citation': ['0154/0924', '0022/0320']})
df2 = pd.DataFrame({'citation': ['154/ 924', '22/ 320']})

print(df1)

  citation
 0154/0924
 0022/0320

print(df2)

 citation
 154/ 924
  22/ 320

Split on / and set type to int:

def fix_citation(x):
    return list(map(int, x.split('/')))


df1['citation'] = df1['citation'].apply(fix_citation)
df2['citation'] = df2['citation'].apply(fix_citation)

print(df1)

   citation
 [154, 924]
  [22, 320]

print(df2)

   citation
 [154, 924]
  [22, 320]

Compare the columns:

df1 == df2

enter image description here

Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
0

Here's a pattern that would filter both:

pattern = '[0\s]*(\d+)/[0\s]*(\d+)'

s = pd.Series(['0154/0924','0022/0320', '154/ 924', '22/ 320'])

s.str.extract('[0\s]*(\d+)/[0\s]*(\d+)')

Output:

     0    1
0  154  924
1   22  320
2  154  924
3   22  320
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74