two step regex on pandas variable

Question

I have two data frames containing a common variable, 'citation'. I am trying to check if values of citation in one data frame are also values in the other data frame. The problem is that the variables are of different format. In one data frame the variables appear as:

0154/0924
0022/0320

whereas in the other data frame they appear as:

154/ 924
22/ 320

the differences being: 1) no zeros before the first non-zero integer of the number before the hyphen and 2) zeros that appear after the hyphen but before the first non-zero integer after the hyphen are replaced with spaces, ' ', in the second data frame.

I am trying to use a function and apply it, as shown in the code below, but I am having trouble with regex and I could not find documentation on this exact problem.

def Clean_citation(citation): 
    # Search for opening bracket in the name followed by 
    # any characters repeated any number of times 
    if re.search('\(.*', citation): 

        # Extract the position of beginning of pattern 
        pos = re.search('\(.*', citation).start() 

        # return the cleaned name 
        return citation[:pos] 

    else: 
        # if clean up needed return the same name 
        return citation 


df['citation'] = df['citation'].apply(Clean_citation)

score 1 · Accepted Answer · answered Sep 13 '19 at 21:10

Aside: Maybe something relevant- 01 invalid token

My solution:

def convert_str(strn):
    new_strn = [s.lstrip("0") for s in strn.split('/')] #to strip only leading 0's
    return ('/ ').join(new_strn)

So,

convert_str('0154/0924') #would return
'154/ 924'

Which is in the same format as 'citation' in the other data frame. Could make use of pandas apply function to 'apply' convert_str function on 'citation' column of first dataframe.

CypherX · Answer 2 · 2019-09-13T21:04:45.420

Solution

You can use x.str.findall('(\d+)') where x is either the pandas.Dataframe column or a pandas.Series object. You can run this on both columns and extract the true numbers, with each row as a list of two numbers or none (if no number is present.

You could then concatenate the numbers into a single string:

num_pair_1 = df1.Values.str.findall('(\d+)')
num_pair_2 = df2.Values.str.findall('(\d+)')
a = num_pair_1.str.join('/') # for first data column
b = num_pair_2.str.join('/') # for second data column

And now finally compare a and b as they should not have any of those additional zeros or spaces.

# for a series s with the values
s.str.strip().str.findall('(\d+)')

# for a column 'Values' in a dataframe df
df.Values.str.findall('(\d+)')

Output

0            []
1    [154, 924]
2     [22, 320]
dtype: object

Data

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd

ss = """
154/ 924
22/ 3
"""
s = pd.Series(StringIO(ss))
df = pd.DataFrame(s.str.strip(), columns=['Values'])

Output

    Values
0   
1   154/ 924
2   22/ 320

Trenton McKinney · Answer 3 · 2019-09-13T21:57:06.317

Convert the `str` to a `list` by `str.split('/')` and `map` to `int`:

int will remove the leading zeros
If the values in the list are different, df1['citation'] == df2['citation'] will compare as False by row
Requires no regular expressions or list comprehensions

Dataframe setup:

df1 = pd.DataFrame({'citation': ['0154/0924', '0022/0320']})
df2 = pd.DataFrame({'citation': ['154/ 924', '22/ 320']})

print(df1)

  citation
 0154/0924
 0022/0320

print(df2)

 citation
 154/ 924
  22/ 320

Split on `/` and set type to `int`:

def fix_citation(x):
    return list(map(int, x.split('/')))


df1['citation'] = df1['citation'].apply(fix_citation)
df2['citation'] = df2['citation'].apply(fix_citation)

print(df1)

   citation
 [154, 924]
  [22, 320]

print(df2)

   citation
 [154, 924]
  [22, 320]

Compare the columns:

df1 == df2

score 0 · Answer 4 · answered Sep 13 '19 at 21:17

Here's a pattern that would filter both:

pattern = '[0\s]*(\d+)/[0\s]*(\d+)'

s = pd.Series(['0154/0924','0022/0320', '154/ 924', '22/ 320'])

s.str.extract('[0\s]*(\d+)/[0\s]*(\d+)')

Output:

     0    1
0  154  924
1   22  320
2  154  924
3   22  320

two step regex on pandas variable

4 Answers4

Solution

Output

Data

Output

Convert the str to a list by str.split('/') and map to int:

Dataframe setup:

Split on / and set type to int:

Compare the columns:

Convert the `str` to a `list` by `str.split('/')` and `map` to `int`:

Split on `/` and set type to `int`: