Pandas dynamically pattern match from second dataframe and extract string

Question

Getting my knickers in a twist trying to dynamically build a regex extract pattern from a second dataframe list and populate another column with the string if it's contained in the list.

here are the two starting tables:

import pandas as pd
import re

# this is a short extract, there are 1000's of records in this table:
provinces = pd.DataFrame({'country': ['Brazil','Brazil','Brazil','Colombia','Colombia','Colombia'],
                  'area': ['Cerrado','Sul de Minas', 'Mococoa','Tolima','Huila','Quindio'],
                  'index': [13,21,19,35,36,34]})

# test dataframe
df_test = pd.DataFrame({'country':['brazil','brazil','brazil','brazil','colombia','colombia','brazil'],
                       'locality':['sul de minas minas gerais','chapadao cerrado','cerrado cerrado','mococa sao paulo','pitalito huila','pijao quindio','espirito santo']})

print(provinces)

    country          area  index
0    Brazil       Cerrado     13
1    Brazil  Sul de Minas     21
2    Brazil       Mococoa     19
3  Colombia        Tolima     35
4  Colombia         Huila     36
5  Colombia       Quindio     34

print(df_test)
    country                   locality
0    brazil  sul de minas minas gerais
1    brazil           chapadao cerrado
2    brazil            cerrado cerrado
3    brazil           mococa sao paulo
4  colombia             pitalito huila
5  colombia              pijao quindio
6    brazil             espirito santo

and end result:

df_result = pd.DataFrame({'country':['brazil','brazil','brazil','brazil','colombia','colombia','brazil'],
                       'locality':['minas gerais','chapadao','cerrado','sao paulo','pitalito','pijao','espirito santo'],
                         'area': ['sul de minas','cerrado','cerrado','mococoa','huila','quindio',''],
                         'index': [21,13,13,19,36,34,np.nan]})
print(df_result)
    country        locality          area  index
0    brazil    minas gerais  sul de minas   21.0
1    brazil        chapadao       cerrado   13.0
2    brazil         cerrado       cerrado   13.0
3    brazil       sao paulo       mococoa   19.0
4  colombia        pitalito         huila   36.0
5  colombia           pijao       quindio   34.0
6    brazil  espirito santo                  NaN

Can't get around the first step to populate the area column. Once the area column contains a string, stripping the same string from the locality column and adding the index column with a left join on the country and area columns is the easy part(!)

# to create the area column and extract the area string if there's a match (by string and country) in the provinces table
df_test['area'] = ''
df_test.area = df_test.locality.str.extract(flags=re.IGNORECASE, pat = r'(\b{}\b)'.format('|'.join(provinces.loc[provinces.country.str.lower()==df_test.country,'area'].str.lower().to_list()), expand=False))

and I'd also need to apply a map to exclude some records from this step.

# as above but for added complexity, populate the area column only if df_test.country == 'brazil':
df_test['area'] = ''
mapping = df_test.country =='brazil'
df_test.loc[mapping,'area'] = df_test.loc[mapping,'locality'].str.extract(flags=re.IGNORECASE, pat = r'(\b{}\b)'.format('|'.join(provinces.loc[provinces.country.str.lower()==df_test.country,'area'].str.lower().to_list()), expand=False))

All the vectorised regex extract solutions I've found rely on pre-defined regex patterns, but given these patterns need to come from the provinces dataframe where the countries match, this question and answer seemed like the closet match to this scenario but I couldn't make sense of it...

Thanks in advance

score 0 · Answer 1 · answered Feb 06 '20 at 08:28

following the trail of error messages (and sleep!), "Can only compare identically-labeled Series objects" resolved with this answer

And then "ValueError: Lengths must match to compare" with this answer

here's the solution:

df_test['area'] = ''
df_test.area = df_test.locality.str.extract(flags=re.IGNORECASE, pat = r'({})'.format('|'.join(provinces.loc[provinces.country.str.lower().isin(df_test.country),'area'].str.lower().to_list()), expand=False))

[out]

   country                   locality          area
0    brazil  sul de minas minas gerais  sul de minas
1    brazil           chapadao cerrado       cerrado
2    brazil            cerrado cerrado       cerrado
3    brazil          mococoa sao paulo       mococoa
4  colombia             pitalito huila         huila
5  colombia              pijao quindio       quindio
6    brazil             espirito santo           NaN

Pandas dynamically pattern match from second dataframe and extract string

1 Answers1