Getting my knickers in a twist trying to dynamically build a regex extract pattern from a second dataframe list and populate another column with the string if it's contained in the list.
here are the two starting tables:
import pandas as pd
import re
# this is a short extract, there are 1000's of records in this table:
provinces = pd.DataFrame({'country': ['Brazil','Brazil','Brazil','Colombia','Colombia','Colombia'],
'area': ['Cerrado','Sul de Minas', 'Mococoa','Tolima','Huila','Quindio'],
'index': [13,21,19,35,36,34]})
# test dataframe
df_test = pd.DataFrame({'country':['brazil','brazil','brazil','brazil','colombia','colombia','brazil'],
'locality':['sul de minas minas gerais','chapadao cerrado','cerrado cerrado','mococa sao paulo','pitalito huila','pijao quindio','espirito santo']})
print(provinces)
country area index
0 Brazil Cerrado 13
1 Brazil Sul de Minas 21
2 Brazil Mococoa 19
3 Colombia Tolima 35
4 Colombia Huila 36
5 Colombia Quindio 34
print(df_test)
country locality
0 brazil sul de minas minas gerais
1 brazil chapadao cerrado
2 brazil cerrado cerrado
3 brazil mococa sao paulo
4 colombia pitalito huila
5 colombia pijao quindio
6 brazil espirito santo
and end result:
df_result = pd.DataFrame({'country':['brazil','brazil','brazil','brazil','colombia','colombia','brazil'],
'locality':['minas gerais','chapadao','cerrado','sao paulo','pitalito','pijao','espirito santo'],
'area': ['sul de minas','cerrado','cerrado','mococoa','huila','quindio',''],
'index': [21,13,13,19,36,34,np.nan]})
print(df_result)
country locality area index
0 brazil minas gerais sul de minas 21.0
1 brazil chapadao cerrado 13.0
2 brazil cerrado cerrado 13.0
3 brazil sao paulo mococoa 19.0
4 colombia pitalito huila 36.0
5 colombia pijao quindio 34.0
6 brazil espirito santo NaN
Can't get around the first step to populate the area column. Once the area column contains a string, stripping the same string from the locality column and adding the index column with a left join on the country and area columns is the easy part(!)
# to create the area column and extract the area string if there's a match (by string and country) in the provinces table
df_test['area'] = ''
df_test.area = df_test.locality.str.extract(flags=re.IGNORECASE, pat = r'(\b{}\b)'.format('|'.join(provinces.loc[provinces.country.str.lower()==df_test.country,'area'].str.lower().to_list()), expand=False))
and I'd also need to apply a map to exclude some records from this step.
# as above but for added complexity, populate the area column only if df_test.country == 'brazil':
df_test['area'] = ''
mapping = df_test.country =='brazil'
df_test.loc[mapping,'area'] = df_test.loc[mapping,'locality'].str.extract(flags=re.IGNORECASE, pat = r'(\b{}\b)'.format('|'.join(provinces.loc[provinces.country.str.lower()==df_test.country,'area'].str.lower().to_list()), expand=False))
All the vectorised regex extract solutions I've found rely on pre-defined regex patterns, but given these patterns need to come from the provinces dataframe where the countries match, this question and answer seemed like the closet match to this scenario but I couldn't make sense of it...
Thanks in advance