Join pandas DataFrames by string pattern

Question

I have a question about merging two pandas DataFrames based on a string pattern in a column. There are some very helpful discussions on Stack Overflow, and I found an approach (Merge two dataframe if one string column is contained in another column in Pandas) that fits my requirements very well.

This approach works perfectly in my MWE.

import pandas as pd

# Target df
df = pd.DataFrame({'Company':['MAC CHEM PRODUCTS (INDIA) PVT. LTD. Mumbai IN',
                              'SIEGFRIED LTD. Zofingen CH',
                              'SHANDONG JINYANG PHARMACEUTICAL CO., LTD. Zibo City CN',
                              'CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ', 
                               ], 
                   'Certificate+Number':['R1-CEP 2012-025 - Rev 02',
                                         'R2-CEP 1996-036 - Rev 02',
                                         'R0-CEP 2008-165 - Rev 00',
                                         'R1-CEP 2002-193 - Rev 00',
                                          ],
                   'Substance':['Suxamethonium Chloride',
                                'Amitriptyline hydrochloride',
                                'Oxytetracycline hydrochloride',
                                'Ephedrine hydrochloride', 
                                 ], 
                       }
                       )

# print(df)

Company	Certificate+Number	Substance
MAC CHEM PRODUCTS (INDIA) PVT. LTD. Mumbai IN	R1-CEP 2012-025 - Rev 02	Suxamethonium Chloride
SIEGFRIED LTD. Zofingen CH	R2-CEP 1996-036 - Rev 02	Amitriptyline hydrochloride
SHANDONG JINYANG PHARMACEUTICAL CO., LTD. Zibo City CN	R0-CEP 2008-165 - Rev 00	Oxytetracycline hydrochloride
CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ	R1-CEP 2002-193 - Rev 00	Ephedrine hydrochloride

Second, I have a huge df with information on cities, countries, country-codes etc. First, as a minimal example:

world_cities_min = pd.DataFrame({'Geoname ID':[1275339,
                                 2657915,
                                 1785286,
                                 3061344, 
                                 ], 
                                  'City':['Mumbai',
                                          'Zofingen',
                                          'Zibo',
                                          'Zibo',
                                           ],
                                  'ASCII Name':['Mumbai',
                                                'Zofingen',
                                                'Zibo',
                                                'City', 
                                                 ], 
                                  'Country':['India',
                                             'Switzerland',
                                             'China',
                                             'Czech Republic', 
                                            ],
                                  'Alpha2':['IN',
                                            'CH',
                                            'CN',
                                            'CZ', 
                                            ], 
                               })
    
#print(world_cities_min.head(5))

Geoname ID	City	ASCII Name	Country	Alpha2
1275339	Mumbai	Mumbai	India	IN
2657915	Zofingen	Zofingen	Switzerland	CH
1785286	Zibo	Zibo	China	CN
3061344	Zibo	City	Czech Republic	CZ

Extract a pattern to find city names (following the approach from Merge two dataframe if one string column is contained in another column in Pandas):

pat = '|'.join(r"\b{}\b".format(x) for x in world_cities_min['ASCII Name'])

# and create column in target-df according to the name of the city
df['ASCII Name']= df['Company'].str.extract('('+ pat + ')', expand=False)
    
#print(df)

However, when I use the complete df of worldcities, I get the following error: ValueError: Cannot set a DataFrame with multiple columns to the single column ASCII Name

# Once again, the original target-df  
df = pd.DataFrame({'Company':['MAC CHEM PRODUCTS (INDIA) PVT. LTD. Mumbai IN',
                              'SIEGFRIED LTD. Zofingen CH',
                              'SHANDONG JINYANG PHARMACEUTICAL CO., LTD. Zibo City CN',
                              'CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ', 
                               ], 
                   'Certificate+Number':['R1-CEP 2012-025 - Rev 02',
                                         'R2-CEP 1996-036 - Rev 02',
                                         'R0-CEP 2008-165 - Rev 00',
                                         'R1-CEP 2002-193 - Rev 00',
                                          ],
                   'Substance':['Suxamethonium Chloride',
                                'Amitriptyline hydrochloride',
                                'Oxytetracycline hydrochloride',
                                'Ephedrine hydrochloride', 
                                 ], 
                       }
                       )

Loading the complete df

url = 'https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/geonames-all-cities-with-a-population-1000/exports/csv?lang=en&timezone=Europe%2FBerlin&use_labels=true&delimiter=%3B'


column_names = ['Geoname ID',
                'Name', 
                'ASCII Name',   
                'Alternate Names',
                'Feature Class',
                'Feature Code',
                'Country Code',
                'Country name EN',  
                'Country Code 2'    ,
                'Admin1 Code'   ,
                'Admin2 Code'   ,
                'Admin3 Code',  
                'Admin4 Code',  
                'Population',
                'Elevation',    
                'DIgital Elevation Model',  
                'Timezone', 
                'Modification date',    
                'LABEL EN', 
                'Coordinates'
                 ]
    
world_cities = pd.read_csv(url,
                           header=0,  # row 0 holds the original labels; they are replaced by column_names
                           sep=';',
                           names=column_names,
                           usecols=['Name',
                                    'ASCII Name',
                                    'Country Code',
                                    'Country name EN',
                                    'Coordinates'],
                           )

... doing the same thing:

pat = '|'.join(r"\b{}\b".format(x) for x in world_cities['ASCII Name'])

# and create column in target-df according to the name of the city
df['ASCII Name']= df['Company'].str.extract('('+ pat + ')', expand=False)
    
#print(df)

Leads to: ValueError: Cannot set a DataFrame with multiple columns to the single column ASCII Name

May I ask you to help me troubleshoot? Where is the issue in the complete df, and how can I deal with it? My overall goal is to keep the City, Country Name and Alpha2 code as separate columns. Unfortunately, this information appears in df['Company'] without a unique string pattern.

Thank you very much for any advice.

Shubham Sharma · Accepted Answer · 2023-03-26T05:29:41.330

Cause of the error

The larger world_cities dataframe contains city names with characters that have a special meaning in regular expressions. For instance, some of these names contain parentheses (), which denote capturing groups. Have a look at the following rows from world_cities whose ASCII names contain parentheses:

                  Name      ASCII Name Country Code Country name EN         Coordinates
84328      Hamm (Sieg)     Hamm (Sieg)           DE         Germany   50.76531, 7.67761
63174    Obolo-Eke (1)   Obolo-Eke (1)           NG         Nigeria    6.88333, 7.63333
50291    Halle (Saale)   Halle (Saale)           DE         Germany  51.48158, 11.97947
126292  Seen (Kreis 3)  Seen (Kreis 3)           CH     Switzerland   47.47646, 8.76996
131692  Schwedt (Oder)  Schwedt (Oder)           DE         Germany  53.05963, 14.28154
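As a minimal sketch of why this raises the error (the two company strings and the short name list below are made up for illustration, not taken from the full dataset): an unescaped name such as Halle (Saale) adds a second capturing group to the pattern, and str.extract then returns a DataFrame rather than a Series, which cannot be assigned to a single column.

```python
import re
import pandas as pd

# Hypothetical sample data, standing in for df['Company'] and the city names
companies = pd.Series(['ACME GMBH Halle (Saale) DE', 'SIEGFRIED LTD. Zofingen CH'])
names = ['Halle (Saale)', 'Zofingen']

# Unescaped: "(Saale)" is read as an additional capturing group
raw_pat = '(' + '|'.join(r"\b{}\b".format(x) for x in names) + ')'
raw_result = companies.str.extract(raw_pat, expand=False)

# Escaped: the parentheses are treated as literal characters
safe_pat = '(' + '|'.join(r"\b{}\b".format(re.escape(x)) for x in names) + ')'
safe_result = companies.str.extract(safe_pat, expand=False)

print(re.compile(raw_pat).groups, type(raw_result).__name__)
print(re.compile(safe_pat).groups, type(safe_result).__name__)
```

Counting the groups with re.compile(pat).groups is a quick way to confirm whether a generated pattern contains more capturing groups than intended.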

Solution

import re 

# ensure null values are dropped
cities = world_cities['ASCII Name'].dropna()

# Escape the special regex reserved characters in city names
pat = r'\b(%s)\b' % '|'.join(map(re.escape, cities))

# extract the matching occurrences of the regex pattern
df['ASCII name'] = df['Company'].str.extract(pat, expand=False)

Result

                                                     Company        Certificate+Number                      Substance ASCII name
0              MAC CHEM PRODUCTS (INDIA) PVT. LTD. Mumbai IN  R1-CEP 2012-025 - Rev 02         Suxamethonium Chloride     Mumbai
1                                 SIEGFRIED LTD. Zofingen CH  R2-CEP 1996-036 - Rev 02    Amitriptyline hydrochloride   Zofingen
2     SHANDONG JINYANG PHARMACEUTICAL CO., LTD. Zibo City CN  R0-CEP 2008-165 - Rev 00  Oxytetracycline hydrochloride       Zibo
3  CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ  R1-CEP 2002-193 - Rev 00        Ephedrine hydrochloride       Zibo
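To get the country as a separate column, the extracted name can then be merged with the city table. The following is only a sketch on a reduced copy of the question's minimal data, and it rests on an assumption that is not part of the answer above: that every company string ends with a two-letter country code. Merging on the city name together with that code keeps the two Zibo rows apart. The column names 'ASCII Name' and 'Alpha2' follow the question's minimal example.

```python
import re
import pandas as pd

df = pd.DataFrame({'Company': [
    'SIEGFRIED LTD. Zofingen CH',
    'SHANDONG JINYANG PHARMACEUTICAL CO., LTD. Zibo City CN',
    'CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ',
]})
cities = pd.DataFrame({'ASCII Name': ['Zofingen', 'Zibo', 'Zibo'],
                       'Country': ['Switzerland', 'China', 'Czech Republic'],
                       'Alpha2': ['CH', 'CN', 'CZ']})

# Escaped city pattern, as in the solution above (duplicates removed)
names = cities['ASCII Name'].dropna().unique()
pat = r'\b(%s)\b' % '|'.join(map(re.escape, names))
df['ASCII Name'] = df['Company'].str.extract(pat, expand=False)

# Assumption: the last two capital letters of the string are the country code
df['Alpha2'] = df['Company'].str.extract(r'\b([A-Z]{2})$', expand=False)

# A left merge keeps every company row, even if no city matches
merged = df.merge(cities, on=['ASCII Name', 'Alpha2'], how='left')
print(merged[['ASCII Name', 'Country', 'Alpha2']])
```

Merging on the city name alone would duplicate rows for names that occur in more than one country, which is why the second key is used here.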

Wow, good catch - I have seen the special characters. But it didn't cross my mind that this would cause problems with regex :facepalm: Thanks for the hint and the solution - works perfectly. Many thanks! Just a small question on the side: how did you find the sample lines in question so quickly? How did you figure out that it's because of the regex pattern? The error message doesn't give any hint... — Paul G., Mar 26 '23 at 07:23
Glad to help, @PaulG.! Actually, there is a hint in the error message, which says 'Cannot set a DataFrame with multiple columns to the single column.' Basically, `str.extract` will only produce multiple columns when you have multiple capturing groups in the pattern. Based on this hypothesis, I used the following code to check for rows that have parentheses in the name column: `world_cities[world_cities['ASCII Name'].str.contains(r'\(', na=False)]`. — Shubham Sharma, Mar 26 '23 at 07:28
Really appreciate your support and the explanation, @Shubham Sharma. This was really helpful! — Paul G., Mar 26 '23 at 07:51
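The check from the comment above looks only for an opening parenthesis. As a hedged sketch, the same idea can be widened to any character with a special meaning in Python regular expressions; the sample names below are hypothetical stand-ins for world_cities['ASCII Name'].

```python
import pandas as pd

# Hypothetical sample standing in for world_cities['ASCII Name']
names = pd.Series(['Mumbai', 'Halle (Saale)', 'St. Gallen', None, 'Zibo'])

# Character class listing the regex metacharacters; inside [...] they are literal
meta = r'[.^$*+?{}\[\]\\|()]'

# na=False treats missing names as "no match" instead of propagating NaN
suspect = names[names.str.contains(meta, na=False)]
print(suspect.tolist())
```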
