I have a question about a merge of two pd.dataframes based on a strg-pattern in a column. There are some very helpful discussions on stackoverlow and I found an approach (Merge two dataframe if one string column is contained in another column in Pandas) that fits to my requirements very well.
This approach works perfect in my MWE.
# Target-df
df = pd.DataFrame({'Company':['MAC CHEM PRODUCTS (INDIA) PVT. LTD. Mumbai IN',
'SIEGFRIED LTD. Zofingen CH',
'SHANDONG JINYANG PHARMACEUTICAL CO., LTD. Zibo City CN',
'CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ',
],
'Certificate+Number':['R1-CEP 2012-025 - Rev 02',
'R2-CEP 1996-036 - Rev 02',
'R0-CEP 2008-165 - Rev 00',
'R1-CEP 2002-193 - Rev 00',
],
'Substance':['Suxamethonium Chloride',
'Amitriptyline hydrochloride',
'Oxytetracycline hydrochloride',
'Ephedrine hydrochloride',
],
}
)
# print(df)
Company | Certificate+Number | Substance |
---|---|---|
MAC CHEM PRODUCTS (INDIA) PVT. LTD. Mumbai IN | R1-CEP 2012-025 - Rev 02 | Suxamethonium Chloride |
SIEGFRIED LTD. Zofingen CH | R2-CEP 1996-036 - Rev 02 | Amitriptyline hydrochloride |
SHANDONG JINYANG PHARMACEUTICAL CO., LTD. Zibo City CN | R0-CEP 2008-165 - Rev 00 | Oxytetracycline hydrochloride |
CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ | R1-CEP 2002-193 - Rev 00 | Ephedrine hydrochloride |
Second, I have a huge df with information on cities, countries, country-codes etc. First, as a minimal example:
world_cities_min = pd.DataFrame({'Geoname ID':[1275339,
'2657915',
'1785286',
'3061344',
],
'City':['Mumbai',
'Zofingen',
'Zibo',
'Zibo',
],
'ASCII Name':['Mumbai',
'Zofingen',
'Zibo',
'City',
],
'Country':['India',
'Switzerland',
'China',
'Czech Republic',
],
'Alpha2':['IN',
'CH',
'CN',
'CZ',
],
})
#print(world_cities_min.head(5))
Geoname ID | City | ASCII Name | Country | Alpha2 |
---|---|---|---|---|
1275339 | Mumbai | Mumbai | India | IN |
2657915 | Zofingen | Zofingen | Switzerland | CH |
1785286 | Zibo | Zibo | China | CH |
3061344 | Zibo | City | Czech Republic | CZ |
Extract pattern to find city-names (according to the approach from source Merge two dataframe if one string column is contained in another column in Pandas
pat = '|'.join(r"\b{}\b".format(x) for x in world_cities_min['ASCII Name'])
# and create column in target-df according to the name of the city
df['ASCII Name']= df['Company'].str.extract('('+ pat + ')', expand=False)
#print(df)
However, when I use the complete df of worldcities, I get the following error: ValueError: Cannot set a DataFrame with multiple columns to the single column ASCII Name
# Once again, the original target-df
df = pd.DataFrame({'Company':['MAC CHEM PRODUCTS (INDIA) PVT. LTD. Mumbai IN',
'SIEGFRIED LTD. Zofingen CH',
'SHANDONG JINYANG PHARMACEUTICAL CO., LTD. Zibo City CN',
'CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ',
],
'Certificate+Number':['R1-CEP 2012-025 - Rev 02',
'R2-CEP 1996-036 - Rev 02',
'R0-CEP 2008-165 - Rev 00',
'R1-CEP 2002-193 - Rev 00',
],
'Substance':['Suxamethonium Chloride',
'Amitriptyline hydrochloride',
'Oxytetracycline hydrochloride',
'Ephedrine hydrochloride',
],
}
)
Loading the complete df
url = 'https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/geonames-all-cities-with-a-population-1000/exports/csv?lang=en&timezone=Europe%2FBerlin&use_labels=true&delimiter=%3B'
column_names = ['Geoname ID',
'Name',
'ASCII Name',
'Alternate Names',
'Feature Class',
'Feature Code',
'Country Code',
'Country name EN',
'Country Code 2' ,
'Admin1 Code' ,
'Admin2 Code' ,
'Admin3 Code',
'Admin4 Code',
'Population',
'Elevation',
'DIgital Elevation Model',
'Timezone',
'Modification date',
'LABEL EN',
'Coordinates'
]
world_cities = pd.read_csv(url,
header=1,
sep=';',
names=column_names,
usecols = [
'Name',
'ASCII Name',
'Country Code' ,
'Country name EN',
'Coordinates'],
converters={
},
)
... doing the same thing:
pat = '|'.join(r"\b{}\b".format(x) for x in world_cities_min['ASCII Name'])
# and create column in target-df according to the name of the city
df['ASCII Name']= df['Company'].str.extract('('+ pat + ')', expand=False)
#print(df)
Leads to: ValueError: Cannot set a DataFrame with multiple columns to the single column ASCII Name
May I ask you to help me troubleshoot? Where is the issue in the complete df, and how can I deal with it? My overall goal is to keep the City, Country Name and Alpha2 code as separate columns. Unfortunately, the information is present in df['Company']
without a unique str-pattern
Thank you very much for any advice.