I have a weak grasp of Pandas and not a strong understanding of Python.
I am wanting to update a column (d.Alias
) based on the value of existing columns (d.Company
and d2.Alias
). d.Alias
should be equal to d2.Alias
if d2.Alias
is a substring of d.Company
.
Example datasets:
d = {'Company': ['The Cool Company Inc', 'Cool Company, Inc', 'The Cool
Company', 'The Shoe Company', 'Muffler Store', 'Muffler Store'],
'Position': ['Cool Job A', 'Cool Job B', 'Cool Job C', 'Salesman',
'Sales', 'Technician'],
'City': ['Tacoma', 'Tacoma','Tacoma', 'Boulder', 'Chicago', 'Chicago'],
'State': ['AZ', 'AZ', 'AZ', 'CO', 'IL', 'IL'],
'Alias': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}
d2 = {'Company': ['The Cool Company, Inc.', 'The Shoe Company', 'Muffler
Store LLC'],
'Alias': ['Cool Company', np.nan, 'Muffler'],
'First Name': ['Carol', 'James', 'Frankie'],
'Last Name': ['Fisher', 'Smith', 'Johnson']}
The np.nan
for The Shoe Company
is because for that instance an alias is not necessary.
I have tried using .loc
, for
loops, while
loops, pandas.where
, numpy.where
, and several variations of each with no desirable outcomes. When using a for
loop, the end of d2.Alias
was copied to all rows in d.Alias
. I have not been able to reproduce that, however.
Previous posts that I have looked at which I wasn't able to get to work, or I didn't understand them: Conditionally fill column with value from another DataFrame based on row match in Pandas pandas create new column based on values from other columns
Any help is greatly appreciated!
EDIT:
Update:
After a few days of tinkering I reached the desired outcome. With Wen's response I had to change a couple of things.
First, I created a list from df2.Alias
called aliases
:
aliases = df2.Alias.unique()
Then, I had to remove .map(df2.set_index('Company').Alias
. The line that generated my desired resutls:
df1['Alias'] = df1.Company.apply(lambda x: [process.extract(x, aliases, limit=1)][0][0][0])
.