I try to solve the following problem:
I have these two dataframes, df1
:
label | term |
---|---|
A01.12 | "Culture de légumes, de melons, de racines et de tubercules" |
A01.14 | Culture de la canne à sucre |
and df2
:
company | organization_activity |
---|---|
coca-cola | "Culture de légumes, de melons, de racines et de tubercules" |
rum-factory | Culture de la canne à sucre |
I now would like to map the "term" of df1
to the "organization_activity" of df2
and create a new column with the name "codes" in df2
with the respective label in df1
.
The resulting dataframe should look like this:
organization_activity | codes |
---|---|
"Culture de légumes, de melons, de racines et de tubercules" | A01.12 |
Culture de la canne à sucre | A01.14 |
This partially works by using:
df2['codes'] = df2['organization_activity'].map(df1.set_index('term')['label'].to_dict())
Oddly the new column is created and filled but for a lot of terms the mapping does not work and a Nan
value is inserted even though the "organization_activity" I am mapping exists in the "term" column. Does anyone have an idea on why this could happen? The formatting for the values in "term" and "organization_activity" is equal and there doesn't seem to be a difference in the format of values that are mappable and ones that are not. Is there maybe a different way to do this?
The dataframe I showed is just an abstract so there are multiple companies that can have the same "organization_activity".