
I have two DataFrames, df1 and df:

df1

    import pandas as pd

    df1 = pd.DataFrame({'ids': [101, 102, 103], 'vals': ['apple', 'java', 'python']})

   ids    vals
0  101   apple
1  102    java
2  103  python

df

    df = pd.DataFrame({'TEXT_DATA': ['apple a day keeps doctor away',
                                     'apple tree in my farm',
                                     'python is not new language',
                                     'Learn python programming',
                                     'java is second language']})

                       TEXT_DATA
0  apple a day keeps doctor away
1          apple tree in my farm
2     python is not new language
3       Learn python programming
4        java is second language

What I want to do is update the column values based on the filtered data and map the matched ids to a new column, so that my output is:

                       TEXT_DATA      NEW_COLUMN
0  apple a day keeps doctor away      101
1          apple tree in my farm      101
2     python is not new language      103
3       Learn python programming      103
4        java is second language      102

I tried matching using

df[df['TEXT_DATA'].str.contains("apple")]

Is there any way I can do this?

Andre_k

3 Answers


You could do something like this:

my_words = {'python': 103, 'apple': 101, 'java': 102}
for word in my_words:
    # assign the id to NEW_COLUMN on the rows whose text contains the word
    df.loc[df['TEXT_DATA'].str.contains(word, na=False), 'NEW_COLUMN'] = my_words[word]
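
If you prefer not to hard-code the mapping, a small variant (just a sketch, assuming the df1 and df defined in the question) builds the dictionary from df1 directly:

# build the word -> id mapping from df1 instead of typing it out
my_words = dict(zip(df1['vals'], df1['ids']))
for word, word_id in my_words.items():
    # rows whose TEXT_DATA contains the word get that id in NEW_COLUMN
    df.loc[df['TEXT_DATA'].str.contains(word, na=False), 'NEW_COLUMN'] = word_id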
NoSplitSherlock
  • This gives me the error "ValueError: cannot index with vector containing NA / NaN values". If possible, could you show me an output for this? – Andre_k Mar 13 '19 at 13:59
  • Edited answer to remedy NaN values. – NoSplitSherlock Mar 13 '19 at 14:03
  • "my_words = {'python': 103, 'apple': 101, 'java': 102} for word in my_words.keys(): df.loc[df['TEXT_DATA'].str.contains(word, na=False), ['TEXT_DATA']] = my_words[word] " overwrites the value, it doesn't add another column – Andre_k Mar 13 '19 at 14:07
  • Output from your code " TEXT_DATA 0 101 1 101 2 103 3 103 4 102" – Andre_k Mar 13 '19 at 14:07
  • 1
    I edited the answer, but this is something you could really have researched yourself by looking at the answer for a couple of minutes. – NoSplitSherlock Mar 13 '19 at 14:08

First, extract the values in df1['vals'] as a regex pattern. Then create a new column holding the extraction result, and finally merge both dataframes.

# build an alternation pattern from the keywords, e.g. 'apple|java|python'
extr = '|'.join(df1['vals'])
# pull the first matching keyword out of each row's text
df['vals'] = df['TEXT_DATA'].str.extract('(' + extr + ')', expand=False)
# bring in the ids by joining on the extracted keyword
newdf = pd.merge(df, df1, on='vals', how='left')

To select just the relevant fields in the result, index with the column names:

newdf[['TEXT_DATA','ids']]
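
If you want the output to use the NEW_COLUMN name shown in the question, a possible final step (a sketch, assuming the newdf built above; the result name is just illustrative) is:

# rename ids to NEW_COLUMN and keep only the two columns from the desired output
result = newdf.rename(columns={'ids': 'NEW_COLUMN'})[['TEXT_DATA', 'NEW_COLUMN']]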
YusufUMS

You could use a Cartesian product of both dataframes and then select the relevant rows and columns.

# Cartesian product: every TEXT_DATA row paired with every (ids, vals) row
tmp = df.assign(key=1).merge(df1.assign(key=1), on='key').drop(columns='key')
# keep only the pairs where the keyword actually occurs in the text
result = (tmp.loc[tmp.apply(lambda x: x.vals in x.TEXT_DATA, axis=1)]
             .drop(columns='vals')
             .reset_index(drop=True))
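
As an aside, on pandas 1.2 or newer the key=1 trick can be replaced by a cross merge; a minimal sketch of the same idea, assuming the df and df1 from the question:

# pandas >= 1.2: how='cross' builds the Cartesian product directly
tmp = df.merge(df1, how='cross')
result = (tmp.loc[tmp.apply(lambda x: x.vals in x.TEXT_DATA, axis=1)]
             .drop(columns='vals')
             .reset_index(drop=True))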
Serge Ballesta