
I have a pandas dataframe like this:

Text            start    end    entity     value
I love apple      7       11    fruit      apple
I ate potato      6       11    vegetable  potato

I have tried using a for loop, but it runs slowly, and I don't think that is how pandas should be used.

I want to create another pandas dataframe based on this one, like:

Sentence#         Word        Tag
  1                I         Object 
  1               love       Object
  1               apple      fruit
  2                I         Object
  2               ate        Object
  2               potato     vegetable

That is, split the Text column into words and number each sentence. Every word other than the entity word should be tagged as Object.

rafaelc
Lykosz
  • This is going to be much much harder if "value" has phrases or sentences (i.e., more than a single word). – cs95 Mar 31 '19 at 20:24
  • @coldspeed I do encounter this problem now that 'value' has phrases and sentences, do you happen to know the solution of this much harder problem? – Lykosz Apr 10 '19 at 21:34
  • It is a much more involved solution... I recommend opening a new question. If you do not have an answer in 2 days, let me know and I'll instate a bounty on it. – cs95 Apr 10 '19 at 21:49

3 Answers


Use split, stack and map:

u = df.Text.str.split(expand=True).stack()

pd.DataFrame({
    'Sentence': u.index.get_level_values(0) + 1, 
    'Word': u.values, 
    'Entity': u.map(dict(zip(df.value, df.entity))).fillna('Object').values
})

   Sentence    Word     Entity
0         1       I     Object
1         1    love     Object
2         1   apple      fruit
3         2       I     Object
4         2     ate     Object
5         2  potato  vegetable

Side note: If running v0.24 or later, please use .to_numpy() instead of .values.
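
For anyone who wants to run the approach above end to end, here is a minimal self-contained sketch. The sample frame is rebuilt from the question, and the names `tag_by_word` and `result` are only illustrative. It assumes pandas 0.24 or later (for `.to_numpy()`) and that every `value` is a single word.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Text": ["I love apple", "I ate potato"],
    "start": [7, 6],
    "end": [11, 11],
    "entity": ["fruit", "vegetable"],
    "value": ["apple", "potato"],
})

# One row per word; the first index level is the original row position
u = df["Text"].str.split(expand=True).stack()

# Lookup from each entity word to its tag (illustrative name)
tag_by_word = dict(zip(df["value"], df["entity"]))

result = pd.DataFrame({
    "Sentence": u.index.get_level_values(0) + 1,
    "Word": u.to_numpy(),
    # Words missing from the lookup fall back to "Object"
    "Tag": u.map(tag_by_word).fillna("Object").to_numpy(),
})
print(result)
```

One thing to keep in mind: the lookup is built across the whole frame, so a word that is an entity in one sentence is tagged the same way wherever else it appears.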

cs95
  • Love this answer. Clean piece of code. Sidenote: pandas doc recommends `.to_numpy()` over `.values` – Erfan Mar 31 '19 at 20:28
  • @Erfan [I know ;D](https://stackoverflow.com/a/54508052/4909087) Only thing is, if OP is on an older version (very likely), `to_numpy` will just throw an attribute error. But I will make a note of that in the answer. – cs95 Mar 31 '19 at 20:30
  • I see, makes sense. Should've been more clear. I meant it as a sidenote as well. – Erfan Mar 31 '19 at 20:31
  • Btw thanks for linking that thread, great piece of information. Couldn't find this in the docs. Was wondering what the reason for deprecation of `.values` was. – Erfan Mar 31 '19 at 20:33

I am using unnesting (a custom helper function, not part of pandas) after str.split:

df.Text=df.Text.str.split(' ')
yourdf=unnesting(df,['Text'])
yourdf.loc[yourdf.Text.values!=yourdf.value.values,'entity']='object'
yourdf
     Text  start  end     entity   value
0       I      7   11     object   apple
0    love      7   11     object   apple
0   apple      7   11      fruit   apple
1       I      6   11     object  potato
1     ate      6   11     object  potato
1  potato      6   11  vegetable  potato
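
Since `unnesting` is not defined in this answer, here is a rough equivalent that swaps in the built-in `DataFrame.explode` instead. This is an assumption on my part, not the helper the answer refers to, and it needs pandas 0.25 or later. The sample frame is rebuilt from the question and `out` is an illustrative name.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Text": ["I love apple", "I ate potato"],
    "start": [7, 6],
    "end": [11, 11],
    "entity": ["fruit", "vegetable"],
    "value": ["apple", "potato"],
})

# Split each sentence into a list of words, then give every word its own row.
# explode keeps the original row label, so the index becomes 0, 0, 0, 1, 1, 1.
out = df.assign(Text=df["Text"].str.split(" ")).explode("Text")

# Boolean mask as a plain array, so the duplicated index labels do not matter
not_entity = (out["Text"] != out["value"]).to_numpy()
out.loc[not_entity, "entity"] = "object"
print(out)
```

The repeated index doubles as a zero-based sentence number; `out.index + 1` would give the `Sentence#` column from the question.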
BENY

Using the expand function I posted in this thread (defined below), you can do:

df = expand(df, 'Text', sep=' ')

Then simply:

df['Tag'] = np.where(df.Text.ne(df.value), ['Object'], df.entity)


>>> df[['Text', 'Tag']]

    Text    Tag
0   I       Object
1   love    Object
2   apple   fruit
3   I       Object
4   ate     Object
5   potato  vegetable

def expand(df, col, sep=','):
    r = df[col].str.split(sep)
    d = {c: df[c].values.repeat(r.str.len(), axis=0) for c in df.columns}
    d[col] = [i for sub in r for i in sub]
    return pd.DataFrame(d)
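
Putting the pieces of this answer together, a runnable sketch might look like the following. The sample frame is rebuilt from the question. The `Sentence` column is an addition of mine, not part of the original answer: it is attached before expanding so that it gets repeated along with the other columns. `expanded` is an illustrative name.

```python
import numpy as np
import pandas as pd

def expand(df, col, sep=","):
    # Split the target column into lists of tokens
    r = df[col].str.split(sep)
    counts = r.str.len().to_numpy()
    # Repeat every column once per token in the corresponding row
    d = {c: df[c].to_numpy().repeat(counts) for c in df.columns}
    # Replace the target column with the flattened tokens
    d[col] = [token for tokens in r for token in tokens]
    return pd.DataFrame(d)

# Sample data copied from the question
df = pd.DataFrame({
    "Text": ["I love apple", "I ate potato"],
    "start": [7, 6],
    "end": [11, 11],
    "entity": ["fruit", "vegetable"],
    "value": ["apple", "potato"],
})

# Assumption: number the sentences first so the number survives the expansion
numbered = df.assign(Sentence=np.arange(1, len(df) + 1))
expanded = expand(numbered, "Text", sep=" ")

# Keep the entity tag only where the word equals the entity value
expanded["Tag"] = np.where(
    expanded["Text"].ne(expanded["value"]), "Object", expanded["entity"]
)
print(expanded[["Sentence", "Text", "Tag"]])
```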
rafaelc