Split sentences in pandas into sentence number and words

Question

I have a pandas dataframe like this:

Text            start    end    entity     value
I love apple      7       11    fruit      apple
I ate potato      6       11    vegetable  potato

I have tried to use a for loop, but it runs slowly, and I don't think that is how this should be done with pandas.

I want to create another pandas dataframe based on this one, like:

Sentence#         Word        Tag
  1                I         Object 
  1               love       Object
  1               apple      fruit
  2                I         Object
  2               ate        Object
  2               potato     vegetable

That is, split the Text column into words with sentence numbers. Every word other than the entity word should be tagged as Object.

This is going to be much much harder if "value" has phrases or sentences (i.e., more than a single word). — cs95, Mar 31 '19 at 20:24
@coldspeed I do encounter this problem now that 'value' has phrases and sentences, do you happen to know the solution of this much harder problem? — Lykosz, Apr 10 '19 at 21:34
It is a much more involved solution... I recommend opening a new question. If you do not have an answer in 2 days, let me know and I'll instate a bounty on it. — cs95, Apr 10 '19 at 21:49

cs95 · Accepted Answer · 2019-03-31T20:31:21.947

6

Use split, stack and map:

u = df.Text.str.split(expand=True).stack()

pd.DataFrame({
    'Sentence': u.index.get_level_values(0) + 1, 
    'Word': u.values, 
    'Entity': u.map(dict(zip(df.value, df.entity))).fillna('Object').values
})

   Sentence    Word     Entity
0         1       I     Object
1         1    love     Object
2         1   apple      fruit
3         2       I     Object
4         2     ate     Object
5         2  potato  vegetable

Side note: If running v0.24 or later, please use .to_numpy() instead of .values.
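For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same split/stack/map idea. The sample frame is retyped from the question, the output column is named Tag to match the requested layout, and the names `tag_by_word` and `result` are only illustrative. The extra `dropna()` is an assumption on my part, a guard for sentences of different lengths in case `stack()` keeps the padding values.

```python
import pandas as pd

# Sample frame retyped from the question.
df = pd.DataFrame({
    "Text": ["I love apple", "I ate potato"],
    "start": [7, 6],
    "end": [11, 11],
    "entity": ["fruit", "vegetable"],
    "value": ["apple", "potato"],
})

# One row per word; level 0 of the resulting index is the original row position.
# dropna() discards padding that split(expand=True) adds for shorter sentences.
u = df["Text"].str.split(expand=True).stack().dropna()

# Lookup from entity word to tag, built from every row of the frame.
tag_by_word = dict(zip(df["value"], df["entity"]))

result = pd.DataFrame({
    "Sentence": u.index.get_level_values(0) + 1,
    "Word": u.to_numpy(),
    # Words missing from the lookup map to NaN and fall back to "Object".
    "Tag": u.map(tag_by_word).fillna("Object").to_numpy(),
})
print(result)
```

Note that the lookup is built from the whole value column, so a word is tagged whenever it matches the value of any row, not only the value of its own sentence.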

edited Mar 31 '19 at 20:31

answered Mar 31 '19 at 20:18


Love this answer. Clean piece of code. Sidenote: pandas doc recommends `.to_numpy()` over `.values` – Erfan Mar 31 '19 at 20:28
@Erfan [I know ;D](https://stackoverflow.com/a/54508052/4909087) Only thing is, if OP is on an older version (very likely), `to_numpy` will just throw an attribute error. But I will make a note of that in the answer. – cs95 Mar 31 '19 at 20:30
I see, makes sense. Should've been clearer. I meant it as a sidenote as well. – Erfan Mar 31 '19 at 20:31
Btw thanks for linking that thread, great piece of information. Couldn't find this in the docs. Was wondering what the reason for the deprecation of `.values` was. – Erfan Mar 31 '19 at 20:33

score 2 · Answer 2 · answered Mar 31 '19 at 20:15

I am using unnesting (a helper function, not defined here) after str.split:

df.Text = df.Text.str.split(' ')
yourdf = unnesting(df, ['Text'])
yourdf.loc[yourdf.Text.values != yourdf.value.values, 'entity'] = 'object'
yourdf

     Text  start  end     entity   value
0       I      7   11     object   apple
0    love      7   11     object   apple
0   apple      7   11      fruit   apple
1       I      6   11     object  potato
1     ate      6   11     object  potato
1  potato      6   11  vegetable  potato
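The `unnesting` helper is not shown in this answer. Assuming its job is to give each list element its own row, the built-in `DataFrame.explode` (pandas 0.25 or later) is a close stand-in. The sketch below is that substitution, not the helper itself; the sample frame is retyped from the question and the names `words` and `is_entity` are only illustrative.

```python
import pandas as pd

# Sample frame retyped from the question.
df = pd.DataFrame({
    "Text": ["I love apple", "I ate potato"],
    "start": [7, 6],
    "end": [11, 11],
    "entity": ["fruit", "vegetable"],
    "value": ["apple", "potato"],
})

# Split each sentence into a list of words, then give every word its own row.
# explode() repeats the other columns and keeps the original row label.
words = df.assign(Text=df["Text"].str.split()).explode("Text")

# Keep the entity only where the word equals that row's own value.
is_entity = (words["Text"] == words["value"]).to_numpy()
words["entity"] = words["entity"].where(is_entity, "object")
print(words)
```

Because the comparison is made row by row, a word is tagged only when it matches the value of its own sentence. The repeated index (0, 0, 0, 1, 1, 1) can serve as the sentence number after adding 1.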

Wow, never saw that thread, awesome. Have faved in my bookmarks ;} — rafaelc, Mar 31 '19 at 20:17

score 2 · Answer 3 · answered Mar 31 '19 at 20:16

Using the expand function I posted in this thread (defined at the end of this answer), you can do:

df = expand(df, 'Text', sep=' ')

Then simply:

df['Tag'] = np.where(df.Text.ne(df.value), ['Object'], df.entity)


>>> df[['Text', 'Tag']]

    Text    Tag
0   I       Object
1   love    Object
2   apple   fruit
3   I       Object
4   ate     Object
5   potato  vegetable

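def expand(df, col, sep=','):
    # Split each string in `col` into a list of pieces.
    r = df[col].str.split(sep)
    # Repeat every column's values once per piece in the corresponding row.
    d = {c: df[c].values.repeat(r.str.len(), axis=0) for c in df.columns}
    # Replace `col` with the flattened pieces, one per output row.
    d[col] = [i for sub in r for i in sub]
    return pd.DataFrame(d)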
