Tokenise text and create more rows for each row in dataframe

Question

I want to do this with python and pandas.

Let's suppose that I have the following:

file_id   text
1         I am the first document. I am a nice document.
2         I am the second document. I am an even nicer document.

and I finally want to have the following:

file_id   text
1         I am the first document
1         I am a nice document
2         I am the second document
2         I am an even nicer document

So I want the text of each file to be split at every full stop, with a new row created for each resulting sentence.

What is the most efficient way to do this?

You can use `nltk.tokenize.sent_tokenize('text')` to split sentences. – Shijith, May 24 '19 at 09:59

jezrael · Accepted Answer · 2019-05-24T10:22:17.400

Use:

# Raw string so that \. and \s reach the regex engine unchanged
s = (df.pop('text')
      .str.strip('.')
      .str.split(r'\.\s+', expand=True)
      .stack()
      .rename('text')
      .reset_index(level=1, drop=True))

df = df.join(s).reset_index(drop=True)
print(df)
   file_id                         text
0        1      I am the first document
1        1         I am a nice document
2        2     I am the second document
3        2  I am an even nicer document

Explanation:

First, extract the column with DataFrame.pop, remove the trailing . with Series.str.strip, and split with Series.str.split, escaping the . because it is a special regex character. Then reshape the result into a Series with DataFrame.stack, drop the inner index level with Series.reset_index, and rename the Series so that DataFrame.join can attach it to the original.
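
The chain above can also be read one step at a time. The following is a minimal sketch of the same idea, assuming the sample data from the question; the intermediate variable names are illustrative only, and the `.dropna()` call is an addition that guards against rows with different numbers of sentences, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "file_id": [1, 2],
    "text": [
        "I am the first document. I am a nice document.",
        "I am the second document. I am an even nicer document.",
    ],
})

# 1. Take the text column out of the frame (df now holds only file_id).
text = df.pop("text")

# 2. Drop the trailing full stop so the split leaves no empty last piece.
trimmed = text.str.strip(".")

# 3. Split on "full stop + whitespace"; expand=True gives one column per sentence.
wide = trimmed.str.split(r"\.\s+", expand=True)

# 4. Stack the columns into rows; the index becomes (original row, sentence number).
#    dropna() is an added safeguard for rows that have fewer sentences than others.
stacked = wide.stack().dropna()

# 5. Name the Series and drop the sentence-number level so the index matches df.
sentences = stacked.rename("text").reset_index(level=1, drop=True)

# 6. Join on the index: each file_id row is repeated once per sentence.
result = df.join(sentences).reset_index(drop=True)
print(result)
```

Because the join is on the index, any other columns left in `df` are repeated alongside `file_id` without extra work.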

edited May 24 '19 at 10:22

answered May 24 '19 at 10:08

I was waiting for you @jezrael! Thanks for the answer (upvote). Not very easily readable again (at least for a non-expert in pandas). By the way, how would your answer change if I told you that you also have to split the text every time you encounter a newline (\n) or a forward slash (/)? – Outcast May 24 '19 at 10:20
Also, by the way, would your code work if I have other columns too at the right of the text column? – Outcast May 24 '19 at 10:30
@PoeteMaudit - Yes, my code works with multiple columns too. – jezrael May 24 '19 at 10:31
Ok, cool, I mean that I have columns at the right of the text column but I do not want to do anything particular with them except for what I did with the file_id. I only want to split the text column. – Outcast May 24 '19 at 10:34
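
For the follow-up about newlines, forward slashes and extra columns, one possible approach is sketched below. It is not the answerer's code: it swaps in DataFrame.explode (which assumes pandas 0.25 or later), and the sample text and the `source` column are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "file_id": [1, 2],
    "text": ["First part. Second part\nThird part", "Left/Right. End."],
    "source": ["a.txt", "b.txt"],  # hypothetical extra column
})

# Split on a full stop, a newline or a forward slash,
# swallowing any whitespace that follows the delimiter.
pieces = df["text"].str.split(r"[.\n/]\s*")

# Replace the text column with the lists, then give each list item its own row.
# All other columns (file_id, source) are repeated automatically.
out = df.assign(text=pieces).explode("text")

# Trim whitespace and drop the empty piece left by a trailing delimiter.
out["text"] = out["text"].str.strip()
out = out[out["text"] != ""].reset_index(drop=True)
print(out)
```

The character class `[.\n/]` is the only part that changes when more delimiters are needed; the rest of the pipeline stays the same.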

score 0 · Answer 2 · answered May 24 '19 at 10:07

import pandas as pd

df = pd.DataFrame({'file_id': [1, 2],
                   'text': ["I am the first document. I am a nice document.",
                            "I am the second document. I am an even nicer document."]})

# Split on full stops, trim surrounding whitespace and drop empty pieces
df['sents'] = df.text.apply(lambda txt: [x.strip() for x in txt.split(".") if x.strip()])
# One column per sentence, then stack them into one row per sentence
df = (df.set_index(['file_id'])
        .apply(lambda x: pd.Series(x['sents']), axis=1)
        .stack()
        .reset_index(level=1, drop=True))
df = df.reset_index()
df.columns = ['file_id', 'text']
