
I have a large pandas dataframe with a lot of documents:

        id    text
    1   doc2  Google i...
    2   doc3  Amazon...
    3   doc4  This was...
    ...
    n   docN  nice camara...

How can I split every document into sentences and stack them into a new dataframe, with each sentence carrying over its document's id?:

        id    text
    1   doc1  Google is a great company.
    2   doc1  It is in silicon valley.
    3   doc1  Their search engine is the best
    4   doc2  Amazon is a great store.
    5   doc2  it is located in Seattle.
    6   doc2  its new product is alexa.
    7   doc2  its expensive.
    8   doc3  This was a great product.
    ...
    n   docN  nice camara I really liked it.

I tried to:

import nltk

def sentence(document):
    # Split one document into a list of sentences
    sentences = nltk.sent_tokenize(document.strip())
    return sentences

df['sentence'] = df['text'].apply(sentence)
df.stack(level=0)

However, it did not work. Any idea of how to stack the sentences while carrying over the id of the document they come from?

john doe
  • What is the difference between the first and the second frames? – DYZ Jan 04 '17 at 20:01
  • To follow up, it would be helpful if you provided a small, reproducible example to illustrate your problem. Take a look at [this post](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for tips. – lmo Jan 04 '17 at 20:04
  • The second is a dataframe composed of all the sentences of the documents, each carrying its respective id. @DYZ – john doe Jan 04 '17 at 20:05

3 Answers


This iterates over each document with apply so that nltk.sent_tokenize can split it into sentences. The Series constructor then spreads those sentences into their own columns.

df1 = df['text'].apply(lambda x: pd.Series(nltk.sent_tokenize(x)))
df1.set_index(df['id']).stack()

Example with fake data

df=pd.DataFrame({'id':['doc1', 'doc2'], 'text' :['This is a sentence. And another. And one more. cheers', 
                                                 'here are more sentences. yipee. woop.']})

df1 = df['text'].apply(lambda x: pd.Series(nltk.sent_tokenize(x)))
df1.set_index(df['id']).stack().reset_index().drop('level_1', axis=1)

     id                         0
0  doc1       This is a sentence.
1  doc1              And another.
2  doc1             And one more.
3  doc1                    cheers
4  doc2  here are more sentences.
5  doc2                    yipee.
6  doc2                     woop.
Ted Petrou
  • Also, I would like to split the text by sentence with nltk; splitting by period is fine, but it is not going to produce the same results as the nltk function. – john doe Jan 04 '17 at 20:10
  • Check my updated answer that uses nltk.sent_tokenize – Ted Petrou Jan 04 '17 at 20:20

There is a solution to the problem that is similar to yours here: pandas: When cell contents are lists, create a row for each element in the list. Here's my interpretation of it with respect to your particular task:

df['sents'] = df['text'].apply(lambda x: nltk.sent_tokenize(x))
s = df.apply(lambda x: pd.Series(x['sents']), axis=1).stack().\
                                 reset_index(level=1, drop=True)
s.name = 'sents'
df = df.drop(['sents','text'], axis=1).join(s)
DYZ
  • This actually worked. Thank you very much! Could you provide an explanation of the stack() usage and the reset_index? – john doe Jan 04 '17 at 20:26
  • `.stack()` replaces Series column indexes with DataFrame row indexes, essentially "transposing" the Series. `.reset_index()` converts the second level of indexes into a column and then drops it. – DYZ Jan 04 '17 at 20:38
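The behavior that comment describes can be sketched on a tiny made-up frame, with no tokenizer involved:

```python
import pandas as pd

# A wide frame like the one apply + pd.Series produces: one row per
# document, one column per sentence, NaN where a document is shorter.
wide = pd.DataFrame({0: ['a.', 'c.'], 1: ['b.', None]},
                    index=['doc1', 'doc2'])

# stack() moves the column labels into a second index level and drops
# the NaN cells, giving one row per sentence.
long = wide.stack()

# reset_index(level=1, drop=True) throws that second level away, so
# only the document id remains as the index.
flat = long.reset_index(level=1, drop=True)
```

`flat` ends up indexed by `doc1, doc1, doc2`, one sentence per row.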

I think you would find this a lot easier if you kept your corpus out of pandas. Here is my solution; I fit it back into a pandas dataframe at the end. I think this is probably the most scalable approach.

def stack(one, two):
    sp = two.split(".")
    return [(one, a.strip()) for a in sp if len(a.strip()) > 0]

st = sum(map(stack, df['id'].tolist(),df['text'].tolist()),[])

df2 = pd.DataFrame(st)

df2.columns = ['id','text']
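Run on made-up data (the period split is naive, so abbreviations like "Dr." would be broken up too), the approach above gives one row per sentence:

```python
import pandas as pd

def stack(one, two):
    # Naive period split; empty fragments are dropped
    sp = two.split(".")
    return [(one, a.strip()) for a in sp if len(a.strip()) > 0]

# Made-up data standing in for the real corpus
df = pd.DataFrame({'id': ['doc1', 'doc2'],
                   'text': ['Google is great. It is in Silicon Valley.',
                            'Amazon is a store.']})

st = sum(map(stack, df['id'].tolist(), df['text'].tolist()), [])
df2 = pd.DataFrame(st, columns=['id', 'text'])
```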

If you want to add a sentence ID column, you can make a small tweak.

def stack(one, two):
    sp = two.split(".")
    # enumerate replaces Python 2's xrange; ids are assigned to the raw
    # splits starting at 1, before empty fragments are filtered out
    return [(one, b, a.strip())
            for b, a in enumerate(sp, start=1) if len(a.strip()) > 0]

st = sum(map(stack, df['id'].tolist(), df['text'].tolist()), [])

df2 = pd.DataFrame(st)

df2.columns = ['id', 'sentence_id', 'text']
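As a quick check of the tweaked helper on a single made-up document (`enumerate` here plays the role of Python 2's `xrange`):

```python
def stack(one, two):
    sp = two.split(".")
    # Number the raw splits starting at 1, then drop empty fragments
    return [(one, b, a.strip())
            for b, a in enumerate(sp, start=1) if len(a.strip()) > 0]

rows = stack('doc1', 'First. Second. Third.')
# rows -> [('doc1', 1, 'First'), ('doc1', 2, 'Second'), ('doc1', 3, 'Third')]
```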
Bray