
I have a dataframe that consists of two columns, ID and TEXT. Sample data is below:

ID      TEXT
265     The farmer plants grain. The fisher catches tuna.
456     The sky is blue.
434     The sun is bright.
921     I own a phone. I own a book.

I know nltk functions do not work directly on dataframes. How could sent_tokenize be applied to the above dataframe?

When I try:

df.TEXT.apply(nltk.sent_tokenize)  

The output appears unchanged from the original dataframe. My desired output is:

TEXT
The farmer plants grain.
The fisher catches tuna.
The sky is blue.
The sun is bright.
I own a phone.
I own a book.

In addition, I would like to tie this new (desired) dataframe back to the original ID numbers, like this (after further text cleansing):

ID    TEXT
265     'farmer', 'plants', 'grain'
265     'fisher', 'catches', 'tuna'
456     'sky', 'blue'
434     'sun', 'bright'
921     'I', 'own', 'phone'
921     'I', 'own', 'book'

This question is related to another of my questions here. Please let me know if I can provide anything to help clarify my question!

  • You could use `df["TEXT"].apply(nltk.sent_tokenize)`, but you won't get each sentence on a separate row. – alexis May 11 '17 at 19:24

2 Answers


Edit: as a result of warranted prodding by @alexis, here is a better response.

Sentence Tokenization

This should get you a DataFrame with one row for each ID & sentence:

import pandas

sentences = []
for row in df.itertuples():  # row[0] is the index, row[1] is ID, row[2] is TEXT
    for sentence in row[2].split('.'):
        if sentence != '':  # skip the empty string left after a trailing period
            sentences.append((row[1], sentence))
new_df = pandas.DataFrame(sentences, columns=['ID', 'SENTENCE'])

Whose output looks like this (note that split leaves a leading space on every sentence but the first in each cell):

ID   SENTENCE
265  The farmer plants grain
265  The fisher catches tuna
456  The sky is blue
434  The sun is bright
921  I own a phone
921  I own a book

split('.') will quickly break strings up into sentences if the sentences really are separated by periods and periods are not used for anything else (e.g. denoting abbreviations); it also removes the periods in the process. This will fail if periods serve multiple purposes and/or not every sentence ends with a period. A slower but much more robust approach is to use, as you had asked, sent_tokenize to split rows up by sentence:

import pandas
from nltk import sent_tokenize

sentences = []
for row in df.itertuples():  # row[1] is ID, row[2] is TEXT
    for sentence in sent_tokenize(row[2]):
        sentences.append((row[1], sentence))
new_df = pandas.DataFrame(sentences, columns=['ID', 'SENTENCE'])

This produces the following output (sent_tokenize keeps the trailing periods):

ID   SENTENCE
265  The farmer plants grain.
265  The fisher catches tuna.
456  The sky is blue.
434  The sun is bright.
921  I own a phone.
921  I own a book.
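
To see the difference, consider a string where periods also mark an abbreviation and a decimal (a made-up example, not from the question's data); the pretrained Punkt model behind sent_tokenize typically keeps these intact:

from nltk import sent_tokenize

text = "Dr. Smith plants grain. He owns 2.5 acres."

text.split('.')
# ['Dr', ' Smith plants grain', ' He owns 2', '5 acres', '']

sent_tokenize(text)
# ['Dr. Smith plants grain.', 'He owns 2.5 acres.']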

If you want to quickly remove the trailing periods from these sentences, you could do something like:

new_df['SENTENCE_noperiods'] = new_df.SENTENCE.apply(lambda x: x.strip('.'))

Which would yield the same table with an added SENTENCE_noperiods column holding each sentence minus its trailing period.
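
Equivalently, pandas' vectorized string methods let you drop the lambda:

new_df['SENTENCE_noperiods'] = new_df['SENTENCE'].str.strip('.')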

You can also take an apply-then-join approach (df is your original table):

df = df.join(df.TEXT.apply(sent_tokenize).rename('SENTENCES'))

Yielding a new SENTENCES column in which each cell holds the list of sentences for that row.

Continuing:

# expand each list into its own columns, labelled 0, 1, ...
sentences = df.SENTENCES.apply(pandas.Series)
# rename the integer columns to 'sentence 1', 'sentence 2', ...
sentences.columns = ['sentence {}'.format(n + 1) for n in sentences.columns]

This yields a wide table with one column per sentence (sentence 1, sentence 2, ...), padded with NaN where a row has fewer sentences.

As our indices have not changed, we can join this back into our original table:

df = df.join(sentences)

This gives the original ID and TEXT columns alongside the per-sentence columns.
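
As an aside, if you are on pandas 0.25 or newer, explode collapses the list-column-to-rows dance into a single step and directly produces the long format the question asks for. A minimal sketch, starting again from the original two-column df:

from nltk import sent_tokenize

# requires pandas >= 0.25 for explode
long_df = (df[['ID', 'TEXT']]
           .assign(SENTENCE=lambda d: d['TEXT'].apply(sent_tokenize))  # list per row
           .explode('SENTENCE')  # one row per sentence, ID repeated
           .drop(columns='TEXT')
           .reset_index(drop=True))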

Word Tokenization

Continuing with df from above, we can extract the tokens in a given sentence as follows:

from nltk import word_tokenize

df['sent_1_words'] = df['sentence 1'].apply(word_tokenize)

Each cell of sent_1_words now holds the list of tokens for that row's first sentence (word_tokenize keeps the final period as its own token).
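
If the end goal is the question's desired output (token lists tied back to IDs), it may be simpler to tokenize the long-format new_df built earlier rather than the wide sentence columns; a sketch under that assumption:

from nltk import word_tokenize

# new_df has one row per (ID, SENTENCE) pair, as built above
new_df['WORDS'] = new_df['SENTENCE'].apply(word_tokenize)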

abe
  • Your first set of code worked with a couple tweaks. Changed `row[1]` to `row[2]` and `row[0]` to `row[1]`. Then, I added `new_df['SENTENCE'].replace('', np.nan, inplace=True)` to replace blanks created by periods at the end of sentences in a record that only had one sentence. Finally, I added `new_df = new_df.dropna(subset=['SENTENCE'])` to drop those nulls. – OverflowingTheGlass May 11 '17 at 18:04
  • I'm not going to accept the solution because although it solves my inherent problem, it doesn't answer the crux of my written question which is how to use nltk's sent_tokenize on a dataframe. Please let me know if this is the wrong thing to do though - I'm new here! Thank you very much for your help. – OverflowingTheGlass May 11 '17 at 18:08
  • your indexing is correct - 0 is for the index. have updated my answer. Also, yeah I think that methodology looks fine for removing those blanks. – abe May 11 '17 at 18:40
  • no problem! totally up to you re: accepting, can't say I have done many of these either. If I have a min will try to run the sent_tokenize function. – abe May 11 '17 at 18:54
  • Thanks! The real issue here is converting back and forth between dataframes without losing the original attributes of the dataframe (i.e. the question I linked to in this question). That's the piece that I've been banging my head against for the past couple of days. – OverflowingTheGlass May 11 '17 at 18:56
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/144086/discussion-between-abe-and-alexis). – abe May 12 '17 at 16:19
  • I'm saying you had already answered them before, and both versions have the same shortcomings. But carry on. – alexis May 12 '17 at 17:01

This is a little complicated. I apply sentence tokenization first, then go through each sentence, removing words that appear in the remove_words list and stripping punctuation from each remaining word.

import pandas as pd
from nltk import sent_tokenize, word_tokenize  # word_tokenize is for the alternative noted below
from string import punctuation

remove_words = ['the', 'an', 'a']

def remove_punctuation(chars):
    # drop any punctuation characters from a word
    return ''.join([c for c in chars if c not in punctuation])

# example dataframe
df = pd.DataFrame([[265, "The farmer plants grain. The fisher catches tuna."],
                   [456, "The sky is blue."],
                   [434, "The sun is bright."],
                   [921, "I own a phone. I own a book."]], columns=['sent_id', 'text'])
df.loc[:, 'text_split'] = df.text.map(sent_tokenize)
sentences = []
for _, r in df.iterrows():
    for s in r.text_split:
        filtered_words = [remove_punctuation(w) for w in s.split() if w.lower() not in remove_words]
        # or using nltk.word_tokenize
        # filtered_words = [w for w in word_tokenize(s) if w.lower() not in remove_words and w not in punctuation]
        sentences.append({'sent_id': r.sent_id, 
                          'text': s.strip('.'), 
                          'words': filtered_words})
df_words = pd.DataFrame(sentences)

Output

+-------+--------------------+--------------------+
|sent_id|                text|               words|
+-------+--------------------+--------------------+
|    265|The farmer plants...|[farmer, plants, ...|
|    265|The fisher catche...|[fisher, catches,...|
|    456|     The sky is blue|     [sky, is, blue]|
|    434|   The sun is bright|   [sun, is, bright]|
|    921|       I own a phone|     [I, own, phone]|
|    921|        I own a book|      [I, own, book]|
+-------+--------------------+--------------------+
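
If you want a fuller stop word list than the hand-rolled remove_words, NLTK ships one (it needs a one-time nltk.download('stopwords')); note that it also removes words like 'I', so the result will differ slightly from the desired output above:

from nltk.corpus import stopwords

# requires a one-time: nltk.download('stopwords')
remove_words = set(stopwords.words('english'))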
titipata