
I have a pandas DataFrame representing a library. The columns hold metadata such as author, title, year and text. The text column contains lists with the book text, where each list element is one sentence of the book (see below):

     Author  Title   Text
0    Smith   ABC    ["This is the first sentence", "This is the second sentence"]
1    Green   XYZ    ["Also a sentence", "And the second sentence"]

I want to carry out some NLP analysis on the sentences. For individual examples I would use list comprehensions, but how can I apply a list comprehension to the whole column in the most Pythonic way?

What I want to do is, for example, create a new column containing the list of sentences that include the word "the", as in this example: How to test if a string contains one of the substrings in a list, in pandas?

However, that example uses a DataFrame with a string column, not a list column.

user27074

1 Answer


You can do this by calling apply on the Text column together with a regular expression.

import re
import pandas as pd

data = {
    'Author': ['Smith', 'Green'],
    'Title' : ['ABC', 'XYZ'],
    'Text' : [
        ["This is the first sentence", "This is the second sentence"],
        ["Also a sentence", "And the second sentence"]
    ]
}

df = pd.DataFrame(data)

tokens = [
    'first',
    'second',
    'th'
]

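def find_token(text_list, re_pattern):
    # Keep only the sentences in which the pattern matches.
    # Lowercasing makes the match case-insensitive for lowercase tokens.
    result = [
        text
        for text in text_list
        if re_pattern.search(text.lower())
    ]
    # Return None rather than an empty list when nothing matches.
    return result or None

for token in tokens:
    # re.escape guards against tokens that contain regex metacharacters.
    re_pattern = re.compile(fr'(^|\s){re.escape(token)}($|\s)')
    df[token] = df['Text'].apply(lambda x: find_token(x, re_pattern))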

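The regular expression matches the token only as a whole word, so the token must be delimited on each side by whitespace or by the start/end of the sentence.
(^|\s) matches the start of the string or a whitespace character.
($|\s) matches the end of the string or a whitespace character.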
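
To see what the two boundary groups do on individual strings, here is a minimal sketch. The sample sentences are made up for illustration and are not part of the question's data. Note that this pattern shape treats only whitespace as a delimiter, so a token followed directly by punctuation is not matched; a `\b` word boundary may be a better fit if your sentences contain punctuation.

```python
import re

token = "the"
# Same pattern shape as in the answer: the token must be delimited by
# whitespace or by the start/end of the string.
pattern = re.compile(fr'(^|\s){re.escape(token)}($|\s)')

# Hypothetical sample sentences, chosen to exercise each case.
sentences = [
    "This is the first sentence",  # 'the' as a whole word
    "Another thing entirely",      # 'the' only inside 'Another'
    "Look at the, comma",          # 'the' followed by punctuation
]

matches = [bool(pattern.search(s.lower())) for s in sentences]
print(matches)  # [True, False, False]
```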

Because 'th' never appears as a whole word in any sentence, the th column is None for every row.

With tokens = ['first', 'second', 'th'], the result is as follows:

  Author Title                                               Text  \
0  Smith   ABC  [This is the first sentence, This is the secon...   
1  Green   XYZ         [Also a sentence, And the second sentence]   

                          first                         second    th  
0  [This is the first sentence]  [This is the second sentence]  None  
1                          None      [And the second sentence]  None  
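
For the single-word case from the question (sentences containing "the"), a plain list comprehension inside apply may be enough. This is a sketch under a few assumptions rather than the only way to do it: the new column name 'the' is just an illustrative choice, `\b` word boundaries are swapped in for the whitespace groups above, and rows without a match get an empty list instead of None.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Author': ['Smith', 'Green'],
    'Title': ['ABC', 'XYZ'],
    'Text': [
        ["This is the first sentence", "This is the second sentence"],
        ["Also a sentence", "And the second sentence"],
    ],
})

# \b marks a word boundary, so 'the' inside 'other' is not matched;
# IGNORECASE also catches a sentence-initial 'The'.
the_pattern = re.compile(r'\bthe\b', flags=re.IGNORECASE)

# One list comprehension per row: keep the sentences that contain the word.
df['the'] = df['Text'].apply(
    lambda sentences: [s for s in sentences if the_pattern.search(s)]
)
```

Here row 0 keeps both of its sentences and row 1 keeps only "And the second sentence".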
Kail9974