I am writing a Python program that reads through a given .txt file and looks for keywords. Once I have found my keyword (for example, 'data'), I would like to print the entire sentence that contains it.

I have read in my input file and used the split() method to get rid of spaces, tabs and newlines and put all the words into a list.

Here is the code I have thus far.

text_file = open("file.txt", "r")
lines = text_file.read().split()
keyword = 'data'

for token in lines:
    if token == keyword:
        # I have found my keyword. What methods can I use to
        # print out the words before and after the keyword?
        # I have a feeling I want to use '.' as a marker for sentences.
        print(sentence)  # prints the entire sentence

file.txt reads as follows:

Welcome to SOF! This website securely stores data for the user.

Desired output:

This website securely stores data for the user.
Giorgos Myrianthous

4 Answers


We can split the text on the characters that mark sentence endings, then loop through the resulting sentences and print those that contain our keyword.

To split text on multiple characters (a sentence can end with !, ? or .), we can use a regex:

import re

keyword = "data"
line_end_chars = "!", "?", "."
example = "Welcome to SOF! This website securely stores data for the user?"
regexPattern = '|'.join(map(re.escape, line_end_chars))
line_list = re.split(regexPattern, example)

# line_list looks like this:
# ['Welcome to SOF', ' This website securely stores data for the user', '']

# Now we just need to see which lines have our keyword
for line in line_list:
    if keyword in line:
        print(line)

But keep in mind that if keyword in line: matches a sequence of characters, not necessarily a whole word. For example, 'data' in 'datamine' is True. If you only want to match whole words, you ought to use regular expressions: source explanation with example
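As a minimal sketch of whole-word matching, the \b word-boundary anchor from the re module can wrap the keyword. The sample sentences and the name whole_word are made up for illustration, and the match is case-sensitive as written.

```python
import re

keyword = "data"
# Hypothetical sample sentences, not taken from file.txt
sentences = ["We store data here", "Visit the datamine", "Data matters"]

# \b marks a word boundary; re.escape guards against special
# characters that a keyword might contain
whole_word = re.compile(r"\b" + re.escape(keyword) + r"\b")

matches = [s for s in sentences if whole_word.search(s)]
print(matches)  # ['We store data here']
```

Passing re.IGNORECASE to re.compile would also accept "Data", if that is wanted.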

Source for regex delimiters

BrainDead

My approach is similar to Alberto Poljak's, but a little more explicit.

The motivation is to realise that splitting on words is unnecessary: Python's in operator will happily find a word in a sentence. What is necessary is the splitting of sentences. Unfortunately, sentences can end with ., ? or !, and Python's str.split method accepts only a single separator. So we have to get a little complicated and use re.

re requires us to put a | between each delimiter and escape some of them, because both . and ? have special meanings by default. Alberto's solution used re itself to do all this, which is definitely the way to go. But if you're new to re, my hard-coded version might be clearer.
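To make the comparison concrete, here is a small sketch showing that the generated pattern and the hard-coded one split a string the same way. The sample text and variable names are assumptions for illustration. A character class such as [.!?] is a third spelling that should behave the same, since these characters need no escaping inside square brackets.

```python
import re

# Hypothetical sample text
text = "Welcome to SOF! Is data stored? Yes."

# Pattern built programmatically: re.escape handles the special characters
built = "|".join(map(re.escape, ["!", "?", "."]))

# Hand-written equivalents
hard_coded = r"\.|!|\?"
char_class = r"[.!?]"

results = [re.split(p, text) for p in (built, hard_coded, char_class)]
print(results[0])  # ['Welcome to SOF', ' Is data stored', ' Yes', '']
print(results[0] == results[1] == results[2])  # True
```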

The other addition I made was to put each sentence's trailing delimiter back on the sentence it belongs to. To do this I wrapped the delimiters in (), which captures them in the output. I then used zip to put them back on the sentence they came from. The 0::2 and 1::2 slices will take every even index (the sentences) and concatenate them with every odd index (the delimiters). Uncomment the print statement to see what's happening.

import re

lines = "Welcome to SOF! This website securely stores data for the user. Another sentence."
keyword = "data"

# Raw string so that \. and \? reach the regex engine unchanged
sentences = re.split(r'(\.|!|\?)', lines)

sentences_terminated = [a + b for a,b in zip(sentences[0::2], sentences[1::2])]

# print(sentences_terminated)

for sentence in sentences_terminated:
    if keyword in sentence:
        print(sentence)
        break

Output:

 This website securely stores data for the user.
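The slicing step can be traced on a tiny string. This is only an illustrative sketch: the sample text and the extra .strip() call, which removes the leading space visible in the output above, are assumptions and not part of the original answer.

```python
import re

text = "A! B. C?"  # hypothetical sample text

# The capturing group keeps each delimiter in the result
parts = re.split(r"(\.|!|\?)", text)
print(parts)  # ['A', '!', ' B', '.', ' C', '?', '']

bodies = parts[0::2]      # even indices: ['A', ' B', ' C', '']
delimiters = parts[1::2]  # odd indices:  ['!', '.', '?']

# zip stops at the shorter list, so the trailing '' is dropped
sentences = [(body + delim).strip() for body, delim in zip(bodies, delimiters)]
print(sentences)  # ['A!', 'B.', 'C?']
```

One consequence of the zip: any trailing text that lacks a closing delimiter is silently discarded.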
Heath Raftery

This solution uses a fairly simple regex in order to find your keyword in a sentence, with words that may or may not be before and after it, and a final period character. It works well with spaces and it's only one execution of re.search().

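import re

# Read the whole file; the with block closes it automatically
with open("file.txt", "r") as text_file:
    text = text_file.read()

keyword = 'data'

# Raw strings keep the backslashes intact; re.escape protects special
# characters in the keyword, and \. matches a literal final period
pattern = r"\s?(\w+\s)*" + re.escape(keyword) + r"\s?(\w+\s?)*\."
match = re.search(pattern, text)
if match:  # re.search returns None when nothing matches
    print(match.group().strip())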
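A possible variation, sketched under assumptions: the helper name find_sentence is hypothetical, and the final character is widened from a period to any of ., ! or ?. It works on an in-memory string and returns None when the keyword is absent. Like the pattern above, it only handles sentences made of word characters and whitespace, so commas or apostrophes inside a sentence would break the match.

```python
import re

def find_sentence(text, keyword):
    """Return the first sentence containing keyword, or None if there is none."""
    pattern = r"\s?(\w+\s)*" + re.escape(keyword) + r"\s?(\w+\s?)*[.!?]"
    match = re.search(pattern, text)
    return match.group().strip() if match else None

sample = "Welcome to SOF! This website securely stores data for the user."
print(find_sentence(sample, "data"))
# This website securely stores data for the user.
print(find_sentence("Is the data safe? Yes.", "data"))
# Is the data safe?
print(find_sentence(sample, "missing"))
# None
```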

Another Solution:

def check_for_stop_punctuation(token):
    # True if the token contains sentence-ending punctuation
    return any(mark in token for mark in ('.', '?', '!'))

with open("file.txt", "r") as text_file:
    lines = text_file.read().split()
keyword = 'data'

sentence = []

i = 0
while i < len(lines):
    token = lines[i]
    sentence.append(token)
    if token == keyword:
        # Keep collecting tokens until the sentence ends (or the file does)
        found_stop_punctuation = check_for_stop_punctuation(token)
        while not found_stop_punctuation and i + 1 < len(lines):
            i += 1
            token = lines[i]
            sentence.append(token)
            found_stop_punctuation = check_for_stop_punctuation(token)
        # Join the tokens so a sentence is printed rather than a list
        print(" ".join(sentence))
        sentence = []
    elif check_for_stop_punctuation(token):
        # The sentence ended without the keyword; start a new one
        sentence = []
    i += 1
Will Lacey