-3

Any idea how to read a text file one sentence at a time, instead of one line at a time?

The general idea would be to read ahead and, when the end of a sentence is detected, return that sentence.

Now here comes the tricky part: the end of a sentence (EOS) is normally a "dot", but not always.
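For example, a naive rule like "a dot followed by whitespace ends a sentence" misfires on abbreviations (the text and regex below are just an illustration):

```python
import re

text = "Dr. Smith paid $3.50 for coffee. Then he left."

# Naive rule: a dot followed by whitespace ends a sentence.
naive = re.split(r'\.\s+', text)
print(naive)
# "Dr." is wrongly treated as a sentence end:
# ['Dr', 'Smith paid $3.50 for coffee', 'Then he left.']
```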

Tools like spaCy can detect sentence boundaries, but they expect the whole document to be available.

If the logic is to be hidden behind a generator/iterator, the code would look like ...

   with SentenceFile.open(....) as sf:
        for sent in sf.next_sentence():
            .....
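A minimal sketch of such a streaming generator, using a naive regex as a placeholder for a real boundary detector (the `eos` pattern and the buffering scheme are illustrative assumptions, not a finished solution):

```python
import io
import re

def sentences(fileobj, eos=re.compile(r'(?<=[.!?])\s+')):
    """Yield sentences one at a time while reading line by line.

    Lines accumulate in a buffer; once an end-of-sentence pattern is
    seen, the complete sentences are yielded and the trailing fragment
    is carried over to the next line. The regex is a naive stand-in --
    a real detector (spaCy, nltk) would slot in here.
    """
    buf = ''
    for line in fileobj:
        buf += line.replace('\n', ' ')
        parts = eos.split(buf)
        buf = parts.pop()          # last piece may be incomplete
        for sent in parts:
            yield sent.strip()
    if buf.strip():                # flush the final fragment
        yield buf.strip()

# usage with an in-memory file standing in for a real one
f = io.StringIO("First sentence\nspans two lines. Second one? Third!")
print(list(sentences(f)))
# ['First sentence spans two lines.', 'Second one?', 'Third!']
```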
sten
  • 7,028
  • 9
  • 41
  • 63
  • Check https://stackoverflow.com/questions/27209278/reading-sentences-from-a-text-file-and-appending-into-a-list-with-python-3?rq=1 and https://stackoverflow.com/questions/20719247/open-file-and-read-sentence – CCBet Sep 05 '20 at 17:35

2 Answers

1

This seems like a job for nltk: it can tokenize text into sentences, which you can then loop over.

import nltk
# nltk.download('punkt')  # one-time download of the sentence tokenizer

with open("text.txt", "r") as f:
    text = f.read()

for sentence in nltk.sent_tokenize(text):
    print(sentence)  # do what you want with this sentence

0

I ended up using chained buffers: a line buffer feeds a sentence buffer, and the iterator consumes the sentence buffer. I use spaCy to detect the sentence boundaries. The bad part is that the carried-over fragment gets parsed twice...

import re
import spacy
from collections import deque
from spacy.lang.en.stop_words import STOP_WORDS



class Corpus(object):

    nlp = spacy.load("en_core_web_sm")

    def __init__(self, fname, kind='sentence', lemmas=True, stop_words=True, filter_punct=True):

        self.fname = fname
        self.file = open(fname,'r')

        self.kind = kind
        self.filter_punct = filter_punct
        self.stop_words = stop_words
        self.lemmas = lemmas

        self.last_sent = ''
        self.sents_buf = deque()
        self.line_buf = deque()
        self.closed = False


    def __iter__(self): return self

    def __next__(self):

        if self.kind == 'line' :
            if len(self.line_buf) == 0 and self.buffer_lines() is False : raise StopIteration
            else : return self.line_buf.popleft()

        if self.kind == 'sentence' :
            if len(self.sents_buf) == 0 and self.buffer_sents() is False : raise StopIteration
            else : return self.sents_buf.popleft()

        raise ValueError('unknown kind: %r' % self.kind)

    def buffer_lines(self):
        if self.closed : return False
        i = 0
        while i < 10 :
            i += 1
            line = self.file.readline()
            if line == '' :
                self.file.close()
                self.closed = True
                self.line_buf.append('*') # add end marker, so last line is processed
                print('> closing file ....')
                return True
            self.line_buf.append(line)
        return True
        

    def buffer_sents(self):
        i = 0
        while i < 10 :
            i += 1
            full = True
            if len(self.line_buf) == 0 :    full = self.buffer_lines()

            if full :
                line = self.line_buf.popleft()
                if not re.match(r'^\s*$', line) :
                    txt = self.last_sent + line  # prepend the carried-over fragment
                    sents = [ s.text for s in Corpus.nlp(txt).sents ]
                    self.last_sent = sents.pop()  # pull the last, possibly incomplete, sentence to prepend later
                    self.sents_buf.extend(sents)
            else : 
                if len(self.sents_buf) == 0 : return False

        return True