
Just a follow-up on the code provided by TennisVisuals in this discussion: Python split text on sentences. I tried to parse the following paragraph into two sentences, but the code (see the linked discussion) did not work. I was wondering if somebody else can reproduce the error.

The error I get is that the parser returns a list of sentences of length 1 for the paragraph, as if the period were not recognized as a sentence delimiter.

TwoSentencesParagraph = "The Minister must prepare an annual report on the implementation of specific programs. The report is included in the annual management report of the Ministere de l’Emploi et de la Solidarite sociale."

The code is provided in the discussion Python split text on sentences.

It contains these lines (among several others):

def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences
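
For reference, here is a minimal way to drive that function, with a simplified, hypothetical stand-in for the find_sentence_end helper (the real helper is defined in the linked answer and is more elaborate; this stand-in only looks for a period followed by a space). Run together with the find_sentences function above, it shows the expected two-sentence result:

def find_sentence_end(paragraph):
    # Hypothetical stand-in for the helper from the linked answer:
    # return the index just past the last internal ". " boundary,
    # or -1 when no boundary is left.
    pos = paragraph.rstrip().rfind('. ')
    return pos + 2 if pos > -1 else -1

TwoSentencesParagraph = ("The Minister must prepare an annual report on the "
                         "implementation of specific programs. The report is "
                         "included in the annual management report of the "
                         "Ministere de l’Emploi et de la Solidarite sociale.")

sentences = find_sentences(TwoSentencesParagraph)
print(len(sentences))   # expected: 2
for sentence in sentences:
    print(sentence)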
E.Poirier
    When providing a [MCVE], the code must be included in the body of the question as text so people don't have to click all over to figure out what you're talking about. – ShadowRanger May 14 '19 at 22:50
  • @DGreenberg's solution is working fine – sahasrara62 May 14 '19 at 23:10
  • Sorry for the missing code. I copied it, but it was refused because it was not formatted correctly, it seems. The code I am referring to is the code suggested by TennisVisuals, which doesn't rely on regex. – E.Poirier May 15 '19 at 02:16
  • The code has these lines: def find_sentences(paragraph): end = True sentences = [] while end > -1: end = find_sentence_end(paragraph) if end > -1: sentences.append(paragraph[end:].strip()) paragraph = paragraph[:end] sentences.append(paragraph) sentences.reverse() return sentences – E.Poirier May 15 '19 at 02:22

1 Answer


You did not put your code in the question, but I tried your input on the accepted answer at the link (I am assuming that is the code you used). I found that I had to add a line of code and a set of parentheses to get it to run, but from your question it sounded like the program ran and still failed. When I ran it, it succeeded.

The code listed in the answer:

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print '\n-----\n'.join(tokenizer.tokenize(data))

The code I ran which succeeded:

import nltk.data
nltk.download('punkt')

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print ('\n-----\n'.join(tokenizer.tokenize(data)))

The program's output:

The Minister must prepare an annual report on the implementation of specific 
programs.
-----
The report is included in the annual management report of the Ministere de l’Emploi 
et de la Solidarite sociale.

I would like to mention that for this code, the input must be in a .txt file (test.txt above), and the output goes to the console.
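
As a side note, the same tokenizer can also be applied to the paragraph as a plain string, skipping the test.txt step. A minimal sketch, assuming the punkt models have already been downloaded as above:

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# The paragraph as an in-memory string instead of reading it from test.txt
text = ("The Minister must prepare an annual report on the implementation "
        "of specific programs. The report is included in the annual management "
        "report of the Ministere de l’Emploi et de la Solidarite sociale.")

# tokenize() accepts any string, so no input file is required
print('\n-----\n'.join(tokenizer.tokenize(text)))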

If I have missed something or any of my assumptions are wrong, please let me know so I can try to fix it. Adding more information to your question and relying less on links will probably help you get more accurate and relevant answers. For example, there are many ways a program can fail, so an explanation and/or a sample output and expected output can go a long way.

  • Sorry, like I said on the previous answer, I tried to copy and paste the code but it did not work out because of incorrect formatting. I am referring to the code of TennisVisuals, which does not rely on regex expressions; I did not refer to the nltk code. – E.Poirier May 15 '19 at 02:17
  • the code starts with this line (apart from the lists at the beginning): def find_sentences(paragraph): end = True sentences = [] while end > -1: end = find_sentence_end(paragraph) if end > -1: sentences.append(paragraph[end:].strip()) paragraph = paragraph[:end] sentences.append(paragraph) sentences.reverse() return sentences – E.Poirier May 15 '19 at 02:18