I have a large text file with long lines and want to extract the text between two marker phrases. The file follows a standard format:

First Paragraph
(Empty line)
Random lines are in this file with different words. Some
random lines are in this file with some different words. Some words here
random lines are in this file with various different words. Many words
random lines are in this file with plenty of different words
(Empty line)    
Second Paragraph
(Empty line)

I am looking for the text between First Paragraph and Second Paragraph. I have tried a couple of approaches using spaCy but have been unable to get what I want.

#Approach 1:  This approach doesn't return anything

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

pattern = [{'LOWER': 'first'}, {'LOWER': 'second'}]

matcher.add("FindParagrpah", None, pattern)

doc = nlp(random_text_from_file_in_String_format)

matches = matcher(doc)
for match_id, start, end in matches:
  string_id = nlp.vocab.strings[match_id]  # Get string representation
  span = doc[start:end]  # The matched span
  print(match_id, string_id, start, end, span.text)

#Approach 2: This returns the whole text instead of expected text between First and Second words.

import spacy

nlp = spacy.load("en_core_web_sm")

def custom_boundary(docx):
    for token in docx[:-1]:
        if token.text == 'Second':
            docx[token.i+1].is_sent_start = True
    return docx

nlp.add_pipe(custom_boundary,before='parser')

mysentence= nlp(text)

for sentence in mysentence.sents:
 print(sentence)

What am I doing wrong? Should I be using some other library? Any help would be appreciated. Thank you.

user3044240
  • Does this https://stackoverflow.com/questions/3368969/find-string-between-two-substrings answer your question? – Jai Mar 13 '20 at 04:50
  • Tried https://github.com/alvations/lazyme, `per_section` function might help. – alvas Mar 13 '20 at 11:01

1 Answer

If the sections in the text are separated by an empty line, you can split on it with a regex:

import re

DATA = """First Paragraph

Random lines are in this file with different words. Some
random lines are in this file with some different words. Some words here
random lines are in this file with various different words. Many words
random lines are in this file with plenty of different words

Second Paragraph
"""

paragraph = re.split(r"\n\n", DATA)
print(paragraph[1])

paragraph is a list containing 3 elements.

If you print the second element, paragraph[1], the output will be the following (the line breaks inside the block are preserved):

Random lines are in this file with different words. Some
random lines are in this file with some different words. Some words here
random lines are in this file with various different words. Many words
random lines are in this file with plenty of different words
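
Splitting on blank lines only works when the wanted block is always the second chunk. If the goal is specifically "everything between two marker phrases", a non-greedy regex search between the markers is one alternative. The sketch below is a minimal illustration, not the only way to do it: the helper name text_between and the shortened sample string are made up for this example, and it assumes both markers appear literally in the text, in that order.

```python
import re


def text_between(text, start_marker, end_marker):
    """Return the text between two literal markers, or None if they are not found.

    Hypothetical helper for illustration; the markers are treated as plain
    strings (escaped), not as regex patterns.
    """
    # re.DOTALL lets "." match newlines; ".*?" stops at the first end marker.
    pattern = re.escape(start_marker) + r"(.*?)" + re.escape(end_marker)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


# Shortened sample in the same layout as the file described in the question.
sample = (
    "First Paragraph\n"
    "\n"
    "Random lines are in this file with different words. Some\n"
    "random lines are in this file with plenty of different words\n"
    "\n"
    "Second Paragraph\n"
)

body = text_between(sample, "First Paragraph", "Second Paragraph")
print(body)
```

This also hints at why the Matcher attempt in the question returns nothing: that pattern asks for the token "first" immediately followed by the token "second", and those two tokens are never adjacent in the text. For a plain "between two phrases" task, the standard library is likely enough and spaCy is not required.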