
I've looked at the thread "Regex to find all sentences of text?" but can't get it to handle my exact scenario. Here's the code and text I'm working with:


import regex as re

sentence = re.compile(r"[A-Z].*?[\.!?] ", re.MULTILINE | re.DOTALL)

phrase = """For necessary expenses of the Office of Inspector 
General, including employment pursuant to the Inspector 
General Act of 1978 (Public Law 95–452; 5 U.S.C. App.), 
$99,912,000, including such sums as may be necessary for 
contracting and other arrangements with public agencies 
and private persons pursuant to section 6(a)(9) of the Inspector General Act of 1978 (Public Law 95–452; 5 
U.S.C. App.), and including not to exceed $125,000 for 
certain confidential operational expenses, including the 
payment of informants, to be expended under the direction 
of the Inspector General pursuant to the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. App.) and 
section 1337 of the Agriculture and Food Act of 1981. For necessary expenses of the Office of the General 
23 Counsel, $45,390,000."""

phrase = phrase.replace("\n", "")

sentence.findall(phrase)

# outputs:
['For necessary expenses of the Office of Inspector General, including employment pursuant to the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. ',
 'App.), $99,912,000, including such sums as may be necessary for contracting and other arrangements with public agencies and private persons pursuant to section 6(a)(9) of the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. ',
 'App.), and including not to exceed $125,000 for certain confidential operational expenses, including the payment of informants, to be expended under the direction of the Inspector General pursuant to the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. ',
 'App.) and section 1337 of the Agriculture and Food Act of 1981. ']

In this case, there are only 2 actual sentences in this long phrase. The first is:

For necessary expenses of the Office of Inspector General, including employment pursuant to the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. App.), $99,912,000, including such sums as may be necessary for contracting and other arrangements with public agencies and private persons pursuant to section 6(a)(9) of the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. App.), and including not to exceed $125,000 for certain confidential operational expenses, including the payment of informants, to be expended under the direction of the Inspector General pursuant to the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. App.) and section 1337 of the Agriculture and Food Act of 1981.

And the second is:

For necessary expenses of the Office of the General 23 Counsel, $45,390,000.

Is there a way, through regex or other means, to extract what I want? The end goal is to extract all of the full sentences and then search them for certain things (in case that makes a difference to the solution).

Joshua Terrill
  • Not sure if a single regex can do that, since this is more of an NLP task. That said, have a look at [this SO answer](https://stackoverflow.com/a/65507581/14739759) I posted a few days ago; it might get you started. – anurag Jan 18 '21 at 06:31
  • You could try removing all the parentheses and their contents, then splitting on a regex that matches period + space + uppercase letter. – frab Jan 18 '21 at 06:34
  • You basically can't split these sentences in a single pass. In this case, I would start by splitting on `. ` (dot and space), `! `, and `? `, then handle specific cases like `Dr.` and `Ms.`. – Tấn Nguyên Jan 18 '21 at 06:39
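
The split-then-fix-up idea from the last comment could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `ABBREVIATIONS` and `split_sentences` are hypothetical, the abbreviation list is far from complete, and the sample text is a shortened stand-in for the text in the question.

```python
import re

# Hypothetical, deliberately short list; extend it for real data.
ABBREVIATIONS = ("U.S.C.", "Dr.", "Ms.", "Mr.")

def split_sentences(text):
    """Naive split on sentence punctuation, then re-join pieces cut at an abbreviation."""
    # Pass 1: split on whitespace that directly follows ., ! or ?
    pieces = re.split(r"(?<=[.!?])\s+", text)

    sentences = []
    carry = ""  # text held back because it ended in an abbreviation
    for piece in pieces:
        # Re-join with a single space (this normalizes the original whitespace)
        current = f"{carry} {piece}" if carry else piece
        # Pass 2: a piece ending in a known abbreviation is not a real sentence end
        if current.endswith(ABBREVIATIONS):
            carry = current
        else:
            sentences.append(current)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences

# Shortened stand-in for the question's text
sample = (
    "Funds under the Act (5 U.S.C. App.), $99,912,000. "
    "For the Office of the General Counsel, $45,390,000."
)
result = split_sentences(sample)
print(len(result))  # 2
```

The abbreviation list is the weak point of this approach: any abbreviation that is missing from it will still end a sentence early.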

2 Answers


Try this:

import re

regex = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"
re.split(regex, phrase)
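
The pattern can be read piece by piece. Below is a small, self-contained sketch of the same split, with each lookbehind annotated, run on a shortened version of the question's text. The name `SENTENCE_BREAK` and the sample string are illustrative assumptions, not part of the original answer.

```python
import re

# Same pattern as above, split across raw-string pieces so each part can be annotated.
SENTENCE_BREAK = re.compile(
    r"(?<!\w\.\w.)"       # not right after something shaped like "S.C." (word char, dot, word char, any char)
    r"(?<![A-Z][a-z]\.)"  # not right after a short title such as "Dr." or "Ms."
    r"(?<=\.|\?)"         # but directly after a period or a question mark
    r"\s"                 # the single whitespace character consumed by the split
)

# Shortened, hypothetical sample modelled on the question's text
sample = (
    "For expenses under the Act of 1978 (5 U.S.C. App.), $99,912,000. "
    "For necessary expenses of the Office of the General Counsel, $45,390,000."
)

sentences = SENTENCE_BREAK.split(sample)
for sentence in sentences:
    print(sentence)
```

On this sample the split yields two sentences, and `U.S.C. App.)` stays in one piece because the whitespace after `U.S.C.` is rejected by the first lookbehind. Note that the pattern as written only splits after `.` and `?`; a sentence ending in `!` would not be split.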
Mitchell Olislagers
import re

# s is the original multi-line text (called phrase in the question)
print([x for x in re.split(r"([A-Z].+(\(.+\)){0,1}.+)\.\s", s.replace("\n", " ")) if x])

Output:

['For necessary expenses of the Office of Inspector  General, including employment pursuant to the Inspector  General Act of 1978 (Public Law 95–452; 5 U.S.C. App.),  $99,912,000, including such sums as may be necessary for  contracting and other arrangements with public agencies  and private persons pursuant to section 6(a)(9) of the Inspector General Act of 1978 (Public Law 95–452; 5  U.S.C. App.), and including not to exceed $125,000 for  certain confidential operational expenses, including the  payment of informants, to be expended under the direction  of the Inspector General pursuant to the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. App.) and  section 1337 of the Agriculture and Food Act of 1981', 'For necessary expenses of the Office of the General  23 Counsel, $45,390,000.']

The regex is:

regex = r"([A-Z].+(\(.+\)){0,1}.+)\.\s"

re.split(r"([A-Z].+(\(.+\)){0,1}.+)\.\s",s.replace("\n"," "))
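
As a hedged illustration of why the list comprehension filters with `if x`: when the pattern passed to `re.split` contains capturing groups, the captured text is included in the result, and a group that did not take part in the match shows up as `None`. The sample string below is a made-up, shortened stand-in for the question's text.

```python
import re

pattern = r"([A-Z].+(\(.+\)){0,1}.+)\.\s"

# Hypothetical two-sentence sample
sample = "First clause (5 U.S.C. App.) ends here. Second one, $45,390,000."

raw = re.split(pattern, sample)
# raw holds: the (empty) text before the match, group 1, group 2, and the remainder
print(raw)

# Dropping the empty string and the None leaves just the two sentences
cleaned = [x for x in raw if x]
print(cleaned)
```

Because the `\.\s` that ends the match sits outside group 1, the first sentence comes back without its final period, which matches the output shown above.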
Synthase