
For this example text, what would be the right regular expression for splitting it into sentences? The problem is that the sentences do not end with a dot, so you can only guess where a sentence starts by finding a capital letter and taking context into consideration. The desired output would be ['PATIENT CHARACTERISTICS', 'Age: 70 and under', 'Menopausal status: Not specified', ...]

"PATIENT CHARACTERISTICS: Age: 70 and under Menopausal status: Not specified Performance status: Not specified Life expectancy: Not specified Hematopoietic: Absolute neutrophil count at least 1,500/mm3 Platelet count at least 100,000/mm3 No CNS involvement Biologic therapy: See Disease At least 1 week since prior aspirin or anticoagulants except low dose anticoagulation to prevent catheter thrombosis"

My idea was to insert a dot before each capital letter with re.sub and then split the text into sentences on that dot with text = text.split('.'). However, there are cases where that does not work, e.g. in the part "No CNS...": the sentence would be split into ['No', 'CNS...'], which is not the desired output. I could 'brute force' it and make a list of starting words, but that is not an optimal solution, since there is a lot of text to be examined that way (this is only a very small part of it). If you have any suggestions, please help.
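
A minimal sketch of that idea, to make the failure concrete. The exact pattern is an assumption (the question does not give one): it replaces each space that precedes a capital letter with a dot, and `naive_split` is a hypothetical helper name.

```python
import re


def naive_split(text):
    """Put a dot in front of every capitalized word, then split on dots."""
    # Assumed pattern: a space followed by an uppercase ASCII letter.
    dotted = re.sub(r" (?=[A-Z])", ".", text)
    return dotted.split(".")


sample = "Platelet count at least 100,000/mm3 No CNS involvement"
print(naive_split(sample))
# ['Platelet count at least 100,000/mm3', 'No', 'CNS involvement']
```

"No" and "CNS involvement" end up as separate items, because every capital letter is treated as a sentence start.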

Wiktor Stribiżew
Ayperos23
  • I have not seen this use case in regex; perhaps don't limit your solution to that library, and generalize your question to splitting with any Python tool available. – RichieV Sep 13 '20 at 17:57
  • Any particular reason you want to use regex? There are several great libraries with sentence splitting built in for Python, such as [spaCy](https://spacy.io/). Sentence splitting is surprisingly difficult, and it looks like your data is going to have some extra difficulties. It would probably be a good idea to start with an existing library and continue to process its output instead of starting from scratch. – dantiston Sep 13 '20 at 18:49

1 Answer


You would probably need to start by creating a dictionary of probable start and/or stop words. You need to identify the patterns those words form in your text; only then can a regex help.
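
As a rough sketch of this approach, one could split on whitespace that is followed by a known start phrase. The phrase list below is hypothetical, hand-picked from the example text in the question, and `split_items` is an illustrative name; real data would need a much larger list that you maintain yourself.

```python
import re

# Hypothetical start phrases, taken from the example text in the question.
START_PHRASES = [
    "Age:",
    "Menopausal status:",
    "Performance status:",
    "Life expectancy:",
    "Hematopoietic:",
    "Platelet count",
    "No CNS",
    "Biologic therapy:",
    "At least 1 week",
]

# Try longer phrases first so that a phrase is never shadowed by a shorter prefix.
_alternation = "|".join(
    re.escape(phrase) for phrase in sorted(START_PHRASES, key=len, reverse=True)
)

# Match the whitespace in front of a start phrase; the lookahead keeps the phrase itself.
_SPLITTER = re.compile(rf"\s+(?={_alternation})")


def split_items(text):
    """Split text before each known start phrase and drop trailing colons."""
    return [part.strip().rstrip(":") for part in _SPLITTER.split(text) if part.strip()]


sample = (
    "PATIENT CHARACTERISTICS: Age: 70 and under "
    "Menopausal status: Not specified "
    "Platelet count at least 100,000/mm3 No CNS involvement"
)
for item in split_items(sample):
    print(item)
```

The match is case-sensitive on purpose, so "at least 1,500/mm3" inside a sentence is not confused with the start phrase "At least 1 week".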

DS_