For this example text, what would be the right regular expression for splitting it into sentences? The problem is that the sentences do not end with a period, so the only way to guess where a sentence starts is to look for a capital letter and take the context into account. The desired output would be ['PATIENT CHARACTERISTICS', 'Age: 70 and under', 'Menopausal status: Not specified', ...]
"PATIENT CHARACTERISTICS: Age: 70 and under Menopausal status: Not specified Performance status: Not specified Life expectancy: Not specified Hematopoietic: Absolute neutrophil count at least 1,500/mm3 Platelet count at least 100,000/mm3 No CNS involvement Biologic therapy: See Disease At least 1 week since prior aspirin or anticoagulants except low dose anticoagulation to prevent catheter thrombosis"
My idea was to insert a period before each capital letter with re.sub and then split the text into sentences on that period with text = text.split('.'). However, there are cases where that would not work, e.g. the part "No CNS...": that sentence would be split into ['No', 'CNS...'], which is not the desired output. I could brute-force it and make a list of starting words, but that is not a good solution, since there is a lot of text to process this way (this is only a very small part of it). If you have any suggestions, please help.
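To make the capital-letter idea concrete, here is a minimal sketch that splits directly with re.split and lookarounds instead of inserting periods first. The pattern and the names SPLIT_RE and split_items are assumptions made up for illustration, not a tested solution for the full data set. It assumes that a new sentence starts at a capital letter followed by a lowercase letter (so "CNS" does not trigger a split), that whitespace right after a colon never starts a new sentence, and that a colon after an all-caps header does.

```python
import re

text = (
    "PATIENT CHARACTERISTICS: Age: 70 and under Menopausal status: Not specified "
    "Performance status: Not specified Life expectancy: Not specified "
    "Hematopoietic: Absolute neutrophil count at least 1,500/mm3 "
    "Platelet count at least 100,000/mm3 No CNS involvement "
    "Biologic therapy: See Disease At least 1 week since prior aspirin or "
    "anticoagulants except low dose anticoagulation to prevent catheter thrombosis"
)

# Two hypothetical split rules, joined with "|":
#  1. a colon right after two capital letters (an all-caps header such as
#     "PATIENT CHARACTERISTICS:"); the colon is consumed along with the whitespace
#  2. whitespace that is not preceded by a colon and is followed by a capital
#     letter plus a lowercase letter ("Age", "No", but not "CNS")
SPLIT_RE = re.compile(r"(?<=[A-Z]{2}):\s+|(?<![:\s])\s+(?=[A-Z][a-z])")


def split_items(text):
    """Split unpunctuated text into items using the heuristic rules above."""
    return [part.strip() for part in SPLIT_RE.split(text) if part.strip()]


items = split_items(text)
for item in items:
    print(item)
```

On the sample this keeps 'No CNS involvement' and 'Menopausal status: Not specified' in one piece, but it is still only a heuristic: a capitalized word in the middle of a sentence, such as "Disease" in "See Disease", still causes a split, so cases like that would need an extra rule or a small exception list.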