0

I have texts where the delimiter can be any character in the list [;,.?]

txt1 = "Kids of today have started selling drugs or taken drugs at this age, then we are finished as parent,what generation are we going to have when our generation is no more,am sick to my stomach, it means we do not have tomorrow leaders or future leader, drugs at this stage woowowow parent and Guidance's fasten your belt if not we will wake up someday to see what we never thought could happen"
txt2 = "There was a clear warning sign, and this person chose to take a risk regardless. It was quite a stupid decision to climb the fence, but even this is probably a common activity that generally never results in death. More of a freak accident than a definite way for someone to die. At the very most, the only changes that should be made by the airport / authorities would be to the fence design, making it more difficult for people to climb up. Barricading the area off completely and banning people from the area would be comparable to fencing off a scenic mountain path that hundreds of people like to climb and enjoy safely, but which does produce the occasional fatality when people slip. Just because this area carries a (clearly communicated) risk shouldn't be a reason for the authorities to step in and make adjustments. People take risks and are responsible for their own safety in areas like this. One fatality is a tiny drop in the bucket compared to the hundreds of people doing this each month without incident."

How can I break a multi-sentence text into independent sentences, depending on which delimiter is present? For example, in txt1 the delimiter should be ',' (comma), whereas in txt2 it should be '.' (dot).

I have used re.split() for this, but I am not getting the desired results. I used:

 print(re.split(';|,|.|?',txt1))
    Does this answer your question? [Regular expression to match a dot](https://stackoverflow.com/questions/13989640/regular-expression-to-match-a-dot) – sshashank124 Jan 14 '20 at 04:57

4 Answers

3

You have to add an escape character (\) in front of . and ?, and a raw string keeps Python from interpreting the backslashes itself:

print(re.split(r';|,|\.|\?', txt1))

To avoid empty strings in the result, filter them out with a list comprehension:

[x for x in re.split(r';|,|\.|\?', txt1) if x]
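As a rough illustration of the difference, the sketch below runs both the unescaped and the escaped pattern on a short made-up string (the `sample` value is an assumption, not the asker's data). The unescaped pattern does not even compile, because the trailing ? is a quantifier with nothing to repeat.

```python
import re

sample = "one;two,three.four?five"  # hypothetical sample text

# Unescaped: the last alternative is a bare '?', which is a quantifier
# with nothing before it to repeat, so re raises an error.
try:
    re.split(';|,|.|?', sample)
    compiled = True
except re.error:
    compiled = False

# Escaped: '.' and '?' are matched as literal characters.
parts = [x for x in re.split(r';|,|\.|\?', sample) if x]

print(compiled)  # False
print(parts)     # ['one', 'two', 'three', 'four', 'five']
```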
Shijith
1

Both dot and question mark are regex metacharacters, which means that these characters, when used unescaped, have a special meaning and do not match their literal values. One quick fix to your problem would be to split on a character class, inside which both characters are treated as literals:

print(re.split('[;,.?]', txt1))
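A small sketch, using a made-up `sample` string, suggesting that the character class splits the same way as the escaped alternation. Stripping the surrounding whitespace afterwards is an extra step added here for illustration, not part of the answer.

```python
import re

sample = "wait. what? yes, no; maybe"  # hypothetical sample text

# Escaped alternation and character class should split at the same places.
alternation = re.split(r';|,|\.|\?', sample)
char_class = re.split(r'[;,.?]', sample)
same = alternation == char_class

# Optional clean-up: trim whitespace and drop pieces that end up empty.
pieces = [p.strip() for p in char_class if p.strip()]

print(same)    # True
print(pieces)  # ['wait', 'what', 'yes', 'no', 'maybe']
```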
Tim Biegeleisen
0

Try this; the + after the character class treats a run of consecutive delimiters as a single split point:

import re
DATA = "sample, text"
print(re.split(r'[;,.?]+', DATA))
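To see what the + changes, here is a minimal comparison on a made-up string with repeated delimiters (`sample` is an assumption for illustration). Without the quantifier, every delimiter in a run produces its own split and leaves empty strings behind.

```python
import re

sample = "really?? yes... ok"  # hypothetical text with repeated delimiters

# One split per delimiter character: runs leave empty strings in the result.
single = re.split(r'[;,.?]', sample)

# One split per run of delimiter characters.
collapsed = re.split(r'[;,.?]+', sample)

print(single)     # ['really', '', ' yes', '', '', ' ok']
print(collapsed)  # ['really', ' yes', ' ok']
```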
    Please put your answer always in context instead of just pasting code. See [here](https://stackoverflow.com/help/how-to-answer) for more details. – gehbiszumeis Jan 14 '20 at 06:40
0

If you already have the list of delimiters, you can pass it to re.split() directly.

Build a character-class string of the form '[your delimiters]' from the list, for example with the delimiters from the question:

del_list = '[;,.?]'
print(re.split(del_list, txt1))
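One possible way to build that pattern from an actual Python list is sketched below. The helper name `build_splitter` and the sample strings are hypothetical; `re.escape` is used on the assumption that the list may contain characters such as ], ^ or - that are special inside a character class.

```python
import re

def build_splitter(delimiters):
    """Return a character-class pattern matching any one of the delimiters.

    Assumes every delimiter is a single character (hypothetical helper).
    """
    return '[' + ''.join(re.escape(d) for d in delimiters) + ']'

# Delimiters from the question.
pattern = build_splitter([';', ',', '.', '?'])
parts = re.split(pattern, "a;b,c.d?e")

# Characters that are special inside a class are handled by re.escape.
tricky = re.split(build_splitter(['-', ']', '^']), "a-b]c^d")

print(parts)   # ['a', 'b', 'c', 'd', 'e']
print(tricky)  # ['a', 'b', 'c', 'd']
```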
Prashant Kumar