-6

I need to match all the end-of-sentence symbols like !, ?, . (period), etc. in a given body of text.

Can anyone help me out with the regex for doing such a thing?

Example Input:

This is the f!!rst sentence! Is this the second one? The third sentence is here... And the fourth one!!

Output:

This is the f!!rst sentence Is this the second one The third sentence is here And the fourth one
Avinash Raj
user2306772
  • 3
    While asking any question, please keep in mind to give sample input and expected output. This will prevent answers solely based on assumptions. – TheLostMind Sep 23 '14 at 08:19
  • 1
    What have you investigated and tried already? – Pieter21 Sep 23 '14 at 08:19
  • People will help you if you ask a question that's useful to more than just you. Which is not the case here. – Mena Sep 23 '14 at 08:24
  • 1
    Sorry, I was in a bit of a hurry as I needed this for my university project. I am building a search engine prototype and need to tokenize each word in the data source, which contains 10,000+ news articles. One of the filtering tasks is to remove symbols like !, ?, and . that occur at the end of sentences as markers. – user2306772 Sep 23 '14 at 08:33

3 Answers

1
[!?.]+(?=$|\s)

Try this. You can add more markers to the character class as needed. Replace each match with an empty string.

See demo.

http://regex101.com/r/lS5tT3/15
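
As a rough illustration of "adding markers as needed", here is a minimal Java sketch that widens the character class to also cover commas, semicolons, and colons. The extra markers, the class name `ExtendedMarkerStripper`, and the method `strip` are assumptions made for this example, not part of the pattern above.

```java
public class ExtendedMarkerStripper {
    // Assumed extension of the pattern above: ',', ';' and ':' are added to the
    // character class. The backslash in \s is doubled for the Java string literal.
    static final String MARKERS = "[!?.,;:]+(?=$|\\s)";

    // Removes every run of markers that is followed by whitespace or the end of input.
    static String strip(String text) {
        return text.replaceAll(MARKERS, "");
    }

    public static void main(String[] args) {
        // The '.' inside 3.14 is followed by a digit, so the lookahead rejects it.
        System.out.println(strip("Well, version 3.14 is out; really?!"));
        // prints: Well version 3.14 is out really
    }
}
```

Because of the lookahead, punctuation inside a word or number is left alone; only runs that end a word are removed.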

vks
0

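You'd probably want to match anything (.*?) followed by an end-of-sentence character followed by whitespace (\s+). Note that ? and . are regex metacharacters outside a character class; inside one, such as [!?.], they are literal (and ! is never special), so escaping them there is optional.

For example: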

Pattern pattern = Pattern.compile("(.*?)[\\!\\?\\.]\\s+");
Matcher matcher = pattern.matcher("one two. three! four five? ");
while (matcher.find()) {
   System.out.println(matcher.group(1));
}

prints

one two
three
four five
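
One caveat with a pattern that requires trailing whitespace: a final sentence with nothing after its marker is never matched, and a run such as "..." leaves dots in the captured group. A minimal sketch of one possible variation follows, assuming it is acceptable to treat the end of input like whitespace and to consume a whole run of markers; the class name `SentenceSplitter` and the method `sentences` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentenceSplitter {
    // Lazily capture text up to a run of markers that is followed by
    // whitespace or by the end of the input.
    private static final Pattern SENTENCE =
            Pattern.compile("(.*?)[!?.]+(?:\\s+|$)");

    // Returns the captured sentence bodies, without their trailing markers.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // No trailing space after the last '?', and "..." is consumed as one run.
        System.out.println(sentences("one two... three! four five?"));
        // prints: [one two, three, four five]
    }
}
```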
lance-java
  • This answer isn't particularly helpful to anyone. The question itself isn't clear. – TheLostMind Sep 23 '14 at 08:31
  • Seriously? I think I've explained enough to get the brain thinking. Next time I won't bother. – lance-java Sep 23 '14 at 08:33
  • I probably screwed up by asking the question in a very casual manner. I was in a hurry and needed some quick help, and didn't bother elaborating much. Anyway, I have made an edit. Lance Java, your answer was certainly helpful, but I suppose that is not what I was looking for. – user2306772 Sep 23 '14 at 08:41
0

The regex below matches one or more characters that are neither word characters nor whitespace, but only when that run is followed by whitespace or the end of the string. replaceAll then removes every match.

String s = "Blah! blah? blah... blah blah!!";
System.out.println(s.replaceAll("[^\\w\\s]+(?=\\s|$)", ""));

Output:

Blah blah blah blah blah

If you want to remove only the ?, ., and ! characters at the end of a word, you could try the code below.

String s = "This is the f!!rst sentence! Is this the second one? The third sentence is here... And the fourth one!!";
System.out.println(s.replaceAll("[!?.]+(?=\\s|$)", ""));

Output:

This is the f!!rst sentence Is this the second one The third sentence is here And the fourth one
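
Since the asker mentions tokenizing 10,000+ articles, note that String.replaceAll compiles its regex on every call. A minimal sketch of one way to avoid that is to compile the pattern once and reuse it; the class name `ArticleTokenizer`, the method `tokenize`, and the choice to split on whitespace are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

public class ArticleTokenizer {
    // Compiled once and reused for every article.
    private static final Pattern TRAILING_MARKERS = Pattern.compile("[!?.]+(?=\\s|$)");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Strips trailing sentence markers, then splits the text on whitespace.
    static List<String> tokenize(String article) {
        String cleaned = TRAILING_MARKERS.matcher(article).replaceAll("").trim();
        if (cleaned.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(WHITESPACE.split(cleaned));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("This is the f!!rst sentence! Is this the second one?"));
        // prints: [This, is, the, f!!rst, sentence, Is, this, the, second, one]
    }
}
```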
Avinash Raj