33

So I'm completely new to regular expressions, and I'm trying to use Java's java.util.regex to find punctuation in input strings. I won't know ahead of time what kind of punctuation I might get, except that (1) !, ?, ., and ... are all valid punctuation, and (2) "<" and ">" mean something special and don't count as punctuation. The program itself builds phrases pseudo-randomly, and I want to strip off the punctuation at the end of a sentence before it goes through the random process.

I can match entire words with any punctuation, but the matcher just gives me indexes for that word. In other words:

Pattern p = Pattern.compile("(.*\\!)*?");
Matcher m = p.matcher([some input string]);

will grab any words with a "!" on the end. For example:

String inputString = "It is a warm Summer day!";
Pattern p = Pattern.compile("(.*\\!)*?");
Matcher m = p.matcher(inputString);
String match = inputString.substring(m.start(), m.end());

results in --> String match ~ "day!"

But I want to have Matcher index just the "!", so I can just split it off.

I could probably make cases, and use String.substring(...) for each kind of punctuation I might get, but I'm hoping there's some mistake in my use of regular expressions to do this.

Paulo Mattos
Mister R2

4 Answers

46

Java does support POSIX character classes in a roundabout way. For punctuation, the Java equivalent of [:punct:] is \p{Punct}.

Please see the following link for details.

Here is a concrete, working example that uses the expression in the comments:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexFindPunctuation {

    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\p{Punct}");

        Matcher m = p.matcher("One day! when I was walking. I found your pants? just kidding...");
        int count = 0;
        while (m.find()) {
            count++;
            System.out.println("\nMatch number: " + count);
            System.out.println("start() : " + m.start());
            System.out.println("end()   : " + m.end());
            System.out.println("group() : " + m.group());
        }
    }
}
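
Since the question treats "<" and ">" as non-punctuation and only cares about the end of a sentence, the same class can be narrowed. The following is a minimal sketch, not part of the original answer: it assumes Java's character-class intersection (`&&`) to subtract the two characters from `\p{Punct}`, and a `$` anchor to match only the trailing run. The class and method names (`TrailingPunctuation`, `strip`, `trailing`) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingPunctuation {

    // \p{Punct} minus '<' and '>', one or more times, anchored at the end of the input.
    private static final Pattern TRAILING =
            Pattern.compile("[\\p{Punct}&&[^<>]]+$");

    /** Returns the sentence without its trailing punctuation run. */
    static String strip(String sentence) {
        Matcher m = TRAILING.matcher(sentence);
        // find() must succeed before start() or end() may be called.
        return m.find() ? sentence.substring(0, m.start()) : sentence;
    }

    /** Returns the trailing punctuation run, or "" if there is none. */
    static String trailing(String sentence) {
        Matcher m = TRAILING.matcher(sentence);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(strip("It is a warm Summer day!"));  // It is a warm Summer day
        System.out.println(trailing("just kidding..."));         // ...
        System.out.println(strip("a <tag>"));                    // a <tag>
    }
}
```

Keeping the stripped run available through `trailing` means it can be re-attached after the phrase has been shuffled, if that is wanted.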
EdgeCase
  • I tried running Pattern.compile("\\p{Punct}") (following the double-escape mentioned in that link you gave), but it doesn't find any punctuation, either. Specifically, I ran the following code: String input = "One day! when I was walking. I found your pants? just kidding..."; Pattern p = Pattern.compile("\\p{Punct}"); Matcher m = p.matcher(input); – Mister R2 Jul 29 '12 at 01:51
  • 2
    Same issue as above, use `Matcher.find()`. Note that this is much better regarding (memory) performance than returning all matches at once. If you simply want to match a whole string you may as well write `"input".matches("pattern")` by the way. – Maarten Bodewes Jul 29 '12 at 09:19
27

I would try a character class regex similar to

"[.!?\\-]"

Add whatever characters you wish to match inside the []s. Be careful to escape any characters that might have a special meaning to the regex parser.

You then have to iterate through the matches by using Matcher.find() until it returns false.
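
As a rough sketch of that loop (the class name `PunctuationIndexes` and the helper `indexes` are hypothetical, chosen only for this illustration), each successful `find()` exposes the position of one punctuation character through `start()`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PunctuationIndexes {

    // The character class from above: '.', '!', '?' and an escaped '-'.
    private static final Pattern PUNCT = Pattern.compile("[.!?\\-]");

    /** Collects the start index of every character matched by the class. */
    static List<Integer> indexes(String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = PUNCT.matcher(input);
        while (m.find()) {          // find() returns false once no match remains
            result.add(m.start());  // each match is exactly one character long
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(indexes("One day! Really? Yes."));  // [7, 15, 20]
    }
}
```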

Maarten Bodewes
Code-Apprentice
  • 7
    Hint: [here](http://www.regular-expressions.info/charclass.html) you can read that *special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^) and the hyphen (-)*. The usual metacharacters are normal characters inside a character class. So `"[\\.\\!\\?]"` is same as `"[.!?]"` – Pshemo Jul 28 '12 at 22:29
  • @Pshemo Thanks, I wasn't exactly sure about that. Of course, it doesn't hurt to escape these characters anyway, does it? – Code-Apprentice Jul 28 '12 at 22:31
  • I hope it doesn't because I also used escape marks in my earlier projects :) – Pshemo Jul 28 '12 at 22:34
  • 4
    @Pshemo: You forgot to escape the backslash character in your comment though :) – Maarten Bodewes Jul 28 '12 at 23:03
  • 1
    @owlstead I saw that but it was too late for edit and creating new comment to correct it was pointless since context and link is enough to figure out what should be in () :D. – Pshemo Jul 28 '12 at 23:16
  • Hey. I implemented your suggestion like this: String input = "One day! when I was walking. I found your pants? just kidding..."; Pattern p = Pattern.compile("[\\.\\!\\?]"); Matcher m = p.matcher(input); When it ran, however, it found 0 matches. Am I wrong in thinking it will match any of those characters anywhere it's found in the input string the way it's written? – Mister R2 Jul 29 '12 at 01:08
  • 1
    The whole String does not match, so you have to use `Matcher.find()`, added this to the answer. The matching string is `group()` or `group(0)` and should contain a single punctuation character. – Maarten Bodewes Jul 29 '12 at 09:19
  • Ok, yeah I see it now. Thank you all -- very much. This helped me fix my problem. – Mister R2 Jul 30 '12 at 19:18
  • `[.?!:;\-()\[\]'"/,]` if you want to test for all punctuation. – Shota Sep 08 '20 at 11:14
  • Does Java not require escaping the period? – cliffclof Jan 26 '22 at 06:00
  • 1
    @cliffclof I assume not. This is more of a regex syntax issue than a Java one. Special characters inside `[]` are automatically escaped unless they have special meaning in that context. – Code-Apprentice Jan 26 '22 at 21:17
  • I did not know that one. – cliffclof Feb 06 '22 at 01:39
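
To make the two points from this comment thread concrete, here is a small sketch (the names `MatchesVersusFind`, `matchesWhole` and `countFinds` are invented for the example). It assumes the "0 matches" result came from testing the whole string rather than searching inside it, and it shows that the escaped and unescaped character classes behave the same:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVersusFind {

    /** True only if the regex matches the entire input. */
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    /** Counts non-overlapping matches found anywhere in the input. */
    static int countFinds(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "One day! when I was walking. I found your pants? just kidding...";
        System.out.println(matchesWhole("[.!?]", input));      // false
        System.out.println(countFinds("[.!?]", input));        // 6
        System.out.println(countFinds("[\\.\\!\\?]", input));  // 6
    }
}
```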
1

I would try

\W

it matches any non-word character. This includes spaces and punctuation, but not underscores. It’s equivalent to [^A-Za-z0-9_]
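
A quick way to see what `\W` does and does not cover is to test single characters against it. This is only an illustrative sketch; `NonWordDemo` and `isNonWord` are hypothetical names:

```java
class NonWordDemo {

    /** True if the single character is matched by \W. */
    static boolean isNonWord(char c) {
        return String.valueOf(c).matches("\\W");
    }

    public static void main(String[] args) {
        // Punctuation, the space and the angle brackets are non-word characters;
        // the underscore and letters are not.
        for (char c : new char[] {'!', ' ', '<', '>', '_', 'a'}) {
            System.out.println("'" + c + "' -> " + isNonWord(c));
        }
    }
}
```

Note that `<`, `>` and the space all count as non-word characters, so `\W` on its own is broader than "punctuation".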

  • Unfortunately this won't work: the OP wants a regex that excludes certain non-word characters, like < and >, that are not punctuation. – Bill Horvath Jun 01 '20 at 17:01
  • Brackets, like "(" and ")", would also be considered punctuation. – Klaws Nov 20 '20 at 14:44
0

I was trying to find out how to replace part of a regex match while keeping the rest. Example: Hi , how are you ? -> Hi, how are you?. After studying a little, I found that I could create groups using "()", so I just replaced the whole match with group two, which drops group one, the whitespace group "(\s)".

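        String a = "Hi , how are you ?";
        // Group 1 is the whitespace, group 2 the punctuation that follows it.
        // The backslash in \s has to be doubled inside a Java string literal.
        String p = "(\\s)([,.!?\\-])";
        // Keeping only group 2 removes the whitespace before the punctuation.
        System.out.println(a.replaceAll(p, "$2"));
        //output: Hi, how are you?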
Lucas Lombardi