I found an interesting regex in a Java project: "[\\p{C}&&\\S]"
I understand that the &&
means "set intersection", and \S
is "non-whitespace", but what is \p{C}
, and is it okay to use?
The java.util.regex.Pattern documentation doesn't mention it. The only similar class on the list is \p{Cntrl}
, but they behave differently: they both match on control characters, but \p{C}
matches twice on Unicode characters above U+FFFF, such as PILE OF POO
:
public class StrangePattern {
public static void main(String[] argv) {
// As far as I can tell, this is the simplest way to create a String
// with code points above U+FFFF.
String poo = new String(Character.toChars(0x1F4A9));
System.out.println(poo); // prints ``
System.out.println(poo.replaceAll("\\p{C}", "?")); // prints `??`
System.out.println(poo.replaceAll("\\p{Cntrl}", "?")); // prints ``
}
}
The only mention I've found anywhere is here:
\p{C} or \p{Other}: invisible control characters and unused code points.
However, \p{Other}
does not seem to exist in Java, and the matching code points are not unused.
My Java version info:
$ java -version
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)
Bonus question: what is the likely intent of the original pattern, "[\\p{C}&&\\S]"
? It occurs in a method which validates a string before it is sent in an email: if that pattern is matched, an exception with the message "Invalid string" is raised.