5

Is there a regex which accepts any symbol?

EDIT: To clarify what I'm looking for: I want to build a regex which will accept ANY number of whitespace characters, and then it must contain at least 1 symbol (e.g. , . " ' $ £ etc.) or (not exclusive or) at least 1 character.

tchrist
Skizit
  • Please define "Symbol" - is it any char including whitespaces? Or anything *but* whitespaces... – Andreas Dolk Dec 03 '10 at 12:53
  • @Ulkmun: See my answer: you are including things that Java has trouble with, because they’re in its native character set instead of the legacy character set. If you have to deal with any of these: `!"#$%&'()*+,-./:;<=>?@[\]^_ˋ{|}~¡¢£¤¥¦§¨©«¬®¯°±´¶·¸»¿×÷˂˃˄˅˘˙˚˜˝϶҂՚׀׃׆׳״‐‑‒–—―‖‗‘’‚‛“”„‟†‡•‹›‼‽‾‿⁀` then you must use my fancier formulations. – tchrist Dec 03 '10 at 13:20
  • 1
    Uhm, correct me if I'm wrong, but all of those characters are included in the `\S` class, no? – aioobe Dec 03 '10 at 13:28
  • @Ulkmun: I’m afraid the selected answer is wrong. I can make it fail on simple data very easily. :( – tchrist Dec 03 '10 at 13:47
  • @aioobe: In Java — but not in Perl — the pattern `^\s*\S+$` “succeeds” against `"\t\n   "`. I find that counterintuitive to the point of being wrong: obviously it should fail, not succeed. Nothing but the casuistry of a language-lawyer paid off by the Evil Empire could make anyone believe otherwise. It is simply nuts! – tchrist Dec 03 '10 at 13:50
  • @tchrist: I'm not sure I follow you. `"\t\n "` does not match `^\s*\S+$`. `\S+` says that there must be at least one non-whitespace character, and there are none. [Check this ideone.com demo](http://ideone.com/GFcMc). – aioobe Dec 03 '10 at 13:57
  • Wrong, check this demo: `String sample = "\t\n "; String regex = "\\s*\\S+$"; stdout.printf("String '%s' %s pattern /%s/\n", sample, sample.matches(regex) ? "MATCHES" : "FAILS TO MATCH", regex);` that prints this out (with the newline gobbled by SO): `String '  ' MATCHES pattern /^\s*\S+$/`. Do you understand why? I think you may become upset with me if I have to tell you instead of your figuring it out for yourself. ☹ This is real-world problem I stumbled upon in my job doing biomedical text-mining. It really sucks! – tchrist Dec 03 '10 at 14:22

2 Answers

8

Yes. The dot (`.`) will match any symbol, at least if you use it in conjunction with the `Pattern.DOTALL` flag (otherwise it won't match line terminators such as new-line characters). From the docs:

In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.
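To see the difference, here is a minimal sketch. The class name `DotAllDemo` and the helper `dotStarMatches` are made up for illustration; only `Pattern.compile`, `Pattern.DOTALL` and the inline `(?s)` flag come from `java.util.regex`.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // Returns true if the WHOLE input matches ".*" under the given flags.
    static boolean dotStarMatches(String input, int flags) {
        return Pattern.compile(".*", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String twoLines = "first\nsecond";

        // Default mode: the dot stops at the line terminator.
        System.out.println(dotStarMatches(twoLines, 0));              // false

        // DOTALL mode: the dot also consumes the line terminator.
        System.out.println(dotStarMatches(twoLines, Pattern.DOTALL)); // true

        // The inline flag (?s) switches on the same mode.
        System.out.println(twoLines.matches("(?s).*"));               // true
    }
}
```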


Regarding your edit:

I want to build a regex which will accept ANY number of whitespace characters, and then it must contain at least 1 symbol (e.g. , . " ' $ £ etc.) or (not exclusive or) at least 1 character.

Here is a suggestion:

\s*\S+
  • \s* any number of whitespace characters
  • \S+ one or more ("at least one") non-whitespace characters.
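A quick way to check the suggestion against a few inputs is sketched below. The class and method names are hypothetical, and the sample strings are my own, not from the question. Note that `String.matches` requires the whole string to match, so text containing an inner space is rejected by `\s*\S+`; the `(?s)\s*\S.*` variant from the comments is included for comparison.

```java
class WhitespaceThenText {
    static final String STRICT = "\\s*\\S+";      // the pattern suggested above
    static final String LENIENT = "(?s)\\s*\\S.*"; // the variant from the comments

    static boolean strict(String s)  { return s.matches(STRICT); }
    static boolean lenient(String s) { return s.matches(LENIENT); }

    public static void main(String[] args) {
        System.out.println(strict("   hello"));    // true: leading spaces, then text
        System.out.println(strict("\t\n $"));      // true: a lone symbol is enough
        System.out.println(strict("   "));         // false: no non-whitespace at all
        System.out.println(strict(""));            // false: \S+ needs one character
        System.out.println(strict("two words"));   // false: the inner space is not covered
        System.out.println(lenient("two words"));  // true: .* accepts whatever follows
        System.out.println(lenient("   "));        // false: still needs one \S
    }
}
```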
aioobe
  • Right, so a regex that would accept strings which contain any number of whitespaces and AT LEAST 1 word and any number of symbols would be... `\\s*\\p{Alnum}[\\p{Alnum}\\s]*` ... where does the dot go? – Skizit Dec 03 '10 at 12:50
  • Strictly speaking LF and CR are control codes not symbols but you're still correct in that `.` won't match every possible character value. – Lazarus Dec 03 '10 at 12:51
  • Aren't we confusing "symbol" with "character"? I interpreted "symbol" in the question as "non-alphanumeric character". – BalusC Dec 03 '10 at 12:52
  • I suppose you could change `[\\p{Alnum}\\s]*` into `.*` instead. – aioobe Dec 03 '10 at 12:53
  • Generally when you ask for help with regular expressions, it helps a lot if you provide a few examples of strings that should match, and a few examples of strings that should not match. – aioobe Dec 03 '10 at 12:54
  • ah, well change it to `\s*\S.*` then. Then you're actually quite close to what I suggested previously, change `[\p{Alnum}\s]*` into `.*`: you would then get `\s*\p{Alnum}.*`. – aioobe Dec 03 '10 at 13:04
  • I see a non-ASCII character in your example. You therefore *must* use my solution. Sorry! – tchrist Dec 03 '10 at 13:08
  • **WARNING:** Java’s `\s` fails to match things like U+A0, NO-BREAK SPACE or ` `. Java’s `\p{Punct}` fails to match things like the `£` used in the OP’s example. Java’s `\S` fails to match things like U+85, NEXT LINE (NEL). And Java’s `\b\w+\b` fails to match the string `"élève"` **anywhere whatsoever**. Java’s regex char-class are completely broken. You cannot use them. You have to use the formulations I describe in my answer. I deeply regret this, but it is true, and regret will not change that. – tchrist Dec 03 '10 at 13:13
  • @tchrist, IMHO, I believe you're a bit picky. Besides, `\w` is clearly documented *not* to match `é`, right? Also, there is no need to "shout" using bold caps... – aioobe Dec 03 '10 at 13:17
  • @Ulkmun, I see your concern. Try this pattern: `(?s)\s*\S.*` (or construct your pattern using the `Pattern.DOTALL` flag). – aioobe Dec 03 '10 at 13:19
  • @aioobe: There is a need to shout when day in and day out, you see people making the same mistakes over and over again. The Java charclass shortcuts and the POSIX character classes in Java **work only on legacy data**. They do not even work with Java’s own native character set! This is a *very* serious issue, one I feel people need to be fully informed of. The user in this case mentioned non-legacy data, and you are all giving him legacy-only solutions. If I need to shout to get this gross oversight noticed, then I shall certainly do so. – tchrist Dec 03 '10 at 13:23
  • I work for a biomedical text-mining group at a public university. Well under 1% of text we process falls in the legacy category. Our code is all in Java and Perl. Because Perl regexes handle modern data transparently, **but Java’s do not,** a great deal of effort must be made just to get Java regexes to work with Java characters! It **is** an important issue, one that everyone doing regexes in Java needs to understand. Do **you** understand why Java is incapable of matching the string `"élève"` with the pattern `\b\w+\b` *anywhere at all*, not just the whole thing? Few do, and we few fear it. – tchrist Dec 03 '10 at 13:30
  • @aioobe: *“Picky?”* Is it picky to expect `"élève"` to have *at least* one match using `\b\w+\b`? What’s so picky about that? It’s *not* picky. Compile that pattern. Match it against that string. Try both *matches()* for the whole thing and *find()* for anywhere. Nothing. Niente. Nada. **Do you understand why?** If you do not, you should not be advocating these legacy-only solutions. If you do, then could you please explain to me the contorted justification under which this insane situation makes the least scintilla of sense? – tchrist Dec 03 '10 at 13:44
0

In Java, a symbol is \pS, which is not the same as punctuation characters, which are \pP.
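As a rough illustration of that distinction, the sketch below classifies a few of the characters from the question. The class and helper names are invented for this example; `\pS` and `\pP` are the Unicode general-category properties Symbol and Punctuation as supported by `java.util.regex`.

```java
class SymbolVsPunctuation {
    // True if the single-character string is in Unicode category S (Symbol).
    static boolean isSymbol(String ch) { return ch.matches("\\pS"); }

    // True if the single-character string is in Unicode category P (Punctuation).
    static boolean isPunctuation(String ch) { return ch.matches("\\pP"); }

    public static void main(String[] args) {
        // Currency signs and math operators are symbols.
        System.out.println(isSymbol("$"));         // true
        System.out.println(isSymbol("\u00A3"));    // true (POUND SIGN)
        System.out.println(isSymbol("+"));         // true
        System.out.println(isPunctuation("$"));    // false

        // Commas, full stops and quotes are punctuation.
        System.out.println(isPunctuation(","));    // true
        System.out.println(isPunctuation("\""));   // true
        System.out.println(isSymbol(","));         // false

        // A class that accepts either kind, as the question seems to want.
        System.out.println("$,\u00A3.".matches("[\\pS\\pP]+")); // true
    }
}
```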

I talk about this issue, plus enumerate the types for all the ASCII punctuation and symbols, here in this answer.

Patterns like [\p{Alnum}\s] only work on legacy datasets from the 1960s. To work with the Java native character set, you need something on the order of

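// Letters, marks, decimal and letter numbers, connector punctuation, enclosed alphanumerics.
String identifier_charclass = "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";
// Unicode whitespace code points, written out one by one.
String whitespace_charclass = "[\\u0009\\u000A\\u000B\\u000C\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E\\u2000\\u2001\\u2002\\u2003\\u2004\\u2005\\u2006\\u2007\\u2008\\u2009\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000]";

// Nesting both classes inside one bracket pair yields their union.
String ident_or_white = "[" + identifier_charclass + whitespace_charclass + "]";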
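For comparison, a small sketch that runs the legacy class and the hand-built class against the same inputs. The class name, constants and helper methods are hypothetical, and the whitespace class below lists U+0009 (TAB) explicitly. It assumes the default regex mode, with no `UNICODE_CHARACTER_CLASS` flag set.

```java
class UnicodeAwareClasses {
    static final String IDENTIFIER =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";
    static final String WHITESPACE =
        "[\\u0009\\u000A\\u000B\\u000C\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E"
        + "\\u2000\\u2001\\u2002\\u2003\\u2004\\u2005\\u2006\\u2007\\u2008\\u2009"
        + "\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000]";
    // Nested classes inside one bracket pair form a union.
    static final String IDENT_OR_WHITE = "[" + IDENTIFIER + WHITESPACE + "]";

    // ASCII-only check using the predefined classes in their default mode.
    static boolean legacyAccepts(String s) {
        return s.matches("[\\p{Alnum}\\s]+");
    }

    // Check using the explicit Unicode-aware classes above.
    static boolean unicodeAccepts(String s) {
        return s.matches(IDENT_OR_WHITE + "+");
    }

    public static void main(String[] args) {
        String accented = "the Moli\u00E8re exhibition"; // contains e with grave accent
        String nbsp = "a\u00A0b";                        // contains NO-BREAK SPACE

        System.out.println(legacyAccepts(accented));   // false
        System.out.println(unicodeAccepts(accented));  // true
        System.out.println(legacyAccepts(nbsp));       // false
        System.out.println(unicodeAccepts(nbsp));      // true
    }
}
```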

I’m sorry that Java makes it so difficult to work with modern datasets, but at least it is possible.

Just don’t ask about boundaries or grapheme clusters. For that, see my other postings.

tchrist
  • *"Patterns like `[\p{Alnum}\s]` only work on legacy dataset from the 1960s"* -- Uhm, no, I've seen them work on a few newer ones too... – aioobe Dec 03 '10 at 13:29
  • @aioobe: Nope, you have *not:* `[\p{Alnum}\s]+$` fails on even simple things like `£20`, on `"this and that"`, and on `"the Molière exhibition"`. Welcome to Java! *Are we having fun yet?* – tchrist Dec 03 '10 at 13:39
  • Well, `\p{Alnum}` is clearly documented to match `[a-zA-Z0-9]`, so I wouldn't say that the behavior is buggy. Heck I would have been *surprised* if it matched a `£`. – aioobe Dec 03 '10 at 13:47
  • Fine: add `\p{Punct}` then. Despite their disingenuous bait & switch re Unicode, Java’s stuck in the Dark Ages of computing, the 1960s. They have **fundamentally misunderstood** that `\b` and `\w` are **and must be** ineluctably linked. By severing that linkage they have created asinine Catch-22s in their language that confuse, confound, and consternate anyone trying to use them. You have 3 choices: [1] Don’t use Java regexes [2] Painstakingly rewrite all Java regexes by hand following the guidelines I have here and elsewhere set forth [3] Use my alpha rewrite code now, beta and production later. – tchrist Dec 03 '10 at 14:01