
I'm trying to match a control character sequence of the form \^c, where c is any character that is valid in a control sequence. I have this regular expression, but it doesn't work: \\[^][@-z]

I think the problem is that the caret character (^) is a metacharacter in regular expression syntax.

Cameron Tinker
  • That doesn’t make sense to me. Is there a backslash there? Are these real control characters, or some ASCII sequence implying the same? Why go \c@ .. \cZ only? There are others, you know. – tchrist Feb 04 '11 at 01:51
  • Why are you putting the caret in a character class anyway? – Anon. Feb 04 '11 at 01:51
  • I'm trying to match the literal text for the control characters, not the control characters themselves. – Cameron Tinker Feb 04 '11 at 01:58
  • Control-X is defined as the character whose code point is the result of XOR-ing (`^`) the code point of `X` with the code point of `@`; that is, flipping bit 0x40. – tchrist Feb 04 '11 at 01:59

2 Answers


Match an ASCII text string of the form ^X using the pattern `\^.`, nothing more. Match an ASCII text string of the form \^X with the pattern `\\\^.`. You may wish to constrain that dot to `[A-Z?@_\[\]^\\]`, giving `\\\^[A-Z?@_\[\]^\\]`. The bracketed character class is easier to read as `[?\x40-\x5F]`, hence `\\\^[?\x40-\x5F]` for a literal BACKSLASH, followed by a literal CIRCUMFLEX, followed by something that turns into one of the valid control characters.

Note that this is the pattern as it looks when printed out, or as you would read it from a file. It is what you need to pass to the regex compiler. If you have it as a Java string literal, you must of course double each of those backslashes: `"\\\\\\^[?\\x40-\\x5F]"`. Yes, it is insane looking, but that is because Java does not support regexes directly the way Groovy and Scala, or Perl and Ruby, do. Regex work is always easier without the extra bbaacckksslllllaasshheesssssess. :)
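As a rough illustration of the doubled-backslash form, here is a minimal Java sketch that compiles the pattern from a string literal and rewrites each literal \^X into the corresponding control character, using the XOR-with-0x40 definition given in the question's comments. The class name `ControlEscapeDemo` and the method `translate` are made up for this example, and the character class only covers the uppercase range discussed above, so a lowercase form such as \^g is left untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ControlEscapeDemo {
    // Regex text is \\\^[?\x40-\x5F]; every backslash is doubled for the Java literal.
    static final Pattern CONTROL_ESCAPE = Pattern.compile("\\\\\\^[?\\x40-\\x5F]");

    // Hypothetical helper: replaces each backslash-caret-X sequence with the
    // character whose code point is X XOR 0x40.
    static String translate(String input) {
        Matcher m = CONTROL_ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());   // copy text before the match
            char x = input.charAt(m.end() - 1);   // the character after the caret
            out.append((char) (x ^ 0x40));        // flip bit 0x40
            last = m.end();
        }
        out.append(input, last, input.length());  // copy the remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        String s = translate("ring\\^Gbell");
        System.out.println((int) s.charAt(4));    // prints 7, the BEL code point
    }
}
```

A lexer would normally do this substitution in its action code rather than with a second regex pass; the sketch only shows that the escaped pattern matches what it is meant to.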

If you had real control characters instead of indirect representations of them, you would use \pC for all literal code points with the property GC=Other, or \p{Cc} for just GC=Control.
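To make the difference between the two properties concrete, the following Java sketch tests a few single-character strings against `\p{Cc}` and `\pC`. The class and method names are invented for the example; it assumes the standard `java.util.regex` handling of Unicode general categories, under which SOFT HYPHEN (U+00AD) is a format character (Cf) and therefore falls under Other but not under Control.

```java
import java.util.regex.Pattern;

class RealControlDemo {
    static final Pattern CONTROL = Pattern.compile("\\p{Cc}"); // GC=Control only
    static final Pattern OTHER = Pattern.compile("\\pC");      // GC=Other (includes Cc, Cf, ...)

    static boolean isControl(String s) {
        return CONTROL.matcher(s).matches();
    }

    static boolean isOther(String s) {
        return OTHER.matcher(s).matches();
    }

    public static void main(String[] args) {
        String bel = String.valueOf((char) 0x07);        // a real BEL character
        String softHyphen = String.valueOf((char) 0xAD); // a format character (Cf)
        System.out.println(isControl(bel));        // true
        System.out.println(isControl(softHyphen)); // false
        System.out.println(isOther(softHyphen));   // true
        System.out.println(isOther("A"));          // false
    }
}
```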

tchrist
  • I'm not quite sure I understand \pC. I'm writing a lexical analyzer using JLex and I need to recognize valid control character sequences in a string and translate them to their ASCII equivalents. For example, the string "\^g" would print the bell character or cause the computer speaker to beep. I need a regular expression to match control character sequences like "\^g". – Cameron Tinker Feb 04 '11 at 01:54
  • @pcman: Do you have a literal BACKSLASH followed by a literal CIRCUMFLEX followed by a character that is one of `[A-Z@?\[\]_^]`? – tchrist Feb 04 '11 at 01:57
  • Yes, I am trying to match the literal text as it would appear in a string. – Cameron Tinker Feb 04 '11 at 02:10
  • This matches accents as well – Tofandel May 09 '22 at 12:06

Check this out: http://www.regular-expressions.info/characters.html . You should be able to use \cA to \cZ to match the control characters.
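For what it is worth, here is a small Java sketch of how the `\cA` to `\cZ` escapes behave. Note that they match the real control characters (code points 1 through 26), not the literal backslash-caret text from the question. The class name `ControlLetterDemo` and its method are hypothetical names for this example.

```java
import java.util.regex.Pattern;

class ControlLetterDemo {
    // Checks that \cA .. \cZ each match the control character whose
    // code point is the letter's code point XOR 0x40 (1 through 26).
    static boolean allLettersMatch() {
        for (char letter = 'A'; letter <= 'Z'; letter++) {
            Pattern p = Pattern.compile("\\c" + letter);
            String control = String.valueOf((char) (letter ^ 0x40));
            if (!p.matcher(control).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allLettersMatch()); // true
        // The escape does not match the three-character literal text.
        System.out.println(Pattern.matches("\\cG", "\\^G")); // false
    }
}
```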

gbvb
  • What about `\c?` for DELETE (U+7F)? Did you know that Java thinks `\c{` is `;` and that `\c;` is `{`? They forgot to check that result is `\p{Cc}`. Oops! – tchrist Feb 04 '11 at 01:55
  • Stale link. The page still exists, but the info about `\c` has moved to https://www.regular-expressions.info/nonprint.html – LarsH Sep 02 '21 at 17:31