3

Is there a way to subtract characters or a character range from another character class?

I need to match a substring that contains only printable characters, excluding "<" and ">".

[[:print:]] - ('<' | '>')

That's because "<" and ">" are delimiters and must not occur within the string itself.

<abc> // valid
<ab<c> // invalid
<ab\tc> // invalid
KingCrunch
  • Are you trying to parse HTML with a regex? – Ignacio Vazquez-Abrams Dec 03 '10 at 14:57
  • I'm not sure what you're asking. You want to remove the `<` and `>` characters, or you want to remove any string that looks like ``? – eykanal Dec 03 '10 at 14:58
  • If you want to know: I want to parse NTriples files ( http://www.w3.org/2001/sw/RDFCore/ntriples/ ), which also answers the second question. I want to get the three parts of the triple. Maybe I'll solve it another way (splitting at CR, LF, or CRLF), but it would be cool if somebody could solve the problem anyway, because I have needed something like this in the past as well. – KingCrunch Dec 03 '10 at 15:02
  • What functions do you use? Those using [POSIX ERE](http://php.net/book.regex) or [PCRE](http://php.net/book.pcre)? – Gumbo Dec 03 '10 at 15:25

2 Answers

4

For ASCII input, [:print:] is equivalent to [\x20-\x7E], so to exclude < (\x3C) and > (\x3E) you can write [\x20-\x3B\x3D\x3F-\x7E].

The following matches a run of printable characters that contains neither < nor >:

/[\x20-\x3B\x3D\x3F-\x7E]+/
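
As a rough check against the three sample strings from the question, here is a minimal sketch in Java. It assumes the delimiters are part of the match and that the whole input must be one token; the class name `DelimitedToken` and the method `isValid` are made up for illustration.

```java
import java.util.regex.Pattern;

public class DelimitedToken {
    // Printable ASCII except '<' (\x3C) and '>' (\x3E), wrapped in the delimiters.
    private static final Pattern TOKEN =
            Pattern.compile("<[\\x20-\\x3B\\x3D\\x3F-\\x7E]+>");

    // True only if the entire input is a single <...> token (assumption: no surrounding text).
    static boolean isValid(String s) {
        return TOKEN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("<abc>"));    // plain printable content
        System.out.println(isValid("<ab<c>"));   // nested delimiter
        System.out.println(isValid("<ab\tc>"));  // tab is not printable
    }
}
```

The same character class should work unchanged in PCRE; only the string escaping differs.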
Toto
3

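Some regex flavors, Java's java.util.regex among them, let you take the union, intersection, and difference of character classes directly. Check your flavor's documentation before relying on this syntax elsewhere.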

[a[b]]

is the union.

[a&&b]

is the intersection.

[a&&[^b]]

is the subtraction.
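
To make the three operations concrete, here is a small sketch using java.util.regex. The class name `ClassSetOps` and the helper `matches` are illustrative only. The last pattern applies subtraction to the question's case, assuming Java's `\p{Print}` (printable ASCII) is an acceptable stand-in for `[[:print:]]`.

```java
import java.util.regex.Pattern;

public class ClassSetOps {
    // Convenience wrapper: does the whole string match the regex?
    static boolean matches(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        // Union: a-c or x-z.
        System.out.println(matches("[a-c[x-z]]", "y"));
        // Intersection: lowercase letters that are also vowels.
        System.out.println(matches("[a-z&&[aeiou]]", "e"));
        // Subtraction: printable ASCII minus the two delimiters.
        System.out.println(matches("[\\p{Print}&&[^<>]]+", "abc"));
        System.out.println(matches("[\\p{Print}&&[^<>]]+", "ab<c"));
    }
}
```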

I regularly do rather complex set operations in Java. For example, this is what you have to use in Java

[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]

for a modern version of \w. (You don’t have to do that in Perl, since \w isn’t broken there the way it is in Java.) Word boundaries get a tad harder:

(?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))
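
A sketch of how the word-character class inside those lookarounds can be reused in Java, building the boundary from one string constant instead of repeating the class four times. The names `UnicodeWords`, `WORD_CHAR`, `BOUNDARY`, `words`, and `firstAsciiWord` are invented for this example, and the comparison with `\w` assumes the pattern is compiled with default flags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeWords {
    // The word-character class used inside the lookarounds above.
    static final String WORD_CHAR =
            "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";

    // Boundary: a word character on exactly one side of the current position.
    static final String BOUNDARY =
            "(?:(?<=" + WORD_CHAR + ")(?!" + WORD_CHAR + ")"
            + "|(?<!" + WORD_CHAR + ")(?=" + WORD_CHAR + "))";

    // Collects every maximal run of word characters.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(BOUNDARY + WORD_CHAR + "+" + BOUNDARY).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // For comparison: the first match of Java's built-in \w+ with default flags.
    static String firstAsciiWord(String text) {
        Matcher m = Pattern.compile("\\w+").matcher(text);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(words("\u00E9lan d\u00E9j\u00E0 vu"));
        System.out.println(firstAsciiWord("\u00E9lan"));
    }
}
```

With default flags, `\w` covers only ASCII letters, digits, and underscore, so the built-in version skips the leading accented letter.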

But at least now you have a \b that works in Java, not a broken thing that screws up everything you do. To implement \X in languages that don’t have it, you can either use a legacy grapheme cluster, defined as:

(?>\PM\pM*)
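
As an illustration of the legacy cluster pattern, the following Java sketch splits a string into a base character plus any trailing combining marks. The class name `LegacyGraphemes` and the method `clusters` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacyGraphemes {
    // One non-mark code point followed by any number of combining marks.
    private static final Pattern CLUSTER = Pattern.compile("(?>\\PM\\pM*)");

    static List<String> clusters(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = CLUSTER.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "e" + COMBINING ACUTE ACCENT, then "a": three chars, two clusters.
        System.out.println(clusters("e\u0301a").size());
    }
}
```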

Or you can use an extended grapheme cluster, defined as (or nearly as, actually):

(?:(?:\u000D\u000A)|(?:[\u0E40\u0E41\u0E42\u0E43\u0E44\u0EC0\u0EC1\u0EC2\u0EC3\u0EC4\uAAB5\uAAB6\uAAB9\uAABB\uAABC]*(?:[\u1100-\u115F\uA960-\uA97C]+|([\u1100-\u115F\uA960-\uA97C]*((?:[[\u1160-\u11A2\uD7B0-\uD7C6][\uAC00\uAC1C\uAC38]][\u1160-\u11A2\uD7B0-\uD7C6]*|[\uAC01\uAC02\uAC03\uAC04])[\u11A8-\u11F9\uD7CB-\uD7FB]*))|[\u11A8-\u11F9\uD7CB-\uD7FB]+|[^[\p{Zl}\p{Zp}\p{Cc}\p{Cf}&&[^\u000D\u000A\u200C\u200D]]\u000D\u000A])[[\p{Mn}\p{Me}\u200C\u200D\u0488\u0489\u20DD\u20DE\u20DF\u20E0\u20E2\u20E3\u20E4\uA670\uA671\uA672\uFF9E\uFF9F][\p{Mc}\u0E30\u0E32\u0E33\u0E45\u0EB0\u0EB2\u0EB3]]*)|(?s:.))

Of course, you don’t have to go through such extreme rewrites if you happen to be using a language with the radical notion of actually supporting its own native character set!

Unfortunately, Java is not one of those.

For regexes, I suggest using something more modern, like Perl, Python, or Ruby. Because otherwise you’re stuck in the Stone Age.

tchrist