5

I'm trying to write a regular expression for Java that matches if there is a semicolon that does not have two (or more) leading '-' characters.

I'm only able to get the opposite working: A semicolon that has at least two leading '-' characters.

([\-]{2,}.*?;.*)

But I need something like

([^([\-]{2,})])*?;.*

I'm somehow not able to express 'not at least two - characters'.

Here are some examples I need to evaluate with the expression:

; -- a           : should match
-- a ;           : should not match
-- ;             : should not match
--;              : should not match
-;-              : should match
---;             : should not match
-- semicolon ;   : should not match
bla ; bla        : should match
bla              : should not match (; is mandatory)
-;--;            : should match (the first occurring semicolon must not have two or more consecutive leading '-')
Ben Voigt
Richard
  • How many semicolons can be in string? Is string like `-;--;` correct? – Pshemo Jul 21 '14 at 15:04
  • Also do we want to forbid only leading `-`? What about strings like `x--;`? – Pshemo Jul 21 '14 at 15:06
  • @Pshemo The first one has to match (I updated my question accordingly). The second one must not match, just to keep things simple. Otherwise I would need to write a complete parser, and that's not the intention of my small application. – Richard Jul 21 '14 at 15:13

5 Answers

2

It seems that this regex matches what you want

String regex = "[^-]*(-[^-]+)*-?;.*";

DEMO

Explanation: matches() will accept a string in which:

  • [^-]* the string can start with any number of non-dash characters
  • (-[^-]+)*-?; is a bit tricky: before we match ; we need to make sure that no - is directly followed by another -, so:
    • (-[^-]+)* each - is followed by at least one non-dash character
    • -? or a single - sits right before the ;
  • ;.* if the earlier conditions were fulfilled, we can accept ; and any characters (.*) after it.
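
To try the pattern against the examples from the question, a small harness along these lines could be used. It is only a sketch: the class name DashSemicolonCheck and the helper accepts are made up for illustration, and the sample strings are copied from the question.

```java
import java.util.regex.Pattern;

class DashSemicolonCheck {
    // Regex from the answer; compiled once and reused.
    static final Pattern P = Pattern.compile("[^-]*(-[^-]+)*-?;.*");

    // matches() succeeds only if the whole input fits the pattern.
    static boolean accepts(String s) {
        return P.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "; -- a", "-- a ;", "-- ;", "--;", "-;-",
            "---;", "-- semicolon ;", "bla ; bla", "bla", "-;--;"
        };
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + accepts(s));
        }
    }
}
```

The string `x--;` from the comments is rejected as well, because neither `(-[^-]+)*` nor `-?` can consume two adjacent dashes.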

A more readable version, but probably a little slower:

((?!--)[^;])*;.*

Explanation:

To make sure that there is a ; in the string we can use .*;.* with matches().
But we need to add some conditions on the characters before the first ;.

To make sure that the matched ; is the first one, we can write the regex as

[^;]*;.*

which means:

  • [^;]* zero or more non semicolon characters
  • ; first semicolon
  • .* zero or more of any characters (by default . does not match line separators like \n or \r)

So now all we need to do is make sure that the character matched by [^;] is not part of --. To do so we can use look-around mechanisms, for instance:

  • (?!--)[^;] before matching [^;], (?!--) checks that the next two characters are not --; in other words, the character matched by [^;] can't be the first - in a -- pair
  • [^;](?<!--) after matching [^;], (?<!--) checks that the two characters just consumed are not --; in other words, the character matched by [^;] can't be the last - in a -- pair.
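
The two look-around variants can be compared side by side with a sketch like the following. The lookahead pattern is the one given above; the full lookbehind pattern is an assumption, assembled by putting the `[^;](?<!--)` fragment into the same `(...)*;.*` frame. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class LookaroundCheck {
    // Lookahead variant from the answer: refuse to step onto the first '-' of "--".
    static final Pattern AHEAD = Pattern.compile("((?!--)[^;])*;.*");

    // Lookbehind variant (assumed full form): refuse to finish the second '-' of "--".
    static final Pattern BEHIND = Pattern.compile("([^;](?<!--))*;.*");

    static boolean acceptsAhead(String s) {
        return AHEAD.matcher(s).matches();
    }

    static boolean acceptsBehind(String s) {
        return BEHIND.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"; -- a", "-- a ;", "-;-", "---;", "bla ; bla", "bla", "-;--;", "x--;"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" ahead=" + acceptsAhead(s) + " behind=" + acceptsBehind(s));
        }
    }
}
```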
Pshemo
0

You need a negative lookahead!

This regex rejects any string that starts with your original pattern, because the lookahead is evaluated once, at the start of the input:

(?!-{2,}.*?;.*).*?;.*

It matches a string which contains a semicolon, unless the string begins with two or more dashes followed later by a semicolon.

Example: Regex Working

Adam Yost
0

How about using this regex in Java:

[^;]*;(?<!--[^;]{0,999};).*

The only caveat is that it works only when there are at most 999 characters between the -- and the ;
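
The 999-character limit can be seen with a sketch like this one. The class name and the helper dashesThenGap are hypothetical; the helper builds `--`, then a run of filler characters, then `;`.

```java
import java.util.regex.Pattern;

class BoundedLookbehindCheck {
    // Regex from the answer: after matching the first ';', look back for "--" within 999 chars.
    static final Pattern P = Pattern.compile("[^;]*;(?<!--[^;]{0,999};).*");

    static boolean accepts(String s) {
        return P.matcher(s).matches();
    }

    // Builds "--" + gap filler characters + ";" (illustrative helper).
    static String dashesThenGap(int gap) {
        StringBuilder sb = new StringBuilder("--");
        for (int i = 0; i < gap; i++) {
            sb.append('a');
        }
        return sb.append(';').toString();
    }

    public static void main(String[] args) {
        System.out.println(accepts("-- a ;"));            // "--" is within reach of the lookbehind
        System.out.println(accepts("-;--;"));             // first ';' has only one leading '-'
        System.out.println(accepts(dashesThenGap(999)));  // still within the bound
        System.out.println(accepts(dashesThenGap(1000))); // beyond the bound, so "--" goes unnoticed
    }
}
```

If inputs can be longer than that, the bound in `{0,999}` would have to be raised, or one of the lookahead-based patterns used instead.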

Java Regex Demo

Community
anubhava
  • Thanks, @anubhava. I would prefer a solution without groups. But in general your solution works. – Richard Jul 21 '14 at 15:16
  • Hi anubhava, I have created a question almost dedicated to you! lol. I really like that technique. Could you check it? http://stackoverflow.com/questions/24808793/regex-technique-to-disallow-variable-length-lookbehind-using-or/ – Federico Piazza Jul 21 '14 at 15:21
0

How about just splitting the string along -- and, if there are two or more substrings, checking if the last one contains a semicolon?
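
A minimal sketch of a split-based check, written as one possible reading of this suggestion rather than its exact wording: it splits at the first `--` and tests the piece before it, since a semicolon in that piece cannot have two dashes anywhere ahead of it. The class and method names are hypothetical.

```java
class SplitCheck {
    // Accepts when a ';' occurs before the first "--" (or when there is a ';' and no "--" at all).
    static boolean accepts(String s) {
        // Limit 2: at most one split, so index 0 is everything before the first "--".
        String beforeDashes = s.split("--", 2)[0];
        return beforeDashes.contains(";");
    }

    public static void main(String[] args) {
        String[] samples = {"; -- a", "-- a ;", "-;-", "---;", "bla ; bla", "bla", "-;--;"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + accepts(s));
        }
    }
}
```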

Edwin Buck
  • I would prefer a solution where I can call just .matches(), because all other statements in my class work this way. Just for reading purposes. – Richard Jul 21 '14 at 15:18
  • @RichardW. That's fine; however, if it is for readability's sake, some of these answers are far less readable (if you care about _understandability_) than two or three calls which make the task obvious. – Edwin Buck Jul 21 '14 at 15:56
0

I think this is what you're looking for:

^(?:(?!--).)*;.*$

In other words, match from the start of the string (^), zero or more characters (.*) followed by a semicolon. But replacing the dot with (?:(?!--).) causes it to match any character unless it's the beginning of a two-hyphen sequence (--).

If performance is an issue, you can exclude the semicolon as well, so it never has to backtrack:

^(?:(?!--|;).)*;.*$

EDIT: I just noticed your comment that the regex should work with the matches() method, so I padded it out with .*. The anchors aren't really necessary, but they do no harm.
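
Both patterns can be exercised with matches() in a sketch like the one below. The class name and the constant names BASIC and SEMICOLON_EXCLUDED are invented for illustration; the regex strings are the two from this answer.

```java
import java.util.regex.Pattern;

class TemperedDotCheck {
    // Any character, as long as it does not start a "--" sequence.
    static final Pattern BASIC = Pattern.compile("^(?:(?!--).)*;.*$");

    // Same idea, but the repeated part also stops at the first ';'.
    static final Pattern SEMICOLON_EXCLUDED = Pattern.compile("^(?:(?!--|;).)*;.*$");

    static boolean acceptsBasic(String s) {
        return BASIC.matcher(s).matches();
    }

    static boolean acceptsExcluded(String s) {
        return SEMICOLON_EXCLUDED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"; -- a", "-- a ;", "-;-", "---;", "bla ; bla", "bla", "-;--;"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" basic=" + acceptsBasic(s) + " excluded=" + acceptsExcluded(s));
        }
    }
}
```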

Alan Moore