1

I need a regex to match a string as follows:

  • Must begin with [
  • Must contain a ]
  • Is allowed to have any characters (including whitespaces) between the [ and the ]
  • Must contain at least one character between [ and ]
  • Is allowed to have a ; after the ]. Following the ; all characters are allowed (although sort of irrelevant since I don't care about it)
  • If and only if a ; after a ] is present, whitespaces (read tabs, spaces - although I can guarantee no \r\n\f\v will be present, which is why I used \s below) are allowed between the ] and the ;. If ; is not present after the ], then ] must be the end of the string.

I ended up with the following regex which passed all my initial tests: ^\[([^]]+)](?:\s+?;)?.

Speed is key here, so I am looking to improve on the regex that I have in order to shave off a few cycles if possible.

I'm not really sure whether the usage of a lookahead would be useful here.

EDIT

eg:

[some;thing] - Valid, with capture group some;thing

[something] - Valid, with capture group something

[something] - Invalid, does not begin with [

[something] ;ojasodj - Valid, capture group something

[something] - Invalid, space after ] without a ; present

[something]; - Valid, capture group something

[] - Invalid, must contain at least one character between [ and ]

JonU
  • 83
  • 6
  • 1
    `(?:\s+?;)?` should be `(?:\s*;)?` otherweise you require a space in front of the `;` in order to match. You also do not properly test for the end of the string for your last condition. Do you use this in any kind of fullmatch function? – Sebastian Proske Jan 15 '19 at 13:43
  • 1
    @SebastianProske good point about the `\s*` instead of `\s+?`. About the last condition, I feel like the current regex implies it. Since I specified that it must begin with `[` and what is allowed to come after `]`, for the regex to match either `]` must be the last character or it must be followed by an (optional) unknown number of whitespaces and a `;`. – JonU Jan 15 '19 at 13:48
  • 1
    @JonU just add `[x]test` to your testcases to see. You might also add the code you use to call the regex. – Sebastian Proske Jan 15 '19 at 13:50
  • Will it be used with "search" or "match"? It can be relevant for productivity – Superluminal Jan 15 '19 at 13:52
  • @Predicate I need to get the string inside the `[` `]` – JonU Jan 15 '19 at 13:55
  • @SebastianProske point taken, regex updated to `^\[([^]]+)](?:(?:\s*;)|$)` – JonU Jan 15 '19 at 14:00
  • Depending on your data you might want to turn around the order of your alternation (and you can drop the innermost grouping). `$` should fail or succeed quicker than the other thing, unless a vast majority of your cases is in the lines of `[x] ;stuff`. – Sebastian Proske Jan 15 '19 at 14:06
  • @SebastianProske I'm running some benchmarks so please provide an answer with a regex I can test against. If it's the fastest (and correct), I'll mark it. As for the data, this is for a file using a format similar to ini, which means that most lines will **not** start with `[` at all. But of the ones that do, I'd say about 99% will also **not** contain a `;`. – JonU Jan 15 '19 at 14:13
  • There is no real problem here. No downsides of current solution has been stated. I'm voting to close. – revo Jan 15 '19 at 14:27
  • @revo actually if you bother to read the regex or the comments, you'll find at least 2 bugs with the original regex. As for the lack of downsides of the current solution, I'm not a regex expert which is why I'm here asking for direction from those that know more than me. Were I to already know if there were downsides/bottlenecks on the current regex, surely I would have been able to fix it myself. Apologies for not knowing that since about ... *now!* you need to already know the answer to the question that you're asking if you want to post in SO. – JonU Jan 15 '19 at 14:35
  • I never knew I had a problem with my current approach, clearly I wasn't testing for the right things. And if you believe that, with the bugs corrected, the original regex is faster, put that it an answer and explain why it's faster than the accepted answer and I'll review it. If it's the case, it will be marked as the correct answer. But downvoting my question because I was unaware of a couple of bugs and therefore didn't say "hey can you help me fix it" is just wrong... – JonU Jan 15 '19 at 14:47
  • @revo please read the comment under the accepted solution. The OP did benchmark the different approaches and they indeed differ in performances by two or three order of magnitudes. The OP clearly stated his problem: `Speed is key` and he accepted solution based on this criterion. – fjardon Jan 15 '19 at 14:47
  • 1
    @JonU [There is no problem in your question that others can reproduce](https://stackoverflow.com/help/on-topic). You are telling us you even didn't know that your own workaround was fine enough or not. So what if it was fine? (though it is). I don't see any real question (even a question mark out of those regular expressions!) in this question. – revo Jan 15 '19 at 14:55
  • @revo also notice my original regex matches `[x]test` for example. – JonU Jan 15 '19 at 14:58
  • @revo would it make you happier if I edit the question to add a big "it don't work" to it and saying it's matching `[x]test` when it shouldn't? Please remember to vote this down as well: https://stackoverflow.com/questions/279507/what-is-meant-by-immutable I can't reproduce it. – JonU Jan 15 '19 at 15:01
  • There are some exceptions to SO. There are many off-topic high traffic highly voted much helpful questions that users are able but won't like to clear them away. This doesn't include yours. You'd better edit your question to state the problems you have with. – revo Jan 15 '19 at 15:12
  • If I'm a gold badge holder in regex tag I know what I'm talking about. I'm done at this point. – revo Jan 15 '19 at 15:50

3 Answers3

2

TL;DR: ^\[([^]]+)](?:$|\s*;)

^\[([^]]+)] is already the optimal way to match the first part of your regex, unless you can drop the capturing group. By using the negated character class you avoid any kind of unnecessary backtracking in failing cases that would be involved in any kind of .* or .*? pattern.

To fulfill your other rules, you need to either match the end of the string ($) or optional spaces and a semicolon, so that should be (?:$|\s*;). I would put the $ first, as this is shorter match (thus quicker success), but this also depends on your data (if the second case ius the vast majority, put that first).

Full pattern being ^\[([^]]+)](?:$|\s*;)

Be aware, that $might be followed by an optional \n, but your testcases didn't look multiline :)

Sebastian Proske
  • 8,255
  • 2
  • 28
  • 37
  • Correct, this is not multiline – JonU Jan 15 '19 at 14:20
  • In a test with live data and 100k iterations, this was the fastest regex (followed by my revised version on the comments, then fjardon's and finally Michal's. I particularly like that it first attempts to match `$` first as in 99% of the cases a `]` will be followed by `$`. – JonU Jan 15 '19 at 14:24
  • @JonU can you post the results of your benchmarks ? I think it would be interesting to see what are the timing differences between the different approaches – fjardon Jan 15 '19 at 14:29
  • @fjardon I feel like posting the results without the data to go with it would be counterproductive as clearly the results depend hugely on the data being used. And I can't share the data used in the tests. However I will say that over a course of 10 runs Sebastian's solution was consistently faster by about 30ms~300ms when compared with the second best solution and about 1sec faster than the slowest solution, with the slowest solution taking close to 10secs to complete. – JonU Jan 15 '19 at 14:40
  • @JonU well if the numbers are so much away from each others and the dataset was the same then I find it interesting to know which version did behave *worst*. Clearly some constructs were behaving *very* badly. Understanding why this is, is productive :) – fjardon Jan 15 '19 at 14:43
  • @fjardon I had already stated above that Michal's answer was the worst performing one. – JonU Jan 15 '19 at 14:48
0

Try this pattern ^\[[^\]]+\](?(?=\s*;)\s*;.*|$)

Explanation:

^\[[^\]]+\] will match text enclosed in square brackets at the beginning of the string (^) (at least one character other than ] inside them).

(?(?=\s*;)\s*;.*|$) - if what follows after enclosing square bracket is only whitespaces and semicolon, then match them, otherwise assure that it's end of string ($).

Demo

Michał Turczyn
  • 32,028
  • 14
  • 47
  • 69
0

Here's how you can do that with code instead

public static bool IsValid(string str, out string capture)
{
    capture = null;

    // A null string is invalid
    if(str == null) return false;

    // An empty string is invalid
    if(str.Length == 0) return false;

    // A string that does not start with [ is invalid
    if(str[0] != '[') return false;
    int end = str.IndexOf(']');

    // A string that does not have a ] is invalid
    if(end == -1) return false;

    // A string that does not have anything between the [ and ] is invalid
    if(end == 1) return false;

    // if the ] is not the end of the string we need to look for a ;.
    if(end != str.Length -1)
    {
        bool semicolon = false
        for(int i = end + 1; i < str.Length; i++)
        {
            // ; found so we can stop looking at characters.
            if(str[i] == ';') 
            {
                semicolon = true;
                break;
            }

            // If non-whitespace is between the ] and ; the string is invalid
            if(!char.IsWhiteSpace(str[i])) return false;
        }

        // No ; found so the string is invalid
        if(!semicolon) return false;
    }

    // Capture the string between [ and ]
    capture = str.Substring(1,end - 1);
    return true;
}

Obviously not as short as a regular expression, but might run faster.

juharr
  • 31,741
  • 4
  • 58
  • 93
  • Wouldn't it be faster to use `if (end != str.Length - 1 && str.Substring(1, end).TrimStart()[0] != ';') { return false}` ? – JonU Jan 15 '19 at 14:53
  • @JonU I think you mean `if (end != str.Length - 1 && str.Substring(end + 1).TrimStart()[0] != ';') return false`. And that will create a string of everything after the ] and then trim the whitespace to create another string so that's actually doing more looping than is needed and creating extra throw away strings. It could be a bit faster to skip the `IndeOf` for the semicolon and just do the `for` loop until you hit it making sure everything before it is whitespace. – juharr Jan 15 '19 at 14:57