I'm trying to detect the pattern:

The string "BP" followed by either 2 or 4 double values (which I later want to capture), all separated by whitespace.

For instance:

  • BP 1.0 3.5
  • BP -1e-3 0.72 3.7 1.22e2

To detect a double, I'm using the pattern [+\-]?(?:0|[1-9]\d*)(?:\.\d*)?(?:[eE][+\-]?\d+)? which I obtained from here.

Unfortunately, after testing a few strings, I discovered that my code fails to distinguish whether the string BP is followed by 2 or by 4 numbers. Here is a test case:

void Main()
{
    var testString = "BP -1.23e4 5.67";

    var mspaces = @"\s*"; // meaning as many spaces as you want
    var cdouble = @"([+\-]?(?:0|[1-9]\d*)(?:\.\d*)?(?:[eE][+\-]?\d+)?)"; // meaning capture a double

    var shortPattern = String.Join("",  mspaces, "BP", mspaces, cdouble, mspaces, cdouble, mspaces);
    var longPattern = String.Join("",  mspaces, "BP", mspaces, cdouble, mspaces, cdouble, mspaces, cdouble, mspaces, cdouble, mspaces);

    var bpShort = Regex.Match(testString, shortPattern, RegexOptions.IgnoreCase);
    var bpLong = Regex.Match(testString, longPattern, RegexOptions.IgnoreCase);

    if (bpLong.Success)
    {
        Console.WriteLine("Long pattern detected"); // !!FALSE-MATCH!!
    }
    if (bpShort.Success)
    {
        Console.WriteLine("Short pattern detected");
    }   
}  

In this example, even though there are only two numbers (-1.23e4 and 5.67), the long pattern matches four different numbers (-1.23e4, 5., 6, 7).

Maybe I'm wrong to add enclosing parentheses to capture each number, or maybe I should further specify that a double must end with either whitespace or end-of-string. I don't know.

CitizenInsane

1 Answer

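It is rather obvious. A regex engine backtracks until it finds some way to make the whole pattern match. So if you look for four numbers, the regex will do its best to split up the string such that four numbers are matched. Because \s* also matches zero whitespace, it is free to cut 5.67 into 5., 6 and 7.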

To solve the problem, you need to enforce spaces between two matches.

This can be done by replacing:

var mspaces = @"\s*";

By:

var mspaces = @"\s+";

(+ means one or more whereas * means zero or more, so with * the regex can decide not to use any space between two numbers.)
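The difference can be checked directly against the test string from the question. This is only a small sketch: the SeparatorDemo class and its BuildPattern helper are hypothetical names introduced for illustration, and the double pattern is the one quoted in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SeparatorDemo
{
    // Same double pattern as in the question, wrapped in a capture group.
    const string CDouble = @"([+\-]?(?:0|[1-9]\d*)(?:\.\d*)?(?:[eE][+\-]?\d+)?)";

    // Hypothetical helper: "BP" followed by `count` doubles, each preceded by `sep`.
    public static string BuildPattern(string sep, int count)
    {
        var pattern = "BP";
        for (int i = 0; i < count; i++)
        {
            pattern += sep + CDouble;
        }
        return pattern;
    }

    public static void Main()
    {
        var input = "BP -1.23e4 5.67";

        // Optional separators: the engine may split "5.67" into "5.", "6", "7".
        Console.WriteLine(Regex.IsMatch(input, BuildPattern(@"\s*", 4))); // True

        // Mandatory separators: four numbers need four runs of whitespace.
        Console.WriteLine(Regex.IsMatch(input, BuildPattern(@"\s+", 4))); // False
    }
}
```

With \s+ the input only contains two runs of whitespace, so the four-number pattern can no longer succeed by cutting a number into pieces.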

You also should remove the beginning spaces in the regex concatenation. Thus replace:

String.Join("",  mspaces, "BP"...

by:

String.Join("",  "BP"...

The same goes for the trailing mspaces. In that case, one gets this ideone.com.

Perhaps you don't want to match strings like ABP 1 5 because there must be some space between A and BP. In that case you can use a word boundary @"\b".
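A minimal sketch of the word-boundary effect follows. To keep the focus on \b, the numbers are matched with a simplified \S+ instead of the full double pattern; that simplification and the BoundaryDemo name are assumptions made for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // Simplified patterns: \S+ stands in for the full double pattern.
    public const string Loose = @"BP\s+\S+\s+\S+";
    public const string Strict = @"\bBP\s+\S+\s+\S+";

    public static void Main()
    {
        // Without \b, "BP" is found inside "ABP".
        Console.WriteLine(Regex.IsMatch("ABP 1 5", Loose));    // True

        // With \b, there is no word boundary between 'A' and 'B'.
        Console.WriteLine(Regex.IsMatch("ABP 1 5", Strict));   // False

        // Leading spaces are fine: space-to-'B' is a word boundary.
        Console.WriteLine(Regex.IsMatch("  BP 1 5", Strict));  // True
    }
}
```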

Finally, as @MattBurland argues, any string that matches the four-number pattern of course also matches the two-number pattern. If you want the match to reach the end of the string, you can put $ at the end of the pattern. If you want the string to start with BP, you can put ^ at the front.
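Putting these suggestions together, one possible sketch looks like the following. The BandPassParser class and its Build and Parse methods are hypothetical names, not part of the original code. Allowing optional whitespace right after ^ and right before $ is an assumption; trimming the line first would work as well.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class BandPassParser
{
    // Double pattern from the question, wrapped in a capture group.
    const string CDouble = @"([+\-]?(?:0|[1-9]\d*)(?:\.\d*)?(?:[eE][+\-]?\d+)?)";

    // ^\s*BP, then exactly `count` doubles each preceded by \s+, then \s*$.
    static Regex Build(int count)
    {
        var body = string.Concat(Enumerable.Repeat(@"\s+" + CDouble, count));
        return new Regex(@"^\s*BP" + body + @"\s*$", RegexOptions.IgnoreCase);
    }

    static readonly Regex ShortForm = Build(2);
    static readonly Regex LongForm = Build(4);

    // Returns the parsed values, or null when the line is neither form.
    public static double[] Parse(string line)
    {
        var m = LongForm.Match(line);
        if (!m.Success) m = ShortForm.Match(line);
        if (!m.Success) return null;

        return m.Groups.Cast<Group>()
            .Skip(1) // group 0 is the whole match
            .Select(g => double.Parse(g.Value, NumberStyles.Float, CultureInfo.InvariantCulture))
            .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(Parse("BP 1.0 3.5").Length);                  // 2
        Console.WriteLine(Parse("  BP -1e-3 0.72 3.7 1.22e2 ").Length); // 4
        Console.WriteLine(Parse("BP -1.23e4 5.67").Length);             // 2
        Console.WriteLine(Parse("BP 1 2 3") == null);                   // True
    }
}
```

Because both patterns are anchored, the order in which they are tried no longer matters: a line with four numbers cannot match the two-number form, and vice versa.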

Willem Van Onsem
  • Also worth noting for the OP, the long string will still match the short pattern as well, which they might consider to be an error. Adding anchors (`^` and `$`) would solve that problem. – Matt Burland Jan 22 '15 at 18:57
  • Thank you very much for these clear explanations, it makes sense. Using `$` will also be much better than testing the longer string before the shorter one (this was how I intended to discriminate between the two). – CitizenInsane Jan 22 '15 at 20:19
  • For information, the strings I'm parsing are filter descriptions read from a file. I have strings like `BP 1.0 2.0` (band-pass between 1.0 and 2.0), or `BR 1.0 1.1 2.0 2.1` (band-reject with a first transition from 1.0 to 1.1 and a second transition from 2.0 to 2.1). Sometimes there are extra spaces before the `BP`, `BR`, `HP` letters ... a trim can be sufficient. – CitizenInsane Jan 22 '15 at 20:26