
I'm trying to split a string representing an XPath such as:

string myPath = "/myns:Node1/myns:Node2[./myns:Node3=123456]/myns:Node4";

I need to split on '/' (the '/' excluded from results, as with a normal string split) unless the '/' happens to be within the '[ ... ]' (where the '/' would both not be split on, and also included in the result).

So what a normal string[] result = myPath.Split("/".ToCharArray()) gets me:

result[0]: //Empty string, this is ok
result[1]: myns:Node1
result[2]: myns:Node2[.
result[3]: myns:Node3=123456]
result[4]: myns:Node4

result[2] and result[3] should essentially be combined, and I should end up with:

result[0]: //Empty string, this is ok
result[1]: myns:Node1
result[2]: myns:Node2[./myns:Node3=123456]
result[3]: myns:Node4

Since I'm not super fluent in regex, I've tried manually recombining the results into a new array after the split, but what concerns me is that while it's trivial to get it to work for this example, regex seems the better option in the case where I get more complex xpaths.

For the record, I have looked at the following questions:
Regex split string preserving quotes
C# Regex Split - commas outside quotes
Split a string that has white spaces, unless they are enclosed within "quotes"?

While they should be sufficient to help me with my problem, I'm running into a few issues and confusing aspects that prevent them from helping me.
In the first 2 links, as a newbie to regex I'm finding them hard to interpret and learn from. They are looking for quotes, which look identical between left and right pairs, so translating it to [ and ] is confusing me, and trial and error is not teaching me anything, rather, it's just frustrating me more. I can understand fairly basic regex, but what these answers do is a little more than what I currently understand, even with the explanation in the first link.
In the third link, I won't have access to LINQ as the code will be used in an older version of .NET.

Code Stranger
  • I agree that the regex in the linked questions can tend to overwhelm beginners... I like to think I'm half decent with regex when I need to be, but I'll admit those do confuse me... – Broots Waymb Nov 29 '16 at 16:36

4 Answers


XPath is a complex language, and splitting an XPath expression on top-level slashes fails in many situations. Examples:

/myns:Node1/myns:Node2[./myns:Node3=123456]/myns:Node4
string(/myns:Node1/myns:Node2)

I suggest another approach that covers more cases. Instead of trying to split, try to match each part between slashes with the Regex.Matches(String, String) method. The advantage of this approach is that you can freely describe what these parts look like:

string pattern = @"(?xs)
    [^][/()]+ # all that isn't a slash or a bracket
    (?: # predicates (possibly nested)
        \[ 
        (?: [^]['""] | (?<c>\[) | (?<-c>] )
          | "" (?> [^""\\]* (?: \\. [^""\\]* )* ) "" # quoted parts
          | '  (?> [^'\\]*  (?: \\. [^'\\]*  )* ) '
        )*?
        (?(c)(?!)) # check if brackets are balanced
        ]
      |  # same thing for round brackets
        \(
        (?: [^()'""] | (?<d>\() | (?<-d>\) )
          | "" (?> [^""\\]* (?: \\. [^""\\]* )* ) ""
          | '  (?> [^'\\]*  (?: \\. [^'\\]*  )* ) '
        )*?
        (?(d)(?!))
        \)
    )*
  |
    (?<![^/])(?![^/]) # empty string between slashes, at the start or end
";

Note: to be sure that the string is entirely parsed, you can add at the end of the pattern something like: |\z(?<=(.)). This way, you can test if the capturing group exists to know if you are at the end of the string. (But you can also use the match position, the length and the length of the string.)
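The position-and-length check mentioned in the note can be sketched roughly as follows. This is only an illustration under stated assumptions: StepPattern is a deliberately simplified stand-in for the full pattern (flat predicates only, no quoted parts, no round brackets), and the names XPathStepMatcher and MatchSteps are hypothetical. It avoids LINQ so that it should also compile on older .NET versions.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class XPathStepMatcher
{
    // Simplified stand-in (an assumption, not the full pattern above):
    // a run of characters that are neither slashes nor square brackets,
    // followed by any number of flat [...] predicates.
    private const string StepPattern = @"[^/\[\]]+(?:\[[^\]]*\])*";

    // Returns the matched steps, or null when the matches do not account
    // for the whole input (something other than '/' was skipped).
    public static string[] MatchSteps(string path)
    {
        List<string> steps = new List<string>();
        int covered = 0; // index just past the text accounted for so far

        foreach (Match m in Regex.Matches(path, StepPattern))
        {
            // Only slashes may sit between the previous match and this one.
            if (!OnlySlashes(path, covered, m.Index)) return null;
            steps.Add(m.Value);
            covered = m.Index + m.Length;
        }

        // The tail after the last match must also contain only slashes.
        if (!OnlySlashes(path, covered, path.Length)) return null;
        return steps.ToArray();
    }

    private static bool OnlySlashes(string s, int from, int to)
    {
        for (int i = from; i < to; i++)
        {
            if (s[i] != '/') return false;
        }
        return true;
    }
}
```

A null result signals that the pattern skipped part of the input, for example the stray bracket in "a/b]c", so the caller can reject the expression instead of silently working with an incomplete list.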


Casimir et Hippolyte
  • That would be the regex we can't tweet :P – vks Nov 29 '16 at 18:03
  • @vks: It's a bit long, I admit. – Casimir et Hippolyte Nov 29 '16 at 18:06
  • With a regex that long, I'd almost rather just parse the string manually. :/ – Abion47 Nov 29 '16 at 20:06
  • @Abion47: feel free to do it (with the same features or better). – Casimir et Hippolyte Nov 29 '16 at 20:15
  • @CasimiretHippolyte I've added a manual parsing option as an answer. I'm not altogether familiar with the XPath syntax, though, so I would be grateful if you could point out an error case I might have overlooked. – Abion47 Nov 29 '16 at 20:42
  • @Abion47: It's a good initiative and, if done well, probably the most efficient way in terms of performance (since C# is compiled). It was the missing answer for this kind of question. I'm not sure I have all the particularities of XPath syntax in mind, but for the record, my answer tries to handle cases where there are: 1) predicates (possibly nested), possibly with paths inside, 2) predicates with quoted parts inside (parts that may contain slashes), 3) functions like `string(...)` that must be seen as a single part. – Casimir et Hippolyte Nov 29 '16 at 20:53

If a Regex pattern of a complexity like Casimir et Hippolyte suggests is required, then perhaps Regex is not the best option in this circumstance. To add a non-Regex possible solution, here is what the process might look like when the XPath string is parsed manually:

public string[] Split(string input, char splitChar, char groupStart, char groupEnd)
{
    List<string> splits = new List<string>();

    int startIdx = 0;
    int groupNo = 0;

    for (int i = 0; i < input.Length; i++)
    {
        if (input[i] == splitChar && groupNo == 0)
        {
            splits.Add(input.Substring(startIdx, i - startIdx));
            startIdx = i + 1;
        }
        else if (input[i] == groupStart)
        {
            groupNo++;
        }
        else if (input[i] == groupEnd)
        {
            groupNo = Math.Max(groupNo - 1, 0);
        }
    }

    splits.Add(input.Substring(startIdx, input.Length - startIdx));

    // Drop empty entries without LINQ, since the question targets an older .NET version.
    splits.RemoveAll(string.IsNullOrEmpty);
    return splits.ToArray();
}

Personally, I think this is much easier to both understand and implement. To use it, you can do the following:

var input = "/myns:Node1/myns:Node2[./myns:Node3=123456]/myns:Node4[text()='some[] brackets']";
var split = Split(input, '/', '[', ']');

This will output the following:

split[0] = "myns:Node1"
split[1] = "myns:Node2[./myns:Node3=123456]"
split[2] = "myns:Node4[text()='some[] brackets']"
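
The depth counter handles this input because the brackets inside the quoted text happen to be balanced. A possible variant that ignores everything inside string literals is sketched below. It assumes literals are delimited by ' or " and contain no escape sequences (as in XPath 1.0), and the class name QuoteAwareSplitter is made up for illustration. It also avoids LINQ.

```csharp
using System;
using System.Collections.Generic;

public static class QuoteAwareSplitter
{
    public static string[] Split(string input, char splitChar, char groupStart, char groupEnd)
    {
        List<string> parts = new List<string>();
        int start = 0;      // start index of the current part
        int depth = 0;      // nesting depth of groupStart/groupEnd
        char quote = '\0';  // the quote character currently open, or '\0' if none

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];

            if (quote != '\0')
            {
                // Inside a string literal: only the matching quote ends it.
                if (c == quote) quote = '\0';
            }
            else if (c == '\'' || c == '"')
            {
                quote = c;
            }
            else if (c == groupStart)
            {
                depth++;
            }
            else if (c == groupEnd)
            {
                if (depth > 0) depth--;
            }
            else if (c == splitChar && depth == 0)
            {
                // Skip empty parts, such as the one before a leading slash.
                if (i > start) parts.Add(input.Substring(start, i - start));
                start = i + 1;
            }
        }

        if (start < input.Length) parts.Add(input.Substring(start));
        return parts.ToArray();
    }
}
```

With this variant, an input such as /a/b[text()='x]/y']/c is split into a, b[text()='x]/y'] and c, because the unbalanced ] and the / inside the quotes are ignored.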
Abion47
  • As an aside, a pattern isn't less efficient than a shorter one just because it is long. Keep in mind that something like `a(?!.*b)` must test the string until the end for each successful match. That isn't the case in my pattern, where each result is separated from the previous one by only one failing position (at each slash). But I admit it could be better written using a capture group to extract parts and the `\G` anchor to ensure that the matches are contiguous. – Casimir et Hippolyte Nov 29 '16 at 21:03
  • @CasimiretHippolyte I'm not saying that Regex is undesirable because a longer pattern means it's less efficient. I'm saying that the longer a Regex pattern becomes, the more cumbersome it gets to debug, maintain, and understand. It gets to the point that you are forcing a tool to work for no reason other than because it's what you've decided to use. In these cases, a custom parsing method becomes more desirable, as it's simpler to understand and maintain, and could very well be more efficient in execution. (Always keep the KISS principle in mind.) – Abion47 Nov 29 '16 at 21:51
  • @CasimiretHippolyte I've run a rudimentary benchmark, and during 10,000 iterations my parser runs 2-3 times faster than the Regex approach. https://dotnetfiddle.net/fld3p8 – Abion47 Nov 29 '16 at 22:13
  • "I'm saying that the longer a Regex pattern becomes, the more cumbersome it gets to debug": This is a common but childish idea. About the KISS principle, I'm not in the US Navy, sorry I can't understand. About your benchmark, it isn't relevant since your code doesn't handle many cases (quoted strings, path enclosed in a function, escaped quotes, ...). And why do you want to execute 10000 times the same thing? – Casimir et Hippolyte Nov 29 '16 at 22:30
  • `This is a common but childish idea.` It's not childish at all. I would much rather maintain a compile-safe simple and concise method than a verbose potentially error-prone Regex pattern. – Abion47 Nov 29 '16 at 22:52
  • `About the KISS principle, I'm not in the US Navy, sorry I can't understand.` The KISS principle stands for "Keep It Simple, Stupid". It's a principle for helping to prevent going crazy with a particular mindset to the extent that forcing it to work involves making the process far more complicated than it needs to be. – Abion47 Nov 29 '16 at 22:53
  • `About your benchmark, it isn't relevant since your code doesn't handle many cases` Keep in mind here, that OP's problem is not about parsing XPath exactly. He merely wants to split a string into pieces on `'/'`, but ignoring the delimiter when it is within square brackets. Going crazy on all the nested possibilities is not necessary when all you need to do is a simple split given specific simple rules. – Abion47 Nov 29 '16 at 22:56
  • `And why do you want to execute 10000 times the same thing?` Because that is how you run benchmarks. Running it once doesn't produce accurate results because of timing errors due to CPU time and process priorities, whereas running it thousands of times or more and taking the average is a much more accurate representation of actual runtime. – Abion47 Nov 29 '16 at 22:59
  • @CasimiretHippolyte Oops, just realized I forgot to tag in my responses. – Abion47 Nov 30 '16 at 16:38

The second link you posted is actually perfect for your needs. All it needs is some tweaking to detect brackets instead of apostrophes:

\/(?=(?:[^[]*\[[^\]]*])*[^]]*$)

Basically, it matches a forward slash only if the rest of the string after it consists of complete [...] pairs followed by text that contains no ']'. In other words, a slash is used as a split point only when it is not inside a pair of square brackets. You can use it like so:

string[] matches = Regex.Split(myPath, "\\/(?=(?:[^[]*\\[[^\\]]*])*[^]]*$)");
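
As a minimal, self-contained sketch of that call, applied to the path from the question (the class and method names BracketAwareSplit and SplitPath are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketAwareSplit
{
    // Verbatim string, so the regex backslashes do not need doubling.
    private const string SlashOutsideBrackets = @"\/(?=(?:[^[]*\[[^\]]*])*[^]]*$)";

    public static string[] SplitPath(string path)
    {
        return Regex.Split(path, SlashOutsideBrackets);
    }

    public static void Main()
    {
        string myPath = "/myns:Node1/myns:Node2[./myns:Node3=123456]/myns:Node4";
        string[] parts = SplitPath(myPath);

        // The leading slash produces an empty first entry, as with String.Split.
        for (int i = 0; i < parts.Length; i++)
        {
            Console.WriteLine("result[" + i + "]: " + parts[i]);
        }
    }
}
```

For this input the array has four entries: an empty string, myns:Node1, myns:Node2[./myns:Node3=123456] and myns:Node4.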
Abion47
  • Excellent! Thanks! The confusing part of that link was just which apostrophe should be which bracket. But with this I can compare the 2 and learn from it! – Code Stranger Nov 29 '16 at 16:47
  • This won't work once there is a bracket inside some string literal, e.g. `/myns:Node1/myns:Node2[./myns:Node3=123456]/myns:Node4[text()='some[] brackets']` – Wiktor Stribiżew Nov 29 '16 at 16:50
  • @WiktorStribiżew i am not an expert in xpath but have not seen such , i mean `[]` inside quotes – vks Nov 29 '16 at 16:53
  • @vks: a value with brackets inside is frequent, for example in forms with checkboxes: ``. To reach them with XPath, you may write `//input[@name="myvar[]"]`. Also, in the XPath language, predicates can be nested: `//span[./ancestor::div[@class="myclass"]]` – Casimir et Hippolyte Nov 29 '16 at 17:04
  • @CasimiretHippolyte i know its possible but i have not seen such a weird naming in any projects i have worked in....the second case seems more legit!!!!!!!!!!! thanx – vks Nov 29 '16 at 17:07
  • @vks: checkboxes are named like that most of the time. (The goal is to obtain an array for all checkboxes of a form via GET or POST on the server side.) – Casimir et Hippolyte Nov 29 '16 at 17:18
\/(?![^\[]*\])

Try this. See the demo:

https://regex101.com/r/uLcWux/1

In C#, write the pattern as a verbatim string (prefixed with @), or double the backslashes in a regular string: \\/(?![^\\[]*\\])

P.S. This only works for simple XPaths that have no nested brackets and no [] inside quotes.
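
A small sketch of how this pattern behaves, including the nested case it is not meant to handle (the names SimpleBracketSplit and SplitPath are hypothetical, and the short path a[./b[c]] is only a stand-in for a nested predicate):

```csharp
using System;
using System.Text.RegularExpressions;

public static class SimpleBracketSplit
{
    // A slash is a split point unless a ']' follows with no '[' before it.
    private const string Pattern = @"\/(?![^\[]*\])";

    public static string[] SplitPath(string path)
    {
        return Regex.Split(path, Pattern);
    }

    public static void Main()
    {
        // Flat predicate: the slash inside [...] is left alone.
        string[] flat = SplitPath("/myns:Node1/myns:Node2[./myns:Node3=123456]/myns:Node4");
        Console.WriteLine(flat.Length); // 4 (the first entry is empty)

        // Nested predicate: the inner '[' hides the closing ']' from the
        // lookahead, so the slash inside the outer predicate is split on.
        string[] nested = SplitPath("a[./b[c]]");
        Console.WriteLine(nested.Length); // 2: "a[." and "b[c]]"
    }
}
```

The second call shows why the limitation matters: the predicate is cut in two as soon as another [ appears between the slash and the closing ].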

vks