2

I have a string from which I want to extract a list of the substrings contained between two delimiters: [' and ']. I tried several regex patterns that I found online (this question in particular), but I can't get the characters escaped correctly to make the regex work.

How can I extract a list of strings between two strings? I want to do something like this:

List<string> TheListOfStrings = Regex.Matches(TheText, "....");

The source is a JavaScript block from which I want to extract object keys: for instance, given TheObject['SomeProp'] = TheOtherObject['OtherProp'], the list should contain SomeProp and OtherProp. The keys can appear multiple times in the input text.

halfer
frenchie

3 Answers

3

Your main difficulty is getting the square brackets recognised as literal delimiters rather than as regex metacharacters.

string input = "a['bc']d['ef']gh']";
MatchCollection matches = Regex.Matches(input, @"\['(?<key>.*?)'\]");
var listOfKeys = matches.Cast<Match>().Select(x => x.Groups["key"].Value);

does the trick.
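
For reference, here is a minimal self-contained sketch of the same approach that materialises the result as the List<string> the question asks for. The class name KeyExtractor and the method name ExtractKeys are hypothetical and not part of the original answer; the pattern is the one shown above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeyExtractor
{
    // Hypothetical helper: returns every key found between [' and '] in the text.
    public static List<string> ExtractKeys(string text)
    {
        // \[ and \] match literal brackets; .*? lazily captures the key in between.
        return Regex.Matches(text, @"\['(?<key>.*?)'\]")
            .Cast<Match>()
            .Select(m => m.Groups["key"].Value)
            .ToList();
    }
}
```

Calling KeyExtractor.ExtractKeys("TheObject['SomeProp'] = TheOtherObject['OtherProp']") should yield SomeProp and OtherProp, in that order. Duplicate keys are kept; append .Distinct() before .ToList() if each key is wanted only once.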

If performance matters and the regex is going to be run many times, compiling it can give a noticeable win:

string input = "a['bc']d['ef']gh']";
Regex re = new Regex(@"\['(?<key>.*?)'\]", RegexOptions.Compiled);
MatchCollection matches = re.Matches(input);
var listOfKeys = matches.Cast<Match>().Select(x => x.Groups["key"].Value);
ClickRick
  • Try now I've debugged it *properly*. – ClickRick Apr 26 '14 at 18:23
  • So the good news is that it works. The problem is that the previous implementation I had (manually looping through the input string) returned 3331 items and the regex you have is returning 3330 items. I need to see why, but overall the regex took less than a second vs my manual implementation, which currently takes over 10s. – frenchie Apr 26 '14 at 18:32
  • Would you particularly want it to perform faster over large input data? – ClickRick Apr 26 '14 at 18:35
  • Ok, everything works. The faster the better, of course; this is part of a background task where users are not involved so perf is not essential but anything helps! What do you suggest? – frenchie Apr 26 '14 at 18:48
  • I've added a "compiled regex" variant. Time the old and the new over your data. – ClickRick Apr 26 '14 at 18:56
  • Ok, so the code went from about 30ms to about 20ms so it did improve with the compiled option on. Note that the previous code took about 12-15 seconds to run... Thanks for your help!!! – frenchie Apr 26 '14 at 19:02
3

Use the general pattern

(?<=prefix)find(?=suffix)

It uses lookbehind and lookahead, which check for the surrounding patterns without including them in the result.

Where
  prefix   is \['; the left bracket is escaped.
  find      is .*?; sequence of any chars but as few as possible.
  suffix   is ']

(?<=\[').*?(?='])
List<string> TheListOfStrings = Regex.Matches(input, @"(?<=\[').*?(?='])")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();

If you are calling the same regular expression repeatedly, create a reusable instance of it instead of calling the static method. If you are using it many times, also consider the Compiled option: matching will run faster, but the tradeoff is a longer initialization time.

var regex = new Regex(@"(?<=\[').*?(?='])", RegexOptions.Compiled);

while (loop_condition) {

    List<string> TheListOfStrings = regex.Matches(input)
        .Cast<Match>()
        .Select(m => m.Value)
        .ToList();
    ...

}
Olivier Jacot-Descombes
  • Show that pattern in context, extracting the desired parts of the input string and producing the desired output strings, using the test data in the comments under the question. – ClickRick Apr 26 '14 at 19:06
  • I added a code example. The example is tested and produces the desired output. – Olivier Jacot-Descombes Apr 26 '14 at 19:17
  • I tried that: it works and the code is cleaner. Then, I put a stopwatch and timed both. This code runs in 217ms vs 10ms for the other code. I thought it had to do with the compilation of the regex so on the following call, the results were 213ms vs 9ms. Not sure why such a difference but that's what I'm seeing. – frenchie Apr 26 '14 at 22:10
  • I added an example that will be faster if you are calling it many times. – Olivier Jacot-Descombes Apr 27 '14 at 16:30
1

This may meet your needs: (?<=\[")[^"]+(?="\])|(?<=\[')[^']+(?='\])

For the input a['bc']d['ef']gh'], this returns bc and ef.
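
Because the pattern contains both backslashes and double quotes, it has to be escaped before it can be used as a C# string literal. One way to write it, as a sketch: in a verbatim string (@"...") backslashes are taken literally and each double quote is doubled. The class name QuotedKeyExtractor and the method name ExtractKeys are hypothetical names chosen for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuotedKeyExtractor
{
    // Verbatim string: backslashes are literal, and "" stands for one double quote.
    // The resulting regex is (?<=\[")[^"]+(?="\])|(?<=\[')[^']+(?='\])
    private const string Pattern = @"(?<=\["")[^""]+(?=""\])|(?<=\[')[^']+(?='\])";

    // Hypothetical helper: returns keys delimited by either [" "] or [' '].
    public static List<string> ExtractKeys(string text)
    {
        return Regex.Matches(text, Pattern)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }
}
```

The same pattern can also be written as a regular (non-verbatim) string by doubling every backslash and writing each double quote as \", but the two escaping styles cannot be mixed in one literal.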

Rahul
Gavin
  • I tried this: var Test = Regex.Matches(TheText, "(?<=\[")[^"]+(?="\])|(?<=\[')[^']+(?='\])"); using your regex but there's a problem in the rule that's preventing compilation; it's underlined red. – frenchie Apr 26 '14 at 18:16
  • try this: string pattern = "(?<=\\[\")[^\"]+(?=\"\\])|(?<=\\[')[^']+(?='\\])"; – Gavin Apr 26 '14 at 18:25
  • still has "unrecognized escape sequences" underlines. – frenchie Apr 26 '14 at 18:50
  • @frenchie It might be as simple as the missing `@` sign before the string. – ClickRick Apr 26 '14 at 19:07