Extract BBCode using regex

Question

I'm trying to extract a subset of BBCode ([U],[B],[I]) from a string using a regex. I found plenty of questions asking how to simply parse/replace BBCode in a string, but I want to extract all parts - both the normal text parts and the ones enclosed in tags.

I came up with the following regex: (.*?)(\[[UBI]\](.*?)\[\/[UBI]\])(.*?)

It seems to almost work, except it misses any "normal text" at the end of the string. For example

test1 [B]bold text[/B] test2 [U]underlined[/U] test3

This results in two matches:

Match 1:
  group1: test1
  group2: [B]bold text[/B]
  group3: bold text

Match 2:
  group1: test2
  group2: [U]underlined[/U]
  group3: underlined

How can I make it match the trailing test3 as well, either as a new match or as group 4 (which was my intention)?

That seems to work. Maybe it depends on settings of the environment that you're running it in (which you didn't mention)... — MBaas, Aug 09 '21 at 10:30
@MBaas Hmm, where did you try it? I tried it both on [regexr.com](https://regexr.com/) and in Dart code, and no matter which modifiers I try I can't get the last part (`test3`) to be included in the matches. — Magnus, Aug 09 '21 at 11:19
I tried it in Regexbuddy (sorry, needs download, a Windows app) — MBaas, Aug 09 '21 at 12:32

score 0 · Answer 1 · answered Aug 09 '21 at 16:24

The problem is the .*? at the end of the pattern. It never consumes any text: a lazy quantifier first tries to match zero characters and only expands when the rest of the pattern fails to match. Here nothing follows the final .*?, so the overall match already succeeds with that group empty, and the engine returns it without consuming anything more.
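
This can be checked directly. The following is a minimal Dart sketch that runs the pattern from the question and reports what the last group captured and what text was never matched. The names questionPattern, trailingGroups and unmatchedTail are made up for this illustration.

```dart
// The pattern from the question, unchanged.
final questionPattern = RegExp(r'(.*?)(\[[UBI]\](.*?)\[\/[UBI]\])(.*?)');

/// Text captured by the trailing lazy group (group 4) in each match.
List<String?> trailingGroups(String input) =>
    [for (final m in questionPattern.allMatches(input)) m.group(4)];

/// Text left over after the end of the last match.
String unmatchedTail(String input) {
  var end = 0;
  for (final m in questionPattern.allMatches(input)) {
    end = m.end;
  }
  return input.substring(end);
}
```

For the sample string from the question, trailingGroups should return two empty strings and unmatchedTail should return " test3": group 4 takes part in every match but captures nothing, so the trailing text is left outside all matches.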

One possible solution is to split the string with a regex while keeping the captured substrings in the output. Unfortunately, Dart does not support this directly, so I adapted an existing solution to your case:

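extension RegExpExtension on RegExp {
  /// Splits [input] on every match of this pattern. Each entry holds the text
  /// before the match, then optionally the whole match, then the first
  /// [grpnum] capture groups. The last entry holds the text after the final match.
  List<List<String?>> allMatchesWithSep(String input, int grpnum, bool includematch, [int start = 0]) {
    var result = <List<String?>>[];
    for (var match in allMatches(input, start)) {
      // Text between the previous match (or the start) and this match.
      var res = <String?>[input.substring(start, match.start)];
      if (includematch) {
        res.add(match.group(0));
      }
      for (int i = 0; i < grpnum; i++) {
        res.add(match.group(i + 1));
      }
      start = match.end;
      result.add(res);
    }
    // Whatever remains after the last match, e.g. the trailing " test3".
    result.add([input.substring(start)]);
    return result;
  }
}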

extension StringExtension on String {
  List<List<String?>> splitWithDelim(RegExp pattern, int grpnum, bool includematch) =>
      pattern.allMatchesWithSep(this, grpnum, includematch);
}

void main() {
  String text = "test1 [B]bold text[/B] test2 [U]underlined[/U] test3";
  RegExp rx = RegExp(r"\[[UBI]\]([\w\W]*?)\[\/[UBI]\]");
  print(text.splitWithDelim(rx, 1, true));
}

Output:

[[test1 , [B]bold text[/B], bold text], [ test2 , [U]underlined[/U], underlined], [ test3]]

Note that the pattern now contains just one capturing group, which is why grpnum (the number of capture groups to include) is 1. Since you need the whole match in the results, includematch is set to true.
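
If a flat list of pieces is easier to work with than nested lists, the same idea can be sketched as follows. This is a variation and not part of the code above: the Segment class and splitBbcode function are hypothetical names. It also uses a backreference (\1) so that the closing tag has to repeat the letter of the opening tag, which the [UBI] class in the closing tag does not enforce.

```dart
/// One piece of the input: plain text (tag == null) or the content of a tag.
class Segment {
  final String? tag; // 'U', 'B', 'I', or null for plain text
  final String text;
  Segment(this.tag, this.text);

  @override
  String toString() => tag == null ? 'text($text)' : '$tag($text)';
}

/// Splits [input] into plain-text and tagged segments, in document order.
List<Segment> splitBbcode(String input) {
  // Group 1 is the tag letter, group 2 the enclosed text.
  // \1 requires the closing tag to use the same letter as the opening tag.
  final rx = RegExp(r'\[([UBI])\]([\s\S]*?)\[/\1\]');
  final segments = <Segment>[];
  var last = 0;
  for (final m in rx.allMatches(input)) {
    if (m.start > last) {
      segments.add(Segment(null, input.substring(last, m.start)));
    }
    segments.add(Segment(m.group(1), m.group(2)!));
    last = m.end;
  }
  // Trailing plain text after the final tag, if any.
  if (last < input.length) {
    segments.add(Segment(null, input.substring(last)));
  }
  return segments;
}
```

Calling splitBbcode on the sample string should give five segments, the last one being the plain text " test3". Input with mismatched tags, such as [B]x[/U], is returned as a single plain-text segment because the backreference prevents a match.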

[\w\W] matches any character, including line break characters, which . does not match by default.
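
The difference is easy to see on a tag whose content spans two lines. The sketch below compares three patterns; the variable names and the firstCapture helper are invented for this example. Dart's RegExp constructor also accepts a dotAll flag, which makes . match line breaks and is an alternative to [\w\W].

```dart
/// Returns group 1 of the first match of [rx] in [input], or null if there is no match.
String? firstCapture(RegExp rx, String input) => rx.firstMatch(input)?.group(1);

// Plain dot: does not cross line breaks.
final dotPattern = RegExp(r'\[B\](.*?)\[/B\]');

// Character class that matches every character, line breaks included.
final classPattern = RegExp(r'\[B\]([\w\W]*?)\[/B\]');

// Same as dotPattern, but with dotAll enabled so that . also matches line breaks.
final dotAllPattern = RegExp(r'\[B\](.*?)\[/B\]', dotAll: true);
```

With the input '[B]line1\nline2[/B]', firstCapture should return null for dotPattern and the full two-line content for both classPattern and dotAllPattern.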
