
In Dart, I would like to split a string using a regular expression and include the matching delimiters in the resulting list. So with the delimiter ., I want the string 123.456.789 to get split into [ 123, ., 456, ., 789 ].

In some languages, like C#, JavaScript, Python and Perl, according to https://stackoverflow.com/a/15668433, this can be done by simply including the delimiters in capturing parentheses. The behaviour seems to be documented at https://ecma-international.org/ecma-262/9.0/#sec-regexp.prototype-@@split.

This doesn't seem to work in Dart, however:

print("123.456.789".split(new RegExp(r"(\.)")));

yields exactly the same thing as without the parentheses. Is there a way to get split() to work like this in Dart? Otherwise I guess it will have to be an allMatches() implementation.

Edit: Using ((?<=\.)|(?=\.)) as the regex, with lookbehind and lookahead, apparently does the job for a single delimiter. I will actually have a bunch of delimiters, though, and I'm not sure how efficient this method is. Can someone advise whether it's fine? Legibility is certainly reduced: to allow the delimiters . and ;, would one need ((?<=\.)|(?=\.)|(?<=;)|(?=;)) or ((?<=\.|;)|(?=\.|;))? Testing

print("123.456.789;abc;.xyz.;ABC".split(new RegExp(r"((?<=\.|;)|(?=\.|;))")));

indicates that both work.

Ozzin
  • Split on `(?!^|$)\b` – ctwheels Dec 31 '19 at 17:37
  • The delimiter isn't always going to be `.` - it could be one of a bunch of expressions. – Ozzin Dec 31 '19 at 17:56
  • that's fine, I didn't specify `.`, it'll split on word boundary locations – ctwheels Dec 31 '19 at 17:56
  • What's expected from `123.456.789;abc;.xyz.;ABC`? – ctwheels Dec 31 '19 at 18:00
  • You need to write a custom method for it, String.split does not allow this in Dart. – Wiktor Stribiżew Dec 31 '19 at 18:29
  • @ctwheels: I might want `delim` as a delimiter. From the example I gave in the edit, I would want `[ 123, ., 456, ., 789, ;, abc, ;, ., xyz, ., ;, ABC ]` – Ozzin Dec 31 '19 at 18:32
  • @WiktorStribiżew: Ok, seems like it. The look{ahead|behind} method seems to work: it matches any empty characters and then looks to see whether they come before or after a `.`. But I don't know enough about regular expressions to tell whether this is an inefficient way to match things. – Ozzin Dec 31 '19 at 18:35
  • @Ozzin using regex is *easy* but may not be the best tool (depends on the exact job). That being said, it looks like you may be able to use regex for this. For your edit, you can use `(?!^|$)\b|(?!\w)\B(?!\w)` assuming your version of dart is 2.3.0 or greater (they added lookbehinds in that version: https://github.com/dart-lang/sdk/blob/master/CHANGELOG.md#230---2019-05-08). You can see this regex in use [here](https://regex101.com/r/slh4Mm/1) - my substitution acting to mimic your split for display purposes. – ctwheels Dec 31 '19 at 19:34

2 Answers


There is no direct support for it in the standard library, but it is fairly straightforward to roll your own implementation based on RegExp.allMatches(). For example:

extension RegExpExtension on RegExp {
  List<String> allMatchesWithSep(String input, [int start = 0]) {
    var result = <String>[];
    for (var match in allMatches(input, start)) {
      result.add(input.substring(start, match.start));
      result.add(match[0]!);
      start = match.end;
    }
    result.add(input.substring(start));
    return result;
  }
}

extension StringExtension on String {
  List<String> splitWithDelim(RegExp pattern) =>
      pattern.allMatchesWithSep(this);
}

void main() {
  print("123.456.789".splitWithDelim(RegExp(r"\.")));
  print(RegExp(r" ").allMatchesWithSep("lorem ipsum dolor sit amet"));
}
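
If the input can contain adjacent delimiters, or several different delimiters, a variant along the following lines might be useful. This is only a sketch: `splitKeepingDelimiters` is a hypothetical name, not part of the Dart SDK or of the code above, and it assumes that empty pieces and empty matches should simply be dropped.

```dart
// Hypothetical helper (an assumption, not an SDK API): splits `input` on
// `pattern`, keeps each matched delimiter, and drops empty pieces.
List<String> splitKeepingDelimiters(String input, RegExp pattern) {
  final parts = <String>[];
  var start = 0;
  for (final match in pattern.allMatches(input)) {
    // Skip empty matches so they never contribute an empty delimiter.
    if (match.end == match.start) continue;
    // Add the text before the delimiter only when there is some.
    if (match.start > start) {
      parts.add(input.substring(start, match.start));
    }
    parts.add(match[0]!);
    start = match.end;
  }
  // Add the trailing text, again only when it is non-empty.
  if (start < input.length) {
    parts.add(input.substring(start));
  }
  return parts;
}

void main() {
  // A character class covers the single-character delimiters '.' and ';'.
  final delimiters = RegExp(r'[.;]');
  print(splitKeepingDelimiters('123.456.789;abc;.xyz.;ABC', delimiters));
  // [123, ., 456, ., 789, ;, abc, ;, ., xyz, ., ;, ABC]
}
```

Multi-character delimiters would need an alternation such as `,|;|\.\.` instead of a character class.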
Dabbel
Reimer Behrends
  • Excellent - I didn't know about extensions. This fits the bill well. One might want to check for empty strings in some places, for example when adding in the final part of `input`, but that depends on the application. – Ozzin Jan 01 '20 at 15:13
  • Thanks. (The following is obvious to those who know regexps.) If, for instance, you have several possible separators, like `'.'` and `':'`, you need to use a regexp like `'[\.:]'`, etc. – DenisGL Jan 21 '21 at 20:38
  • Amazing, thank you very much, saved me from a real headache. – PaianuVlad23 Jul 28 '21 at 14:42
  • As mentioned in the question how can we use multiple delimiters with this method ? – Shahbaz Hashmi Apr 25 '22 at 19:11
  • Multiple delimiters can simply be encoded in a regular expression using character classes (for single character delimiters) or alternatives, e.g. `[,;]` or `,|;|\.\.` – Reimer Behrends Apr 29 '22 at 18:23

Splitting on single delimiter

Given your initial string:

123.456.789

And expected results (split on and including delimiters):

[123, ., 456, ., 789]

You can come up with the following regex:

(?!^|$)\b

Matches locations that match a word boundary, except for the start/end of the line.


Splitting on multiple delimiters

Now for your edit, given the following string:

123.456.789;abc;.xyz.;ABC

You'd like the expected results (split on and including multiple delimiters):

[123, ., 456, ., 789, ;, abc, ;, ., xyz, ., ;, ABC]

You can use the following regex (adapted from the first, with an added alternation). See regex sample here (I simulate split by using substitution with newline character for display purposes).

Either of the following works:

(?!^|$)\b|(?!\w)\B(?!\w)
(?!^|$)\b|(?=\W)\B(?=\W)

# the long way (with case-insensitive matching) - allows underscore _ as delimiter
(?!^|$)(?:(?<=[a-z\d])(?![a-z\d])|(?<![a-z\d])(?=[a-z\d])|(?<![a-z\d])(?![a-z\d]))

Matches locations that are a word boundary, except for the start/end of the line; or matches a location that is not a word boundary and is not followed by a word character, which in practice puts it between two adjacent non-word characters (such as between ; and .).

Note: The long version uses lookbehind, so it needs Dart 2.3.0 or later, where lookbehind support was added (see the 2.3.0 entry in the Dart SDK changelog). The two shorter regexes use only lookahead.
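
To try the first of the short regexes directly with String.split in Dart, something like the following sketch could be used. The function name `splitAtBoundaries` is a hypothetical wrapper introduced only for this illustration, and the regex was written against regex101 rather than Dart, so the printed list is worth checking against the expected results above.

```dart
// Regex copied from above: word boundaries except at the start/end,
// plus positions between two non-word characters.
final boundary = RegExp(r'(?!^|$)\b|(?!\w)\B(?!\w)');

// Hypothetical wrapper (an assumption for this sketch): every match of
// `boundary` is empty, so split() cuts the string without consuming the
// delimiters themselves.
List<String> splitAtBoundaries(String input) => input.split(boundary);

void main() {
  print(splitAtBoundaries('123.456.789'));
  print(splitAtBoundaries('123.456.789;abc;.xyz.;ABC'));
}
```

Note that this treats every run boundary between word and non-word characters as a split point, so it does not let you choose an arbitrary delimiter expression.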

ctwheels
  • I wanted to allow to split by any regular expression (determined by the user); the example with a `.` was just an example. It's not clear to me whether this allows for that. The look{ahead|behind} code I posted in the edit works for that, but the performance isn't clear to me. – Ozzin Jan 01 '20 at 15:17