3

I'm trying to use a regex to match every comma followed by a space (`, `) that is outside any parentheses or square brackets, i.e. the comma must not be enclosed in parentheses or square brackets.

The target string is `A, An(hi, world[hello, (hi , world) world]); This, These`. In this case, it should match the first comma and the last comma (the ones between A and An, and between This and These).

That way I could split the string into `A`, `An(hi, world[hello, (hi , world) world]); This` and `These`, without leaving parens/brackets unbalanced as a result.

To that end, it seems hard to use regex alone. Is there any other approach to this problem?

The regex expression I'm using: , (?![^()\[\]]*[\)\]])

But this expression also matches two extra commas (the second and the third), which shouldn't be matched.

Though if it is matching against the following strings, it'll match the right comma (the first one respectively): A, An(hi, world) and A, An[hi, world]

But when parentheses and brackets are nested inside each other, it runs into problems.

More details in this link: https://regex101.com/r/g8DOh6/1
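
A short script along these lines reproduces the behaviour described above. It is only a sketch: the helper name `lookahead_comma_offsets` is made up for illustration, and the pattern is the one quoted in the question.

```perl
use strict;
use warnings;

my $str = 'A, An(hi, world[hello, (hi , world) world]); This, These';

# Illustrative helper: collect the offset of every ", " that the
# lookahead-based pattern from the question accepts.
sub lookahead_comma_offsets {
    my ($text) = @_;
    my @offsets;
    while ($text =~ /, (?![^()\[\]]*[\)\]])/g) {
        push @offsets, $-[0];    # start offset of the current match
    }
    return @offsets;
}

print join(' ', lookahead_comma_offsets($str)), "\n";    # prints "1 8 21 49"
```

Offsets 1 and 49 are the two wanted commas; 8 and 21 are the second and third commas, which sit inside the parentheses. The lookahead only inspects the text up to the next bracket of any kind, so an opening bracket ahead of the comma hides the closing one further on.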

jonah_w
  • Must it be regex alone for it? With `Text::Balanced` (for example) one can extract balanced parens/brackets and the rest, and then pick commas out of "the rest." – zdim Sep 19 '21 at 02:43
  • @zdim I've updated the post. Not necessarily regex alone. Anything will do to solve the problem. – jonah_w Sep 19 '21 at 02:54
  • OK, thank you! So ... what do you want the final result to be? Words right before _those_commas (without the commas)? Please see my answer and let me know (I'll edit more) -- it solves the problem but I don't know what the actual _result_ should be! – zdim Sep 19 '21 at 03:27
  • The aim is to split the target string with the comma on the outside, say the target string is `B, C, hello(D,) world`, the expected output is `B` `C` `hello(D,) world` – jonah_w Sep 19 '21 at 04:06
  • So I could go on the last step: turn the `hello(D,) world` into `hello world`. This post is not about this final step, though. It's more the preparation for the last step. – jonah_w Sep 19 '21 at 04:07
  • Alright, so strip these particular commas, outside of top-level `(...)`, `[...]` pairs. Thanks. – zdim Sep 19 '21 at 04:18
  • Btw ... it's easier to get to that "final step" with this tool (`Regexp::Common`), since it matches exactly those `(...)`. Do you want that then? That's what I have right now in my answer... – zdim Sep 19 '21 at 04:21

3 Answers

4

The problem here is identifying "balanced" pairs, of parentheses/brackets in this case. This is a well-recognized problem, for which there are libraries. They can find the top-level matching pairs, (...)/[...], with all that's inside them, and everything else outside of them -- then one processes the "else."

One way, using Regexp::Common

use warnings;
use strict;
use feature 'say';

use Regexp::Common;

my $str = shift // q{A, t(a,b(c,))u B, C, p(d,)q D,}; 

my @all_parts = split /$RE{balanced}{-parens=>'()[]'}/, $str;

my @no_paren_parts = grep { not /\(.*\) | \[.*\]/x } @all_parts;

say for @no_paren_parts;

This relies on split's property of returning the separators as well when the separator pattern captures. The library regex captures, so we get it all back -- the parts obtained by splitting the string on what the regex matched, and also the parts matched by the regex. The separators contain the paired delimiters while the other terms cannot, by construction, so I filter the separators out on that basis. Prints

A, t
u B, C, p
q D,

The paren/bracket terms are gone, but how the string is split is otherwise a bit arbitrary.

The above is somewhat "generic," using the library merely to extract the balanced pairs ()/[], along with all other parts of the string. Or, we can remove those patterns from the string

$str =~ s/$RE{balanced}{-parens=>'()[]'}//g;

which leaves

A, tu B, C, pq D,

Now one can simply split by commas

my @terms = split /\s*,\s*/, $str;
say for @terms;

which prints

A
tu B
C
pq D

This is the desired result in this case, as clarified in comments.
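
If installing Regexp::Common is not an option, the same remove-then-split idea can be sketched with a recursive pattern in core Perl (5.10 or later). This is an assumption-laden stand-in, not the library's pattern: the helper name `outer_terms` is hypothetical, and the regex assumes the parentheses and brackets in the input are properly nested.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every top-level (...) / [...] group,
# then split what is left on commas.
sub outer_terms {
    my ($str) = @_;

    # Group 1 matches one balanced (...) or [...] group;
    # (?1) recurses into group 1 to consume nested groups.
    $str =~ s/(\((?:[^()]++|(?1))*\)|\[(?:[^\[\]]++|(?1))*\])//g;

    return split /\s*,\s*/, $str;
}

print "$_\n" for outer_terms('A, t(a,b(c,))u B, C, p(d,)q D,');
```

On the sample string this should print the same four terms as above: `A`, `tu B`, `C`, `pq D`.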

Another notable library, in many ways more fundamental, is the core Text::Balanced. See Shawn's answer here, and for example this post and this one and this one for examples.


An example. With

my $str = q(it, is; surely);

my @terms = split /[,;]/, $str;

one gets it is surely in the array @terms, while with

my @terms = split /([,;])/, $str;

we get in @terms all of: it , is ; surely


Also by construction, the array holds what the regex matched at odd indices (assuming the pattern yields a single capture per match). So for all the other parts we can fetch the elements at even indices

my @other_than_matched_parts = @all_parts[ grep { not $_ & 1 } 0..$#all_parts ];
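
A quick way to check the index parity is a self-contained sketch like the following, using a simple one-capture separator pattern rather than the library regex:

```perl
use strict;
use warnings;

# With a capturing separator pattern, split interleaves fields and separators:
# ('it', ',', 'is', ';', 'surely')
my @parts = split /([,;])/, 'it,is;surely';

my @fields = @parts[ grep { $_ % 2 == 0 } 0 .. $#parts ];    # even indices
my @seps   = @parts[ grep { $_ % 2 == 1 } 0 .. $#parts ];    # odd indices

print "fields: @fields\n";        # prints "fields: it is surely"
print "separators: @seps\n";      # prints "separators: , ;"
```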
zdim
  • Thanks for your answer. One question though, say the target string is `A, t(a,b(c,)) B, C, u(d,) D,`, the result from the answer would be `A, t` `B, C, u` `D,` where `u` and `D` are separated. Can they form as one? the `u` and `D`? As very often in the dictionary data, phrases like `go about, go (a)round,` is very common. I want to extract them as `go about` and `go round`. – jonah_w Sep 19 '21 at 04:44
  • As for `A, t(a,b(c,)) B, C, u(d,) D,`, the output should be `A, t` `B, C, ` `u D,`. – jonah_w Sep 19 '21 at 04:47
  • @jonah_w Alright, get it now -- the commas stay after all, just remove `(...)`. So -- the output should be an array, like in the first part of the answer? (Not one string like in the second part?) – zdim Sep 19 '21 at 04:48
  • @jonah_w so ... the output in the first part of my answer is exactly what you need then, no? The array with elements: `A, t` and `B, C,` and `u D,` (copy-pasted from the answer, the first part) – zdim Sep 19 '21 at 04:52
  • `u D` is correct when the target string is `[d,]u D,`. But very often the target string could be `u[d,] D,` with `u` in front of `[d,]`, in this case, the result would be wrong. – jonah_w Sep 19 '21 at 04:55
  • @jonah_w Ah, I see what you mean (and now I see why your example in these comments has `u` on the "other" side of parens :) ... let me look ... – zdim Sep 19 '21 at 05:03
  • @jonah_w What is the rationale for `B, C, u(d,) D,` giving `B, C,` and `u D,` -- and not `B, C, u D,` (Or `B, C, u` and `D,` or some other variation). How would you state it? (If it can be clearly formulated then it'll be simple to code it -- the key thing here was to identify those `(...)` and the `$RE...` does that :) – zdim Sep 19 '21 at 05:03
  • @jonah_w I edited the answer's sample, to `A, t(a,b(c,))u B, C, p(d,)q D,` -- so it's easier to explain what the answer should be, and why/how – zdim Sep 19 '21 at 05:10
  • By splitting the target string `B, C, u(d,) D,` with comma outside of `(…)` we can get the expected result: `B` `C` `u(d,) D` And `u(d,) D` in the final step will be further extracted as `u D` – jonah_w Sep 19 '21 at 05:15
  • @jonah_w Can splitting by comma do it -- like so: `go about, go (a)round` --> `go about, go round` (remove `(...)` like in the second part of the answer), then split by comma for `go about` and `go round` (Can there be embedded commas in parts of the expression?) – zdim Sep 19 '21 at 05:16
  • @jonah_w OK, I see. (Posted my previous comment before seeing your last one) – zdim Sep 19 '21 at 05:19
  • @jonah_w Added to the end of the second part. Let me know whether there are any other cases which wouldn't be covered by that approach. (Remove `(...)` then split on commas.) One distinct question is whether there can be embedded commas, so that `C,` is in fact `"ah, no",` or some such... That can be adjusted for, before splitting by commas. It's a different problem then. – zdim Sep 19 '21 at 05:25
  • @jonah_w Note that you can feed input into the program above, `prog.pl "....."` (that's what that `shift // ...` in the beginning does -- you do need quotes around the input string if it contains spaces). So when I run it on your example string from the question the final output (from the second part) is: `A`, `An; This`, `These` – zdim Sep 19 '21 at 05:38
2

Checking whether a comma , is within brackets/parentheses, e.g.

[(,),],[abc,(def,[ghi,],),],[(,),]
      ^                    ^

means that the pattern must be aware of exactly when each of those brackets/parentheses was opened and closed in a balanced way, so not just e.g. [([] because it should be [([])].
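
To make that bookkeeping concrete, here is a minimal non-regex sketch that walks the string character by character and keeps a stack of the closing delimiters it still expects. The helper name `split_top_level` is hypothetical, and stray closing brackets are simply treated as ordinary characters.

```perl
use strict;
use warnings;

# Hypothetical helper: split on commas that sit outside every (...) and [...].
sub split_top_level {
    my ($str) = @_;
    my %closer = ('(' => ')', '[' => ']');
    my (@stack, @parts);
    my $current = '';

    for my $ch (split //, $str) {
        if (exists $closer{$ch}) {
            push @stack, $closer{$ch};       # remember which closer we now expect
        }
        elsif (@stack && $ch eq $stack[-1]) {
            pop @stack;                      # the innermost open group is closed
        }
        elsif ($ch eq ',' && !@stack) {
            push @parts, $current;           # top-level comma: end of a part
            $current = '';
            next;
        }
        $current .= $ch;
    }
    push @parts, $current;

    s/^\s+|\s+$//g for @parts;               # trim surrounding whitespace
    return @parts;
}

print "$_\n" for split_top_level('B, C, hello(D,) world');
```

For `B, C, hello(D,) world` this should give `B`, `C` and `hello(D,) world`, since the comma after `D` is seen while the stack is non-empty.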

Here is an alternative solution that doesn't solve your problem directly but might be a step closer.

  1. Match either of the following:

    a. Comma

    b. A group enclosed in an outer [] or (). See Regular expression to match balanced parentheses

  2. Filter out 1.b

Regex pattern:

(?:\((?>[^()]|(?R))*\)|\[(?>[^\[\]]|(?R))*\]|,)


For this string, the matches are as pointed out:

A, An(hi, world[hello, (hi , world) world]) and this, is that, for [the, one (in, here, [is not,])] and last,here!
 ^   ^------------------------------------^         ^        ^     ^------------------------------^         ^
  • So it didn't capture any commas inside any of those bracket/parenthesis groups as it captured them as a whole. Now, you have the commas at the outer level.
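
The match-then-filter steps can be sketched in Perl roughly as follows. The pattern is the one given above; the helper name `split_on_outer_commas` and the trimming of whitespace are assumptions added for illustration.

```perl
use strict;
use warnings;

# Matches a whole (...) group, a whole [...] group, or a single comma.
my $token = qr/(?:\((?>[^()]|(?R))*\)|\[(?>[^\[\]]|(?R))*\]|,)/;

# Hypothetical helper: walk over all matches, ignore the bracket/parenthesis
# groups (step 2), and cut the string at the remaining commas.
sub split_on_outer_commas {
    my ($str) = @_;
    my @parts;
    my $start = 0;

    while ($str =~ /$token/g) {
        my ($from, $to) = ($-[0], $+[0]);
        next unless substr($str, $from, $to - $from) eq ',';    # skip groups
        push @parts, substr($str, $start, $from - $start);
        $start = $to;
    }
    push @parts, substr($str, $start);

    s/^\s+|\s+$//g for @parts;    # trim surrounding whitespace
    return @parts;
}

print "$_\n" for split_on_outer_commas('A, An(hi, world[hello, (hi , world) world]); This, These');
```

On the question's string this should yield `A`, `An(hi, world[hello, (hi , world) world]); This` and `These`. If the input contains an unclosed bracket, the group alternative fails at that position and the commas after it are treated as outer ones, so the sketch is only reliable on balanced input.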
1

zdim mentioned that one approach is to use the core Text::Balanced module. Demonstration:

#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;
use Text::Balanced qw/extract_bracketed/;

my $str = "A, An(hi, world[hello, (hi , world) world]); This, These";
my ($inside, $after, $before) = extract_bracketed $str, '()[]', qr/[^([]*/;

my @tokens = (split(/,/, $before//""), $inside, split(/,/, $after//""));

# Displays
# A  An (hi, world[hello, (hi , world) world]) ; This  These
say join(' ', @tokens);
Shawn