1

I want a regex that extracts all the words between "WUB" separators, but I haven't found a solution. For example, from "WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB" it would extract

the following strings (without the quotes) ["WE", "ARE", "THE", "CHAMPIONS"]

Here's what I've tried so far:

`((?:.(?!WUB))+)`, but it gives me the following output (from the example above):

['WUBW', 'WUBAR', 'WU', 'WUBTH', 'WUBCHAMPION', 'WUBM', 'WUBFRIEN', 'WUB']

Please help me understand this problem better.

Amin Guermazi
  • 2
    The construct you used is a corrupted tempered greedy token. You hoped it would match any text other than a given sequence of characters, but no regex construct does that on its own. A tempered greedy token only matches a single character that does not start some sequence, repeated zero, one, or more times. Just split on `WUB` and remove the empty items. Or use `WUB(.*?)(?=WUB)` and take the Group 1 values. Or use `(?<=WUB).*?(?=WUB)`; see the [demo](https://regex101.com/r/2jkGHs/1) – Wiktor Stribiżew Apr 17 '20 at 21:10
  • A simple implementation of the approach described above: `$_ ne '' && push @result, $_ for split('WUB', $data);`. – Polar Bear Apr 18 '20 at 00:20
  • None of the answers to the linked question answered the OP's question, so I reopened the question. – ikegami Apr 18 '20 at 02:18
  • Please fix the tags to exclude the language you aren't using. (If you have a similar question for more than one language, post them as separate questions.) – ikegami Apr 18 '20 at 02:21
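
To make the first comment concrete, here is a minimal sketch in Python (an assumption: the output format in the question looks like a Python list, but the asker never names the language). It reproduces the output from the question and shows that the only difference between the attempted construct and a proper tempered greedy token is where the lookahead sits relative to the dot.

```python
import re

s = "WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB"

# The pattern from the question: the lookahead comes AFTER the dot, so it
# rejects a character that is FOLLOWED by WUB. Nothing stops the match from
# starting on a WUB, and the last letter of every word is dropped.
attempt = re.findall(r"((?:.(?!WUB))+)", s)
print(attempt)
# ['WUBW', 'WUBAR', 'WU', 'WUBTH', 'WUBCHAMPION', 'WUBM', 'WUBFRIEN', 'WUB']

# Same construct on a tiny input, anchored at the start with re.match.
wrong = re.match(r"(?:.(?!WUB))+", "WEWUB").group()  # lookahead after the dot
right = re.match(r"(?:(?!WUB).)+", "WEWUB").group()  # lookahead before the dot
print(wrong, right)  # W WE
```

With the lookahead first, each character is consumed only if `WUB` does not start at that position, which is the check the question needed. The token still has to be anchored to a preceding and following `WUB` to return only the words in between.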

3 Answers

3
$str =~ / WUB \K (?:(?!WUB).)+ (?=WUB) /sxg

or

$str =~ / (?<=WUB) (?:(?!WUB).)+ (?=WUB) /sxg    # Probably slower.

Starting after WUB, without actually including the WUB in the match (\K), find one or more characters that aren't the start of WUB. Make sure it's followed by WUB ((?=WUB)).
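
For readers who are not using Perl, the same idea can be sketched in Python. This is an illustration under assumptions, not part of the answer above: Python's `re` module has no `\K`, so the sketch uses the lookbehind form, and the helper name `words_between_wub` is made up for the example.

```python
import re

# (?<=WUB)       the match must be preceded by WUB (not included in the match)
# (?:(?!WUB).)+  one or more characters, none of which starts a WUB
# (?=WUB)        the match must be followed by WUB (not included either)
# re.DOTALL plays the role of Perl's /s flag: "." also matches a newline.
BETWEEN_WUB = re.compile(r"(?<=WUB)(?:(?!WUB).)+(?=WUB)", re.DOTALL)


def words_between_wub(text):
    """Return every non-empty run of text enclosed by WUB on both sides."""
    return BETWEEN_WUB.findall(text)


print(words_between_wub("WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB"))
# ['WE', 'ARE', 'THE', 'CHAMPIONS', 'MY', 'FRIEND']
print(words_between_wub("abcWUBdefWUBghi"))
# ['def']
```

Because the token needs at least one character, the empty string between two adjacent `WUB`s is never returned.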


If the string will always start and end with WUB, or if you don't mind getting the text before the first WUB and after the last WUB, the following is a lot clearer and surely faster:

grep length, split /WUB/, $str
ikegami
0

A simple regex without look-ahead/look-behind assertions is:

 /WUB((?:[^W]|W[^U]|WU[^B])+)/g

This assumes that the string being tested ends with WUB. If it doesn't, you either have to add a zero-width look-ahead assertion (?=WUB) to the end,

 /WUB((?:[^W]|W[^U]|WU[^B])+)(?=WUB)/g

or remove any characters after the last WUB before using the regex:

 s/WUB(?:[^W]|W[^U]|WU[^B])+$/WUB/


#! /usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $s = "WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB";

print Dumper ([$s =~ /WUB((?:[^W]|W[^U]|WU[^B])+)/g]);

prints out:

$VAR1 = [
          'WE',
          'ARE',
          'THE',
          'CHAMPIONS',
          'MY',
          'FRIEND'
        ];
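
One caveat seems worth checking before relying on the alternation: `W[^U]` consumes two characters at once, so a word that ends in `W` can cause the following `WUB` to be swallowed. The sketch below is in Python rather than Perl (an assumption about what the asker is using), and the second input string is a made-up test case, not one from the question.

```python
import re

# The alternation pattern from this answer and, for comparison, a
# lookahead-based tempered token anchored between two WUBs.
ALTERNATION = re.compile(r"WUB((?:[^W]|W[^U]|WU[^B])+)")
TEMPERED = re.compile(r"(?<=WUB)(?:(?!WUB).)+(?=WUB)")

# On the string from the question, both patterns agree.
sample = "WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB"
print(ALTERNATION.findall(sample))
# ['WE', 'ARE', 'THE', 'CHAMPIONS', 'MY', 'FRIEND']

# Hypothetical input with a word ending in W ("AW") right before a WUB.
# "W[^U]" matches the pair "WW", which eats the W that starts the next WUB.
tricky = "WUBAWWUBXWUB"
print(ALTERNATION.findall(tricky))  # ['AWWUBX']
print(TEMPERED.findall(tricky))     # ['AW', 'X']
```

If the words can never end in `W` or `WU`, the alternation behaves as shown in the answer.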
Georg Mavridis
  • You are right. I assumed the string was "well-formed" as in the example. If it isn't, the easiest solution is to add a zero-width look-ahead assertion (?=WUB) to the end, or to remove any characters after the last WUB before using the regex. – Georg Mavridis Apr 18 '20 at 13:01
0

Another way to do it, using split:

use feature 'say';
use Data::Dumper;

my $str = "WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB";

# grep is here to remove empty elements
my @list = grep length, split /WUB/, $str;
say Dumper \@list;

Output:

$VAR1 = [
          'WE',
          'ARE',
          'THE',
          'CHAMPIONS',
          'MY',
          'FRIEND'
        ];

Benchmark:

use Modern::Perl;
use Benchmark qw(:all);

my $str = "WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB";

my $count = -3;
cmpthese($count, {
    'match' => sub {
        my @list = $str =~  / WUB \K (?:(?!WUB).)+ (?=WUB) /sxg;
    },
    'split' => sub {
        my @list = grep length, split /WUB/, $str;
    },
});

Output:

          Rate match split
match  57806/s    --  -54%
split 126455/s  119%    --
Toto
  • This fails for `abcWUBdefWUBghi`, returning text that is not between `WUB`s. (I would have suggested `grep length, split /WUB/, $str` if the OP hadn't explicitly specified they wanted text between `WUB`) – ikegami Apr 18 '20 at 11:38
  • To put the numbers into perspective, the match approach is 1/55486 s - 1/99924 s = 8 microseconds slower than the split approach. – ikegami Apr 18 '20 at 11:48
  • If you use split you can simply remove the first and the last entry of the results. – Georg Mavridis Apr 18 '20 at 13:22
  • @GeorgMavridis: That's what I've done, but it didn't work if the string doesn't begin with `WUB`. – Toto Apr 18 '20 at 13:27
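
The "drop the first and last entry" idea from the comments can be sketched as follows. This is a Python illustration, not the code either commenter ran, and the function name `between_wub` is hypothetical. It relies on Python's `str.split` keeping empty pieces, so the first piece is always the text before the first `WUB` and the last piece is always the text after the last one, whether or not they are empty.

```python
def between_wub(text):
    """Return the non-empty pieces that have a WUB on both sides."""
    pieces = text.split("WUB")
    # Drop the outer pieces BEFORE filtering out the empty strings, so the
    # slice removes the same two positions for every input.
    inner = pieces[1:-1]
    return [piece for piece in inner if piece]


print(between_wub("WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB"))
# ['WE', 'ARE', 'THE', 'CHAMPIONS', 'MY', 'FRIEND']
print(between_wub("abcWUBdefWUBghi"))
# ['def']
```

The order of the two steps matters: filtering out the empty strings first would make the slice remove real words whenever the string starts or ends with `WUB`.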