Perl regular expression to split string by word

Question

I have a string that consists of several words run together, where each word starts with a capital letter.

For example:

$string1="TestWater"; # to be split into an array @string1=("Test","Water")
$string2="TodayIsNiceDay"; # as @string2=("Today","Is","Nice","Day")
$string3="EODIsAlwaysGood"; # as @string3=("EOD","Is","Always","Good")

I know that Perl's split function makes it easy to split on a fixed character, and that a regex match can capture a fixed number of parts into $1, $2, and so on. But how can this be done dynamically, for any number of words? Thanks in advance!

The post "Splitting CamelCase" doesn't answer my question: mine is about regexes in Perl, while that one is about Java, and the two differ.

Possible duplicate of [RegEx to split camelCase or TitleCase (advanced)](https://stackoverflow.com/questions/7593969/regex-to-split-camelcase-or-titlecase-advanced) — Joe, Jul 14 '17 at 12:22
Not duplicate, there are differences between java regex and perl regex. But thanks for checking! — dellair, Jul 14 '17 at 12:40

score 7 · Answer 1 · answered Jul 14 '17 at 12:31

Use split to split a string on a regex. The boundary you want is the position just before an uppercase character that is not itself followed by another uppercase character. This can be expressed with two look-ahead assertions, a negative one nested inside a positive one (see perlre for details):

#!/usr/bin/perl
use warnings;
use strict;

use Test::More;

sub split_on_capital {
    my ($string) = @_;
    return [ split /(?=[[:upper:]](?![[:upper:]]))/, $string ]
}

is_deeply split_on_capital('TestWater'),       [ 'Test', 'Water' ];
is_deeply split_on_capital('TodayIsNiceDay'),  [ 'Today', 'Is', 'Nice', 'Day' ];
is_deeply split_on_capital('EODIsAlwaysGood'), [ 'EOD', 'Is', 'Always', 'Good' ];

done_testing();

choroba

Why doesn't that produce empty leading elements for `TestWater` and `TodayIsNiceDay`? – melpomene Jul 14 '17 at 12:35

@melpomene: Documented in [split](http://p3rl.org/split): zero-width match at the beginning of EXPR never produces an empty field – choroba Jul 14 '17 at 12:37
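
The behaviour described in this comment can be checked with a small sketch. The helper names `split_zero_width` and `split_on_comma` and the comma-separated sample string are made up for illustration; only `split` itself comes from the thread.

```perl
use strict;
use warnings;

# Zero-width match (a look-ahead) at position 0: no empty leading field.
sub split_zero_width { return split /(?=[[:upper:]])/, $_[0] }

# Positive-width match (a literal comma) at position 0: empty leading field.
sub split_on_comma { return split /,/, $_[0] }

my @zero  = split_zero_width('TestWater');    # ('Test', 'Water')
my @comma = split_on_comma(',Test,Water');    # ('', 'Test', 'Water')

printf "%d fields vs. %d fields\n", scalar @zero, scalar @comma;
# prints: 2 fields vs. 3 fields
```

The look-ahead matches before the leading `T` without consuming anything, so `split` produces no empty first element, whereas the leading comma has width 1 and does produce one.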

score 4 · Accepted Answer · answered Jul 14 '17 at 12:33

You can do this by using m//g in list context, which returns a list of all matches found. (Rule of thumb: Use m//g if you know what you want to extract; use split if you know what you want to throw away.)
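
As a minimal sketch of that rule of thumb, here are both approaches on a simpler input. The sample string and the helper names `words_by_match` and `words_by_split` are assumptions for illustration, not part of the question.

```perl
use strict;
use warnings;

# You know what to extract (runs of word characters): m//g in list context.
sub words_by_match {
    my ($line) = @_;
    return $line =~ /\w+/g;
}

# You know what to throw away (a comma plus optional whitespace): split.
sub words_by_split {
    my ($line) = @_;
    return split /,\s*/, $line;
}

my $line = 'alpha, beta,gamma';    # hypothetical sample input
print join('|', words_by_match($line)), "\n";    # alpha|beta|gamma
print join('|', words_by_split($line)), "\n";    # alpha|beta|gamma
```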

Your case is a bit more complicated because you want to split "EODIs" into ("EOD", "Is").

The following code handles this case:

my @words = $string =~ /\p{Lu}(?:\p{Lu}+(?!\p{Ll})|\p{Ll}*)/g;

I.e. every word starts with an uppercase letter (\p{Lu}) and is followed by either

- 1 or more uppercase letters (but the last one is not followed by a lowercase letter), or
- 0 or more lowercase letters (\p{Ll})
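
To see the two alternatives at work, the regex can be wrapped in a small sub and run against the three strings from the question. The name `camel_words` is a hypothetical helper, not something from the answer.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the list of words found by the regex above.
sub camel_words {
    my ($string) = @_;
    # In list context, m//g without capture groups returns every match.
    return $string =~ /\p{Lu}(?:\p{Lu}+(?!\p{Ll})|\p{Ll}*)/g;
}

for my $s (qw(TestWater TodayIsNiceDay EODIsAlwaysGood)) {
    print join(',', camel_words($s)), "\n";
}
# Test,Water
# Today,Is,Nice,Day
# EOD,Is,Always,Good
```

For "EODIs", the first alternative initially grabs "ODI" after the "E", fails because "s" follows, and backtracks to "OD", which gives "EOD". One caveat: since every match must start with an uppercase letter, a leading lowercase run (as in "testWater") would be skipped rather than returned.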

I really like your answer, very neat. Thank you very much! – dellair Jul 14 '17 at 12:38
