
I want to extract the row key (here 28_2820201112122420516_000000), the column name (here bcp_startSoc), and the value (here 64.0) from $str, where $str is a row from HBase:

# `match` is OK
my $str = '28_2820201112122420516_000000 column=d:bcp_startSoc, timestamp=1605155065124, value=64.0';
my $match = $str.match(/^ ([\d+]+ % '_') \s 'column=d:' (\w+) ',' \s timestamp '=' \d+ ',' \s 'value=' (<-[=]>+) $/);
my @match-result = $match».Str.Slip;
say @match-result;   # Output: [28_2820201112122420516_000000 bcp_startSoc 64.0]

# `smartmatch` is OK
# $str ~~ /^ ([\d+]+ % '_') \s 'column=d:' (\w+) ',' \s timestamp '=' \d+ ',' \s 'value=' (<-[=]>+) $/
# say $/».Str.Array; # Output: [28_2820201112122420516_000000 bcp_startSoc 64.0]

# `comb` is NOT OK
# A <( token indicates the start of the match's overall capture, while the corresponding )> token indicates its endpoint. 
# The <( is similar to \K in other languages: it discards anything matched before it.
my @comb-result = $str.comb(/<( [\d+]+ % '_' )> \s 'column=d:' <(\w+)> ',' \s timestamp '=' \d+ ',' \s 'value=' <(<-[=]>+)>/);
say @comb-result;    # Expect: [28_2820201112122420516_000000 bcp_startSoc 64.0], but got [64.0]

I want comb to skip some parts of each match and return just the parts I want, so I used multiple <( and )> here, but I only get the last one as the result.

Is it possible to use comb to get the same result as the match method?

Elizabeth Mattijsen
chenyf

3 Answers


TL;DR Multiple <(...)>s don't mean multiple captures. Even if they did, .comb reduces each match to a single string in the list of strings it returns. If you really want to use .comb, one way is to go back to your original regex but also store the desired data using additional code inside the regex.

Multiple <(...)>s don't mean multiple captures

The default start point for the overall match of a regex is the start of the regex. The default end point is the end.

Writing <( resets the start point for the overall match to the position you insert it at. Each time you insert one and it gets applied during processing of a regex it resets the start point. Likewise )> resets the end point. At the end of processing a regex the final settings for the start and end are applied in constructing the final overall match.

Given that your code just unconditionally resets each point three times, the last start and end resets "win".
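To see the "last reset wins" behaviour in isolation, here is a minimal sketch on a made-up string (the names $sample and $m are only for illustration). Both pairs of markers are applied, but only the final start and end positions survive:

```raku
my $sample = 'abc123def';

# Two <( ... )> pairs: the second pair overrides the first.
my $m = $sample.match(/ <( <[a..c]>+ )> \d+ <( <[d..f]>+ )> /);

say $m.Str;    # def
say $m.from;   # 6
say $m.to;     # 9
```

The letters abc are still required for the regex to match; they just end up outside the reported overall match.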

.comb reduces each match to a single string

foo.comb(/.../) is equivalent to foo.match(:g, /.../)>>.Str.

That means you only get one string for each match against the regex.
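As a small illustration of that equivalence (the string and variable names are invented for this sketch), both calls below yield one string per match, using a single <( ... )> pair to trim each match:

```raku
my $text = 'a=1, b=2, c=3';

# One string per match, either way.
my @via-comb  = $text.comb(/ \w '=' <( \d )> /);
my @via-match = $text.match(/ \w '=' <( \d )> /, :g)».Str;

say @via-comb;    # [1 2 3]
say @via-match;   # [1 2 3]
```

Any positional captures inside the regex are dropped by the stringification, which is why .comb cannot return three pieces from a single match.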

One possible solution is to use the approach @ohmycloudy shows in their answer.

But that comes with the caveats raised by myself and @jubilatious1 in comments on their answer.

Add { @comb-result .push: |$/».Str } to the regex

You can work around .comb's normal functioning. I'm not saying it's a good thing to do. Nor am I saying it's not. You asked, I'm answering, and that's it. :)

Start with your original regex that worked with your other solutions.

Then add { @comb-result .push: |$/».Str } to the end of the regex to store the result of each match. Now you will get the result you want.
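Put together, the steps above might look like the following sketch. Assigning the return value to an array (here called @whole-matches, a name chosen only for this example) is an assumption made to ensure the lazy result of .comb is actually consumed; the strings .comb itself returns are the whole matched lines and are not used further.

```raku
my $str = '28_2820201112122420516_000000 column=d:bcp_startSoc, timestamp=1605155065124, value=64.0';
my @comb-result;

# Original capturing regex, plus a code block that stores the
# positional captures as a side effect each time a match completes.
my @whole-matches = $str.comb(/
    ^ ([\d+]+ % '_') \s 'column=d:' (\w+) ',' \s
    timestamp '=' \d+ ',' \s 'value=' (<-[=]>+) $
    { @comb-result.push: |$/».Str }
/);

say @comb-result;
```

After the call, @comb-result should hold the row key, the column name, and the value as three separate strings.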

raiph
$str.comb( /  ^ [\d+]+ % '_' | <?after d\:> \w+  | <?after value\=> .*/ )
ohmycloudy
  • Could you please explain? – jubilatious1 Nov 19 '20 at 19:32
  • Hi @jubilatious1. Perhaps you just mean to encourage @ohmycloudy to expand their answer to improve it? Either way... `comb` combs through a string, seeking to keep bits of it. This is perhaps why it seemed an appropriate tool to use. But multiple `<(` or `)>` won't do what's wanted. (See comment on the Q.) What you *can* do is comb for one of several patterns separated by `|`, using zero-width assertions to avoid capturing the unwanted bits, and not worrying about how they are ordered relative to each other because the patterns are specific enough to correctly extract the desired data. – raiph Nov 19 '20 at 22:26
  • Hi @raiph, I guess I'm just much more likely to test out some code when I see a bit of explanation, before or afterwards. The answer by @ohmycloudy works, but using `|` means fewer than the three desired values could be returned per line. This is in addition to the issue you raise, that the patterns could be found in any order in the line. Hence my reasoning to `split()` the string, throw it into an array, and match on individual elements of the array. – jubilatious1 Nov 19 '20 at 23:00

Since you have a comma-separated 'row' of information you're examining, you could try using split() to break your matches up, and assign to an array. Below in the Raku REPL:

> my $str = '28_2820201112122420516_000000 column=d:bcp_startSoc, timestamp=1605155065124, value=64.0';
28_2820201112122420516_000000 column=d:bcp_startSoc, timestamp=1605155065124, value=64.0
> my @array = $str.split(", ")
[28_2820201112122420516_000000 column=d:bcp_startSoc timestamp=1605155065124 value=64.0]
> dd @array
Array @array = ["28_2820201112122420516_000000 column=d:bcp_startSoc", "timestamp=1605155065124", "value=64.0"]
Nil
> say @array.elems
3

Match on individual elements of the array:

> say @array[0] ~~ m/ ([\d+]+ % '_') \s 'column=d:' (\w+) /;
「28_2820201112122420516_000000 column=d:bcp_startSoc」
 0 => 「28_2820201112122420516_000000」
 1 => 「bcp_startSoc」
> say @array[0] ~~ m/ ([\d+]+ % '_') \s 'column=d:' <(\w+)> /;
「bcp_startSoc」
 0 => 「28_2820201112122420516_000000」
> say @array[0] ~~ m/ [\d+]+ % '_'  \s 'column=d:' <(\w+)> /;
「bcp_startSoc」

Boolean tests on matches to one-or-more array elements:

> say True if ( @array[0] ~~ m/ [\d+]+ % '_'  \s 'column=d:' <(\w+)> /)
True
> say True if ( @array[2] ~~ m/ 'value=' <(<-[=]>+)> / )
True
> say True if ( @array[0] ~~ m/ [\d+]+ % '_'  \s 'column=d:' <(\w+)> /) & ( @array[2] ~~ m/ 'value=' <(<-[=]>+)> / )
True
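Outside the REPL, the same split-then-match idea could be collected into a short script. This is only a sketch under the assumption that every row has exactly the three comma-separated fields shown above; the variable names ($head, $tail, $row-key, $column, $value) are made up for illustration.

```raku
my $str = '28_2820201112122420516_000000 column=d:bcp_startSoc, timestamp=1605155065124, value=64.0';

# Split into three fields; the middle (timestamp) field is ignored.
my ($head, $, $tail) = $str.split(', ');

# The first field carries the row key and the column name.
my ($row-key, $column) = ($head ~~ / ([\d+]+ % '_') \s 'column=d:' (\w+) /).list».Str;

# The last field carries the value.
my $value = ($tail ~~ / 'value=' <( <-[=]>+ )> /).Str;

say [$row-key, $column, $value];   # [28_2820201112122420516_000000 bcp_startSoc 64.0]
```

A row that does not have this shape would make one of the matches fail, so real input would need a check before the captures are used.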

HTH.

jubilatious1
  • Thanks for your answer, it works very well. I just wanted to explore the limits of the `comb` method, and found that using `<( )>` that way was a mistake, as @raiph explained. – chenyf Nov 20 '20 at 05:22