1

I might have obfuscated the real problem in my previous question, so here it is:

Let's suppose that you want to know if the characters from 7 to 15 are all 1 in a string:

my $str = "011000001111111111000011111110110111110000101010111";
my $i = 7, $j = 15;  # => start and end positions of the substring
my $l = $j - $i + 1; # => length of the substrings 

I have thought of a few reasonable ways for doing the check

# 1. with `substr` and a regex:
printf "#1: %d\n", substr($str,$i,$l) =~ /\A1*\Z/;

# 2. with a regex from the beginning of the string:
printf "#2: %d\n", $str =~ /\A.{$i}1{$l}/;

# 3. with a regex and `pos`:
pos($str) = $i;
printf "#3: %d\n", $str =~ /1{$l}/;

I expect all results to be 0 (false) but I get:

#1: 0
#2: 0
#3: 1

What would be the correct regex when using pos in #3?

Fravadona
  • 13,917
  • 1
  • 23
  • 35
  • @TedLyngmo they do; I'm struggling with `#3`, which IMO would be the best solution because the strings are a few hundred MB long and the parts to check are of unknown size at unknown offsets. – Fravadona Aug 13 '22 at 19:21
  • 1
    You need `\G` to anchor the match, and I think you need `/g` for pos to be used – ikegami Aug 13 '22 at 20:13
  • Thank @ikegami, anchoring it with \G fixes the issue; `/g` isn't needed. Can you post in answer? I didn't find much content on how to use `pos()` in the web – Fravadona Aug 13 '22 at 23:40
  • "_parts to check are of unknown size at unknown offsets_" -- what does that mean? (How can one check a part at an unknown position?) If it means that positions are given dynamically then that isn't limiting at all. Also, it was worth stating that strings to check are Mb in size (even though these days that isn't so critical). – zdim Aug 13 '22 at 23:45
  • @zdim I didn't phrase it well, indeed; as you guessed, the start and end positions are determined at runtime, so having the regex engine offset to the right position seems like a good approach, IMO – Fravadona Aug 14 '22 at 00:15
  • 1
    @Fravadona OK. It makes sense of course, and that is easy: just set `pos` to that and then the regex starts from there (if it's with `/g`). – zdim Aug 14 '22 at 00:49

4 Answers4

3

You can do this with a simple regex.

/\A\d{6}1{9}/

Expanded to explain...

m{
  \A     # the beginning of the string
  \d{6}  # any six digits (digits 1 to 6)
  1{9}   # 9 1's (digits 7 to 15)
         # and we don't care about the rest
}x

The usual way to do this is with a bitmask.

# Convert the string to binary.
my $str = "011000001111111111000011111110110111110000101010111";
my $num = oct("0b$str");

# This is your bitmask.
my $mask = 0b000000001111111111000000000000000000000000000000000;

# AND each bit together. If the result equals the mask, it matches.
print "Match" if ($num & $mask) == $mask;

The advantage is this is probably faster and easier to understand (if you know what a bitmask is).

The disadvantage is this will only work up to 64 digits, assuming your Perl is compiled for 64 bit, which it should be in 2022, and only for 1s and 0s.

Schwern
  • 153,029
  • 25
  • 195
  • 336
  • That you for the answer; the bitmask solution would be great but the substrings are of unknown size, and probably around a few MB. – Fravadona Aug 13 '22 at 19:06
3

Can walk through the string with regex

while (/(.)/g) { say "At pos ", pos, " is: $1" }

If the string is rather large and you'd rather start from a position further down the string, as now stated in comments, then first issue pos = $start_pos; and then run the above regex.

Or get chars at all positions and check with an array of the ones you care about

my @positions = 7 .. 15;

my @chars = split '';

for my $pos (@positions) { say "At $pos have: @chars[$pos-1]" }

But I'd rather prepare and use a bitmask as in Schwern's answer, if it's 1/0.

zdim
  • 64,580
  • 5
  • 52
  • 81
1

Two problems.

  • When you don't use /g, the regex engine starts matching at position zero. Using /g will heed the position you set.

  • When the regex engine fails to find a match at a given position, it will try to find a match at later start positions. To prevent this, you will need to anchor the match using \G.

$ perl -Mv5.14 -e'
   $_ = "xyx";
   pos = 1;
   say scalar( /\Gx/g ) ? "match at $-[0]" : "[no match]";
'
[no match]

$ perl -Mv5.14 -e'
   $_ = "xyx";
   pos = 2;
   say scalar( /\Gx/g ) ? "match at $-[0]" : "[no match]";
'
match at 2
ikegami
  • 367,544
  • 15
  • 269
  • 518
  • is the `/g` mandatory when you have `\G` at the start of the regex? It seems to work fine without it. – Fravadona Aug 14 '22 at 00:20
  • 1
    Well, thanks to the optimizer, it usually does what you want anyway. But, in theory, it could do a *lot* of extra work since it always starts at position 0 without `/g` – ikegami Aug 14 '22 at 02:08
1

The most efficient code is based on bitmap mask described in previous answer.

In this code sample demonstrates two other ways to make described comparison but they are not efficient in comparison with bit mask method.

use strict;
use warnings;
use feature 'say';

my($sum,$str,$start,$end,$length,@chars);

$sum = 0;
#       12345678901234567890
$str = '01100000111111111100001111111011';
($start,$end) = (7,15);

$length = 1+$end-$start;

say "** Method #1";
@chars  = (split('', $str))[$start-1..$end-1];
$sum   += $_ for @chars;

say $sum == $length 
    ? 'All good' 
    : 'No dice';
    
say "** Method #2";
say $str =~ /^.{6}1{$length}/
    ? 'All good'
    : 'No dice';
    
say "
But nothing can beat binary comparison given in 
previous answer
";

Output

** Method #1
No dice
** Method #2
No dice

But nothing can beat binary mask given in
previous answer
Polar Bear
  • 6,762
  • 1
  • 5
  • 12