4

I have a string I need to parse. It meets the following requirements:

  • It is composed of 0 or more key->value pairs.
  • The key is always 2 letters.
  • The value is one or more digits.
  • There will not be a space between the key and value.
  • There may or may not be a space between individual pairs.

Example strings I may see:

  • AB1234 //One key->value pair (Key=AB, Value=1234)
  • AB1234 BC2345 //Two key->value pairs, separated by space
  • AB1234BC2345 //Two key->value pairs, not separated by space
  • //Empty String, No key->value pairs
  • AB12345601BC1234CD1232PE2343 //Lots of key->value pairs, no space
  • AB12345601 BC1234 CD1232 PE2343 //Lots of key->value pairs, with spaces

I need to build a Perl hash from this string. If I could guarantee it contained exactly 1 pair, I would do something like this:

$string =~ /([A-Z][A-Z])([0-9]+)/;
$key = $1;
$value = $2;
$hash{$key} = $value;

For multiple pairs, I could repeat the match: after each match of the above regex, take the substring of the original string that follows the match and search again. However, I'm sure there's a more clever, Perl-esque way to achieve this.

Wishing I didn't have such a crappy data source to deal with-

Jonathan

  • See also [How can I store regex captures in an array in Perl?](http://stackoverflow.com/questions/2304577/). – outis Nov 25 '11 at 23:09

3 Answers

8

In list context, a match with the global flag (/g) returns the captured substrings from every match as one flat list:

use Data::Dumper;

@strs = (
    'AB1234',
    'AB1234 BC2345',
    'AB1234BC2345',
    '',
    'AB12345601BC1234CD1232PE2343',
    'AB12345601 BC1234 CD1232 PE2343'
);

for $str (@strs) {
    # The money line
    %parts = ($str =~ /([A-Z][A-Z])(\d+)/g);

    print Dumper(\%parts);
}

For greater opacity, remove the parentheses around the pattern matching: %parts = $str =~ /([A-Z][A-Z])(\d+)/g;.
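
As a minimal sketch of why the list-context assignment works, the snippet below wraps it in a small subroutine. The name parse_pairs is hypothetical and not part of the original answer, and the sketch assumes keys do not repeat (if one did, the later value would overwrite the earlier one).

```perl
use strict;
use warnings;

# parse_pairs is a hypothetical helper name, used only for illustration.
sub parse_pairs {
    my ($str) = @_;
    # In list context, /g returns ($1, $2) for every match as one flat list,
    # which assigns to a hash as key, value, key, value, ...
    my %parts = $str =~ /([A-Z]{2})(\d+)/g;
    return \%parts;
}

my $parts = parse_pairs('AB12345601 BC1234CD1232');
print "$_ => $parts->{$_}\n" for sort keys %$parts;
```

An empty string produces no matches, so the returned hash is simply empty.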

outis
3

You are already there:

$hash{$1} = $2 while $string =~ /([[:alpha:]]{2})([0-9]+)/g;
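
The loop form is handy when each pair needs its own handling. As a sketch only: the question does not say whether a key can appear twice, so the following assumes it might and collects every value per key. The name collect_pairs is hypothetical.

```perl
use strict;
use warnings;

# collect_pairs is a hypothetical helper; repeated keys are an assumption,
# not something the question states.
sub collect_pairs {
    my ($string) = @_;
    my %hash;
    # In scalar context, each /g match resumes where the previous one ended.
    while ($string =~ /([[:alpha:]]{2})([0-9]+)/g) {
        push @{ $hash{$1} }, $2;    # keep all values seen for this key
    }
    return \%hash;
}

my $pairs = collect_pairs('AB1234 BC2345AB99');
print "$_ => @{ $pairs->{$_} }\n" for sort keys %$pairs;
```
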
choroba
0

Assuming your strings are definitely going to match your scheme (i.e. there won't be any strings of the form A122 or ABC123), then this should work:

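use Data::Dumper;

my @strings = ( 'AB1234', 'AB1234 BC2345', 'AB1234BC2345' );

foreach my $string (@strings) {
    $string =~ s/\s+//g;    # drop any whitespace between pairs
    # The capturing group makes split return the keys too;
    # $first absorbs the empty field before the first key.
    my ( $first, %elems ) = split(/([A-Z]{2})/, $string);
    while (my ($key,$value) = each %elems) {
        delete $elems{$key} unless $key =~ /^[A-Z]{2}$/;
        delete $elems{$key} unless $value =~ /^\d+$/;    # value: one or more digits
    }
    print Dumper \%elems;
}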
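
If that assumption cannot be guaranteed, one possible approach is to anchor every match with \G so that stray characters are detected rather than skipped. This is a sketch, not part of the original answer; parse_strict is a hypothetical name, and returning undef for malformed input is an arbitrary choice.

```perl
use strict;
use warnings;

# parse_strict is a hypothetical helper. It returns a hash reference, or
# undef if any part of the string does not fit the two-letter/digits scheme.
sub parse_strict {
    my ($string) = @_;
    my %hash;
    pos($string) = 0;
    # \G anchors each attempt where the previous match ended, and /c keeps
    # that position when a match fails, so nothing is skipped silently.
    while ($string =~ /\G\s*([A-Z]{2})(\d+)/gc) {
        $hash{$1} = $2;
    }
    # Only trailing whitespace may remain after the last pair.
    return $string =~ /\G\s*\z/gc ? \%hash : undef;
}

for my $input ('AB1234 BC2345', 'A122', 'ABC123') {
    my $result = parse_strict($input);
    my $shown  = $result
        ? join(', ', map { "$_=$result->{$_}" } sort keys %$result)
        : 'malformed';
    print "'$input': $shown\n";
}
```
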
CanSpice
  • The pure regex answers look a little cleaner. I was just trying something different with `split`. :-) – CanSpice Nov 25 '11 at 23:10
  • If it all comes in one string you could do something like `$string =~ s/\s+//g; my %h = map{split/(?<=\D)(?=\d)/}split/(?<=\d)(?=\D)/, $string;` – flesk Nov 25 '11 at 23:30
  • Or simply `%h = split /\s*(\d+)\s*/, $string` – TLP Nov 26 '11 at 02:02