7

I have a body of text from which I want to extract repeated sets of 4-digit numbers.

For example:

The first is 1234 2) The Second is 2098 3) The Third is 3213

Now I know I'm able to get the first set of digits out by simply using:

    /\d{4}/

...returning 1234

But how do I match the second set of digits, or the third, and so on...?

Edit: How do I return 2098, or 3213?

HamZa
Andy 'Drew' Dodd
  • 7
    What language are you using? – Rohit Jain Aug 24 '13 at 20:48
  • 1
    Hi Rohit. I'm using Perl. My mistake, I assumed all Regex was the same. – Andy 'Drew' Dodd Aug 24 '13 at 20:57
  • 2
    And for the record, there are several 'dialects' of regular expressions, each with its own set of supported features. For instance, RegExp in JavaScript does not support negative look-behinds which are supported by Perl-style regexps. – Rob Raisch Aug 24 '13 at 21:02
  • Here's an idea. If you want to match the second "4 digits", first match the first four digits with `\d{4}`, then match everything ungreedily up to another four digits: `\d{4}.*?\d{4}`. The problem now is that you have a weird match (4 digits + random data + 4 digits). To solve this, you may use the `\K` escape sequence: it "forgets" everything that has already been matched, which makes it a powerful replacement for "unlimited lookbehinds". So the final expression would look like `\d{4}.*?\K\d{4}`. You should get the second 4 digits. Now let's just hope your system is based on [Perl 5.10+](http://stackoverflow.com/q/13542950) – HamZa Aug 24 '13 at 21:20
  • [See the expression in action when it's supported !!!](http://regex101.com/r/zH5kO0) – HamZa Aug 24 '13 at 21:21
  • Upvotes for HamZa! Thank you, works a charm. How do I progress to the third, fourth and so on? – Andy 'Drew' Dodd Aug 24 '13 at 21:34
  • @Andy'Drew'Dodd Please don't forget to "ping" by using "@", otherwise I won't notice. Anyway, you should be using `\d{4}.*?\K\d{4}` for the second, `\d{4}.*?\d{4}.*?\K\d{4}` for the third, `\d{4}.*?\d{4}.*?\d{4}.*?\K\d{4}` for the fourth, and so on ... – HamZa Aug 24 '13 at 21:38
  • @HamZa wow, interesting service! Which regexes does it support? – gaussblurinc Aug 25 '13 at 09:00
  • @loldop That's based on PCRE PHP, but I've got a lot of other services for other languages :P For example this [one](http://www.rexv.org/) it supports PHP PCRE, Perl PCRE, Python, JS ... – HamZa Aug 25 '13 at 09:08
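
The `\K` patterns from the comments above can be generated rather than typed out by hand. This is a minimal sketch, assuming Perl 5.10 or later; the helper name `nth_pattern` is hypothetical and does not come from the thread. Like the patterns in the comments, it has no `\b` anchors, so it would also match inside longer digit runs.

```perl
use strict;
use warnings;

# Hypothetical helper: build a pattern whose overall match ($&) is the
# $n-th run of four digits. It skips $n-1 earlier runs and then drops
# them from the match with \K (requires Perl 5.10+).
sub nth_pattern {
    my ($n) = @_;
    my $skip = '\d{4}.*?' x ($n - 1);
    return qr/$skip\K\d{4}/;
}

my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';

for my $n (1 .. 3) {
    my $re = nth_pattern($n);
    print "$n: $&\n" if $str =~ $re;
}
# prints:
# 1: 1234
# 2: 2098
# 3: 3213
```

Because the wanted number is the whole match rather than a capture group, this form suits a system that only returns the first overall match.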

5 Answers

11

You don't appear to have a proper answer to your question yet.

The solution is to use the /g modifier on your regex. In list context it will find all of the numbers in your string at once, like this

    my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';

    my @numbers = $str =~ /\b \d{4} \b/gx;

    print "@numbers\n";

output

    1234 2098 3213

Or you can iterate through them, using scalar context in a while loop, like this

    while ($str =~ /\b (\d{4}) \b/gx) {
      my $number = $1;
      print $number, "\n";
    }

output

    1234
    2098
    3213

I have added the \b patterns to the regex so that it only matches whole four-digit numbers and doesn't, for example, find 1234 in 1234567. The /x modifier just allows me to add spaces so that the pattern is more intelligible.
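
To pick out a single match by position, one option is a list slice on the `/g` match. The sketch below also shows the effect of `\b` described above; the test string `'1234567 and 4321'` is made up for illustration.

```perl
use strict;
use warnings;

my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';

# A list slice on the /g match selects one match by position
# (index 1 is the second match).
my $second = ( $str =~ /\b \d{4} \b/gx )[1];
print "$second\n";    # prints 2098

# Without \b the pattern also matches inside longer digit runs.
my @loose  = '1234567 and 4321' =~ /\d{4}/g;
my @strict = '1234567 and 4321' =~ /\b \d{4} \b/gx;
print "@loose\n";     # prints 1234 4321
print "@strict\n";    # prints 4321
```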

amon
Borodin
1

See http://perldoc.perl.org/perlre.html for a discussion of the 'g' modifier, which causes your regular expression to match ALL occurrences of its pattern, not just the first.

Rob Raisch
  • I am using a system that only accepts a regular expression as part of a function; it takes only the first match and doesn't allow me to use modifiers like 'g'. I am looking for a syntax that would say "give me the 2nd match of \d{4}". Not sure if I'm making sense. – Andy 'Drew' Dodd Aug 24 '13 at 21:03
  • 2
    What does the documentation of the function you're using say about matching multiple copies of a pattern? What is the function? – Rob Raisch Aug 24 '13 at 21:05
1

If you want a pattern that finds the $n'th 4-digit group, this seems to work:

    $pat = "^(?:.*?\\b(\\d{4})\\b){$n}";
    if ($s =~ /$pat/) {
       print "Found $1\n";
    } else {
       print "Not found\n";
    }

I did this by building a string pattern because I couldn't get a variable interpolated into a quantifier {$n}.

This pattern finds 4-digit groups that are on word boundaries (the \b tests); I don't know if that meets your requirements. The pattern uses .*? to ensure that as few characters as possible are matched between each four-digit group. The pattern is matched $n times, and the capture group $1 is set to whatever it was in the last iteration, i.e. the $n'th one.

EDIT: When I just tried it again, it seemed to interpolate $n in a quantifier just fine. I don't know what I did differently that it didn't work last time. So maybe this will work:

    if ($s =~ /^(?:.*?\b(\d{4})\b){$n}/) { ...

If not, see amon's comment about qr//.
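
Putting the pattern above together with the `qr//` form from amon's comment gives something like the following. This is only a sketch: the function name `nth_four_digits` is hypothetical, and the `/s` modifier is an added assumption so that `.` can cross newlines in multi-line text.

```perl
use strict;
use warnings;

# Hypothetical helper: return the $n-th whole four-digit number in $s,
# or undef if there are fewer than $n of them.
sub nth_four_digits {
    my ($s, $n) = @_;
    # qr// avoids the doubled backslashes of a string pattern;
    # /x allows whitespace in the pattern, /s lets . match newlines.
    my $re = qr/^(?: .*? \b(\d{4})\b ){$n}/xs;
    return $s =~ $re ? $1 : undef;
}

my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';

print nth_four_digits($str, 2), "\n";    # prints 2098

my $fourth = nth_four_digits($str, 4);
print( defined($fourth) ? "Found $fourth\n" : "Not found\n" );    # prints Not found
```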

ajb
  • 1
    Ah, the dreaded double backslash. Protip: Use regex quotes `qr//`. Then: `qr/^(?: .*? \b(\d{4})\b ){$n}/x` – amon Aug 25 '13 at 07:37
0

If the regex is only matched once, then match all three in one regex and extract them using matched groups:

    ^.*\b(\d{4})\b.*\b(\d{4})\b.*\b(\d{4})\b.*$

The three 4-digit numbers will be captured in groups 1, 2 and 3.
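
In Perl, which the asker says they are using, the groups can be read back with a list assignment. A minimal sketch using the sample string from the question:

```perl
use strict;
use warnings;

my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';

# In list context, a match without /g returns the captured groups.
my ($first, $second, $third) =
    $str =~ /^.*\b(\d{4})\b.*\b(\d{4})\b.*\b(\d{4})\b.*$/;

print "$first $second $third\n";    # prints 1234 2098 3213
```

Note that the `.*` parts are greedy, so if the text held more than three such numbers this pattern would capture the last three; replacing each `.*` with `.*?` would capture the first three instead.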

Bohemian
  • 1
    I think this will cause problems with the OP's example because the source contains "1)" and "2)", and that's going to fail the `\D+` test. – ajb Aug 24 '13 at 21:51
  • 1
    Yes, that should be better. `\b` is the approach others and I were taking, but if the OP wants to extract 1234 out of ANumber1234InTheMiddleOfAWord then we'd need something different. We don't really know his exact requirements. – ajb Aug 24 '13 at 23:57
0

Borodin's answer with "gx" is the best. If you know you will have three numbers, this straightforward line does the trick:

    my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
    my ($num1, $num2, $num3) = $str =~ /\b \d{4} \b/gx;
    print "$num1, $num2, $num3\n";
AlexD