Regex: Matching 4-Digits within words

Question

I have a body of text from which I want to pull repeated sets of 4-digit numbers.

For example:

The first is 1234 2) The Second is 2098 3) The Third is 3213

Now I know I'm able to get the first set of digits out by simply using:

    /\d{4}/

...returning 1234

But how do I match the second set of digits, or the third, and so on...?

edit: How do I return 2098, or 3213?

Hi Rohit. I'm using Perl. My mistake, I assumed all Regex was the same. — Andy 'Drew' Dodd, Aug 24 '13 at 20:57
And for the record, there are several 'dialects' of regular expressions, each with its own set of supported features. For instance, RegExp in JavaScript does not support negative look-behinds which are supported by Perl-style regexps. — Rob Raisch, Aug 24 '13 at 21:02
Here's an idea. If you want to match the second "4 digits", you first match the first digits with `\d{4}`. You then match everything ungreedily and match another 4 digits: `\d{4}.*?\d{4}`. The problem now is that you have a weird match (4 digits + random data + 4 digits). To solve this, you may use the `\K` escape: it "forgets" everything that has already been matched, and it's a powerful replacement for "unlimited lookbehinds". So the final expression would look like `\d{4}.*?\K\d{4}`. You should get the second 4 digits. Now let's just hope your system is based on [Perl 5.10+](http://stackoverflow.com/q/13542950) — HamZa, Aug 24 '13 at 21:20
[See the expression in action when it's supported !!!](http://regex101.com/r/zH5kO0) — HamZa, Aug 24 '13 at 21:21
Upvotes for HamZa! Thank you, works a charm. How do I progress to the third, fourth and so on? — Andy 'Drew' Dodd, Aug 24 '13 at 21:34
@Andy'Drew'Dodd Please don't forget to "ping" by using "@", otherwise I won't notice. Anyway, you should be using `\d{4}.*?\K\d{4}` for the second, `\d{4}.*?\d{4}.*?\K\d{4}` for the third and `\d{4}.*?\d{4}.*?\d{4}.*?\K\d{4}` for the fourth, and so on ... — HamZa, Aug 24 '13 at 21:38
@loldop That's based on PCRE PHP, but I've got a lot of other services for other languages :P For example this [one](http://www.rexv.org/) it supports PHP PCRE, Perl PCRE, Python, JS ... — HamZa, Aug 25 '13 at 09:08
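
For anyone trying the `\K` patterns from the comments above in Perl itself, here is a minimal sketch. It assumes Perl 5.10 or later and the sample string from the question; since the patterns have no capture groups, the matched digits are read from `$&`.

```perl
use strict;
use warnings;

my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';

# \K drops everything matched so far from the reported match,
# so $& holds only the four digits that follow it.
my $second = $str =~ /\d{4}.*?\K\d{4}/        ? $& : undef;
my $third  = $str =~ /\d{4}.*?\d{4}.*?\K\d{4}/ ? $& : undef;

print "$second $third\n";   # prints "2098 3213"
```

Note that these patterns have no `\b` anchors, so they would also pick four digits out of a longer run of digits.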

score 11 · Accepted Answer · edited Aug 25 '13 at 07:32

You don't appear to have a proper answer to your question yet.

The solution is to use the /g modifier on your regex. In list context it will find all of the numbers in your string at once, like this

    my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';

    my @numbers = $str =~ /\b \d{4} \b/gx;

    print "@numbers\n";

output

    1234 2098 3213

Or you can iterate through them, using scalar context in a while loop, like this

    while ($str =~ /\b (\d{4}) \b/gx) {
      my $number = $1;
      print $number, "\n";
    }

output

    1234
    2098
    3213

I have added the \b patterns to the regex so that it only matches whole four-digit numbers and doesn't, for example, find 1234 in 1234567. The /x modifier just allows me to add spaces so that the pattern is more intelligible.
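
To make the effect of `\b` concrete, here is a small sketch on a made-up input (the sample text is an assumption, not from the question). It also shows a lookaround variant, which is not part of the answer above, for the case where the digits sit inside a word: `\b` treats letters and underscores as word characters, so it rejects that case.

```perl
use strict;
use warnings;

# Hypothetical sample: a 7-digit run, a plain 4-digit number,
# and 4 digits embedded in a word.
my $text = 'id 1234567, code 4321, ANumber1234InTheMiddleOfAWord';

my @loose       = $text =~ /\d{4}/g;               # any 4 consecutive digits
my @strict      = $text =~ /\b\d{4}\b/g;           # whole 4-digit "words" only
my @digits_only = $text =~ /(?<!\d)\d{4}(?!\d)/g;  # exactly 4 digits, letters allowed around

print "loose:       @loose\n";         # 1234 4321 1234
print "strict:      @strict\n";        # 4321
print "digits_only: @digits_only\n";   # 4321 1234
```

Which of the three is right depends on whether "within words" in the title is meant literally.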

score 1 · Answer 2 · answered Aug 24 '13 at 21:00

See http://perldoc.perl.org/perlre.html for discussion of the 'g' modifier, which causes your regular expression to match ALL occurrences of its pattern, not just the first.

Rob Raisch

I am using a system that only accepts a regular expression as part of a function; it takes only the first match and doesn't allow me to use modifiers like 'g'. I would be looking for a syntax that would say "give me the 2nd match of `\d{4}`". Not sure if I'm making sense. – Andy 'Drew' Dodd Aug 24 '13 at 21:03
What does the documentation of the function you're using say about matching multiple copies of a pattern? What is the function? – Rob Raisch Aug 24 '13 at 21:05

ajb · Answer 3 · 2013-08-25T18:11:10.280

If you want a pattern that finds the $n'th 4-digit group, this seems to work:

    $pat = "^(?:.*?\\b(\\d{4})\\b){$n}";
    if ($s =~ /$pat/) {
       print "Found $1\n";
    } else {
       print "Not found\n";
    }

I did this by building a string pattern because I couldn't get a variable interpolated into a quantifier {$n}.

This pattern finds 4-digit groups that are on word boundaries (the \b tests); I don't know if that meets your requirements. The pattern uses .*? to ensure that as few characters as possible are matched between each four-digit group. The pattern is matched $n times, and the capture group $1 is set to whatever it was in the last iteration, i.e. the $n'th one.

EDIT: When I just tried it again, it seemed to interpolate $n in a quantifier just fine. I don't know what I did differently that it didn't work last time. So maybe this will work:

    if ($s =~ /^(?:.*?\b(\d{4})\b){$n}/) { ...

If not, see amon's comment about qr//.

Ah, the dreaded double backslash. Protip: Use regex quotes `qr//`. Then: `qr/^(?: .*? \b(\d{4})\b ){$n}/x` — amon, Aug 25 '13 at 07:37
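
Putting the pattern together with the `qr//` suggestion, a rough sketch might look like the following. The helper name `nth_four_digit` is made up for illustration, and the sample string is the one from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the $n'th whole 4-digit number in $s, or undef.
sub nth_four_digit {
    my ($s, $n) = @_;
    # The group repeats $n times; $1 keeps the capture from the last repetition.
    # Assumption: single-line input. Add /s if the text can contain newlines,
    # since "." does not match "\n" by default.
    my $re = qr/^(?: .*? \b(\d{4})\b ){$n}/x;
    return $s =~ $re ? $1 : undef;
}

my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
for my $n (1 .. 4) {
    my $found = nth_four_digit($str, $n);
    print "$n: ", (defined $found ? $found : 'not found'), "\n";
}
```

With the sample string this prints 1234, 2098 and 3213 for the first three values of `$n`, and "not found" for the fourth.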

Bohemian · Answer 4 · 2013-08-24T22:42:50.377

If the regex is only matched once, then match all three in one regex and extract them using matched groups:

    ^.*\b(\d{4})\b.*\b(\d{4})\b.*\b(\d{4})\b.*$

The three 4-digit numbers will be captured in groups 1, 2 and 3.
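
One thing worth checking before relying on this: because `.*` is greedy, the pattern settles on the last three 4-digit numbers when the text contains more than three. The sketch below uses a made-up four-number input to compare it with a lazy `.*?` variant, which is my own variation and not part of the answer.

```perl
use strict;
use warnings;

# Hypothetical input containing four 4-digit numbers.
my $four = 'a 1111 b 2222 c 3333 d 4444';

# Pattern as given above: greedy .* pushes the captures toward the end.
my @greedy = $four =~ /^.*\b(\d{4})\b.*\b(\d{4})\b.*\b(\d{4})\b.*$/;

# Lazy variant: .*? keeps the captures at the first three numbers.
my @lazy = $four =~ /^.*?\b(\d{4})\b.*?\b(\d{4})\b.*?\b(\d{4})\b/;

print "greedy: @greedy\n";   # 2222 3333 4444
print "lazy:   @lazy\n";     # 1111 2222 3333
```

With exactly three numbers, as in the question's example, both forms capture the same values.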

edited Aug 24 '13 at 22:42

answered Aug 24 '13 at 21:33

I think this will cause problems with the OP's example because the source contains "1)" and "2)", and that's going to fail the `\D+` test. – ajb Aug 24 '13 at 21:51
Yes, that should be better. `\b` is the approach others and I were taking, but if the OP wants to extract 1234 out of ANumber1234InTheMiddleOfAWord then we'd need something different. We don't really know his exact requirements. – ajb Aug 24 '13 at 23:57

score 0 · Answer 5 · answered Jun 25 '14 at 18:16

Ajb's answer with "gx" is the best. If you know you will have three numbers, this straightforward line does the trick:

    my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
    my ($num1, $num2, $num3) = $str =~ /\b \d{4} \b/gx;
    print "$num1, $num2, $num3\n";
