0

I want to detect strings that contain a user's age, for example:

"I'm 24 years old" "J'ai 25 ans"

So essentially it would look for a string that:

  • starts with zero or more characters (any)
  • followed by two digits, matching both Arabic (0, 1, 2, etc.) and Hindi (٠, ١, ٢, etc.) numerals
  • followed by one of the 'age' words (years, ans, etc.)
  • ends with zero or more other characters (any)

I've used:

/^[0-9]{2} +(ans|year)$/

so far, but it only matches very specific strings like "24 year".

Barmar
Sherif Buzz

3 Answers

1

One possible approach might be

\b\p{N}+\s+(?:an|year)s?

which could be used for example in a lookahead. See a demo on regex101.com.

Your initial expression uses the anchors ^ and $, so it can only match when the number and the age word make up the entire string, from its beginning to its end.
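To illustrate the effect of the anchors, here is a minimal PHP sketch that compares the anchored pattern from the question with the unanchored one above. The helper name `findAge` is hypothetical, and the `u` modifier on the second pattern is an assumption added here, not part of the original expression.

```php
<?php
// Hypothetical helper: returns the matched substring, or null when nothing matches.
function findAge(string $pattern, string $text): ?string
{
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

// Pattern from the question: anchored at both ends of the string.
$anchored = '/^[0-9]{2} +(ans|year)$/';
// Unanchored pattern from this answer; the u modifier is an assumption.
$unanchored = '/\b\p{N}+\s+(?:an|year)s?/u';

foreach (["24 year", "I'm 24 years old", "J'ai 25 ans"] as $text) {
    // Dump what each pattern finds in the same input.
    var_dump(findAge($anchored, $text), findAge($unanchored, $text));
}
```

The anchored pattern should only find something in "24 year", while the unanchored one can pick the age out of a longer sentence.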

Jan
  • Thanks, that works perfectly for Arabic/Western numerals (1, 2, 3, etc.). How would I also make it match Hindi numerals (which are, confusingly, the numbers used in Arabic)? https://stackoverflow.com/questions/29729391/regular-expression-arabic-characters-and-numbers-only – Sherif Buzz Jul 15 '19 at 17:43
  • @SherifBuzz: Aha. Changed it to `\p{N}` which should match any digits in `PCRE`. – Jan Jul 15 '19 at 17:52
  • Unfortunately that doesn't seem to work. I've updated the demo: https://regex101.com/r/Re0J1Y/4/ – Sherif Buzz Jul 15 '19 at 18:10
0

Get rid of the ^ and $. They match the beginning and end of the string, so it won't work if you have "I am" at the beginning or "old" at the end.

If you want to match whole words, use \b instead.

/\b\d{2} +(ans|years)\b/

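And if you want to match numerals other than the Western Arabic ones (0-9), use \d instead of [0-9] together with the u modifier; without that modifier, \d only matches ASCII digits.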
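As a rough check of how \d behaves with and without Unicode mode, here is a small PHP sketch. The helper name `countMatches` and the sample strings are assumptions made for illustration.

```php
<?php
// Hypothetical helper: counts how many times $pattern matches in $text.
function countMatches(string $pattern, string $text): int
{
    return (int) preg_match_all($pattern, $text);
}

$western     = "I'm 24 years old";
$arabicIndic = "عندي ٢٣ سنة";

// Without the u modifier the subject is read byte by byte and \d is ASCII-only.
var_dump(countMatches('/\d{2}/', $western));
var_dump(countMatches('/\d{2}/', $arabicIndic));

// With the u modifier the subject is read as UTF-8 and \d covers Unicode decimal digits.
var_dump(countMatches('/\d{2}/u', $western));
var_dump(countMatches('/\d{2}/u', $arabicIndic));
```

Only the patterns carrying the u modifier should find the two-digit run in the second string.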

Barmar
  • If you want `\d` to match digits other than the ASCII ones, you have to use the u modifier, which extends the shorthand character classes to Unicode and makes the word boundary `\b` Unicode-aware. The u modifier also forces the regex engine to read the string code point by code point instead of byte by byte (the default behavior). – Casimir et Hippolyte Jul 15 '19 at 19:35
0

I'm not sure I have picked the right age words, but you might want to design an expression similar to:

\s+\p{N}{1,3}\s+(?:years?|an(?:née)?s|سنة|سنوات|عاما|साल)

DEMO

The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.

Test

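// The u modifier makes PCRE read the pattern and the subject as UTF-8,
// so \p{N} also matches non-ASCII digits such as ٢٣.
$re = '/\s+\p{N}{1,3}\s+(?:years?|an(?:née)?s|سنة|سنوات|عاما|साल)/u';
$str = 'I\'m 24 years old
J\'ai 25 ans
I have 25 year
عندي ٢٣ سنة
I\'m  24  years old
मैं 27 साल का हूँ
J\'ai  25  ans
I have 100  year
أنا 27 عاما
عندي  ٢٣  سنة';

// Collect every match; each $match[0] holds the full matched text.
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

foreach ($matches as $match) {
    // trim() drops the leading whitespace consumed by the initial \s+.
    print(trim($match[0]) . "\n");
}

Output

24 years
25 ans
25 year
٢٣ سنة
24  years
27 साल
25  ans
100  year
27 عاما
٢٣  سنة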
Emma