2

We have a problem with a regular expression on hive. We need to exclude the numbers with +37 or 0037 at the beginning of the record (it could be a false result on the regex like) and without letters or space.

We're trying with this one: regexp_like(tel_number,'^\+37|^0037+[a-zA-ZÀÈÌÒÙ ]') but it doesn't work.

Edit: we want it to come out from the select as true (correct number) or false.

wmbff
  • 45
  • 4

3 Answers3

1

To exclude numbers which start with +01 0r +001 or +0001 and having only digits without spaces or letters:

... WHERE tel_number NOT rlike '^\\+0{1,3}1\\d+$'

Special characters like + and character classes like \d in Hive should be escaped using double-slash: \\+ and \\d.

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • sorry, I meant in the select (I want to see true if the number is ok, false if it isn't) – wmbff Jul 16 '21 at 09:31
  • @wmbff `select tel_number NOT rlike '^\\+0{1,3}1\\d+$' as your_col_name ...` - this will give True if number is OK and False if not – leftjoin Jul 16 '21 at 09:40
  • 1
    Note that this won't exclude numbers that start with `+01` if they are malformed in any other way like: `+01foo1234`. – Manuel Batsching Jul 16 '21 at 10:29
1

The general question is, if you want to describe a malformed telephone number in your regex and exclude everything that matches the pattern or if you want to describe a well-formed telephone number and include everything that matches the pattern.

Which way to go, depends on your scenario. From what I understand of your requirements, adding "not starting with 0037 or +37" as a condition to a well-formed telephone number could be a good approach.

The pattern would be like this:

  1. Your number can start with either + or 00: ^(\+|00)

  2. It cannot be followed by a 37 which in regex can be expressed by the following set of alternatives:

    a. It is followed first by a 3 then by anything but 7: 3[0-689]

    b. It is followed first by anything but 3 then by any number: [0-24-9]\d

  3. After that there is a sequence of numbers of undefined length (at least one) until the end of the string: \d+$

Putting everything together:

^(\+|00)(3[0-689]|[0-24-9]\d)\d+$

You can play with this regex here and see if this fits your needs: https://regex101.com/r/KK5rjE/3

Note: as leftjoin has pointed out: To use this regex in hive you might need to additionally escape the backslashes \ in the pattern.

Manuel Batsching
  • 3,406
  • 14
  • 20
1

You can use

regexp_like(tel_number,'^(?!\\+37|0037)\\+?\\d+$')

See the regex demo. Details:

  • ^ - start of string
  • (?!\+37|0037) - a negative lookahead that fails the match if there is +37 or 0037 immediately to the right of the current location
  • \+? - an optional + sign
  • \d+ - one or more digits
  • $ - end of string.
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563