I have the following code in Python (pandas) on Databricks. It runs fine, but it does not filter out invalid phone numbers.

The code matches each number against a pattern and tags it as a home or mobile phone number:

import pandas as pd 
import re
from pyspark.sql.functions import lit, udf

df = Phonevalidation

# function to check the phone number pattern
def isValid(s): 
  Pattern = re.compile("(0|44)?[7-9][0-9]{9}") 
  if(Pattern.match(s)):
    return 'Mobile Number'
  else: return 'Home phone'

#UDF Register
PhType = udf(isValid)

df1 = Phonevalidation.withColumn('Phtype' ,PhType('Phonenumber') )
display(df1)

I expect phone numbers with a length other than 10 digits, or numbers like 0000000 or 11111, to be tagged as invalid.

Rakesh
shama khan

1 Answer

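The code you are currently using tags any string that starts with an optional leading zero or UK country code (44), then a 7, 8 or 9, then nine more digits, as a mobile number. Because match() only anchors at the start of the string, numbers with extra trailing digits pass too. Everything else (including malformed input) is tagged as a home number: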

  Pattern = re.compile("(0|44)?[7-9][0-9]{9}") 
  if(Pattern.match(s)):
    return 'Mobile Number'
  else: return 'Home phone'
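
The behaviour of that pattern can be checked outside Spark. This is only a minimal sketch: the sample numbers are made up for illustration, and `results` is just a helper dictionary, not something from the question.

```python
import re

# Same pattern as in the question
pattern = re.compile("(0|44)?[7-9][0-9]{9}")

# Made-up sample inputs (assumptions, not data from the question)
samples = ["7911123456", "447911123456", "791112345678901", "0000000", "2079460000"]

# match() anchors only at the start, so extra trailing digits are ignored
results = {s: bool(pattern.match(s)) for s in samples}
print(results)

# fullmatch() requires the whole string to fit the pattern
too_long_ok = pattern.fullmatch("791112345678901") is not None
print(too_long_ok)
```

The 15-digit sample is accepted by match() but rejected by fullmatch(), which is why over-long numbers end up tagged as mobile numbers, while short or repeated-digit input simply falls into the 'Home phone' branch.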

If you are after US numbers, the question "grep with regex for phone number" might help.

I expect phone numbers with a length other than 10 digits, or numbers like 0000000 or 11111, to be tagged as invalid.

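For the first part of your idea you can use a pattern like Pattern = re.compile("[0-9]{10}") together with Pattern.fullmatch(s), so that only strings of exactly 10 digits pass (match() alone would also accept longer strings). The second part I would put into pseudocode like this: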

if Pattern.fullmatch(s):
   # 'and', not 'or': with 'or' the test is true for every s
   if s != '0000000000' and s != '1111111111':
      return 'Fitting your criteria'
return 'Not valid'
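
Putting both parts together, one possible shape for the function is sketched below. It rests on assumptions: a valid number is taken to be exactly 10 digits with no prefix, any number made of a single repeated digit is treated as invalid, and the first digit 7-9 rule for mobiles is carried over from the question's pattern. The names `TEN_DIGITS` and `classify_phone` are hypothetical.

```python
import re

# Assumption: a valid number is exactly 10 digits, no country code or leading 0
TEN_DIGITS = re.compile(r"[0-9]{10}")

def classify_phone(s):
    """Return 'Invalid', 'Mobile Number' or 'Home phone' for a raw string."""
    # Wrong length, non-digit characters, or a missing value
    if s is None or not TEN_DIGITS.fullmatch(s):
        return 'Invalid'
    # All digits identical, e.g. '0000000000' or '1111111111'
    if len(set(s)) == 1:
        return 'Invalid'
    # Assumption taken from the question's pattern: mobiles start with 7, 8 or 9
    if s[0] in '789':
        return 'Mobile Number'
    return 'Home phone'

for number in ["7911123456", "2079460000", "0000000000", "11111", "79111234567"]:
    print(number, classify_phone(number))
```

In the notebook from the question this could presumably be registered with udf(classify_phone) and applied with withColumn in the same way as PhType; that step is left out here so the sketch runs without Spark.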
B--rian