
I have written code to validate email addresses using PySpark, but for an invalid address it returns part of the input instead of flagging it as invalid.

Input email address

alcaraz@lcc@uma.es

Actual output

lcc@uma.es

Expected output

"invalid email address"

Code tried

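from pyspark.sql.functions import expr, lower, regexp_replace

# Lower-case the first column and strip unwanted characters
df1 = df.withColumn(df.columns[0], regexp_replace(lower(df.columns[0]), "^a-zA-Z0-9@\._\-| ", ""))
# Extract every substring that looks like an email address
extract_expr = expr(
    "regexp_extract_all(emails, '(\\\w+([\\\.-]?\\\w+)*@\\[A-Za-z\-\.]+([\\\.-]?\\\w+)*(\\\.\\\w{2,3})+)', 0)")

df2 = df1.withColumn(df.columns[0], extract_expr) \
    .select(df.columns[0])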
Naveen
    this may help : [How can I validate an email address using a regular expression?](https://stackoverflow.com/questions/201323/how-can-i-validate-an-email-address-using-a-regular-expression) – Steven Aug 18 '21 at 08:11

1 Answer


There are numerous "solutions" claiming to be a definitive regular expression for RFC 5322 conformance. Here's the one I use. It may not handle 100% of cases.

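import re

# Local part, then '@', then one or more dot-separated domain labels
expr = r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
p = re.compile(expr)

for ema in ['boris@gov.uk', 'alcaraz@lcc@uma.es']:
    # fullmatch requires the whole string to be an address, not just a prefix
    v = 'valid' if p.fullmatch(ema) else 'invalid'
    print(f'{ema} is {v}')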
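
To illustrate why an extraction step can return lcc@uma.es for this input instead of rejecting it, here is a minimal sketch using Python's re module rather than Spark. The pattern `EMAIL_RE` is deliberately simplified (an assumption, not RFC 5322), and the helper names `extract_emails` and `validate_email` are hypothetical. The point is only the difference between searching for an address-like substring and requiring the whole string to match.

```python
import re

# Simplified pattern for illustration only (assumption: NOT a full RFC 5322 check).
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@(?:[a-z0-9-]+\.)+[a-z]{2,}", re.IGNORECASE)


def extract_emails(text):
    """Substring search: return every address-like fragment found in text."""
    return EMAIL_RE.findall(text)


def validate_email(text):
    """Whole-string check: return text if it is a single address, else a message."""
    return text if EMAIL_RE.fullmatch(text) else "invalid email address"


for candidate in ["boris@gov.uk", "alcaraz@lcc@uma.es"]:
    # Extraction finds a valid-looking tail even when the full string is malformed.
    print(candidate, extract_emails(candidate), validate_email(candidate))
```

Searching skips past the leading "alcaraz@" and finds a valid-looking tail, while the whole-string check rejects the input because of the second "@". In PySpark the equivalent would likely be testing the column with `rlike` and a pattern anchored with `^` and `$` rather than using `regexp_extract_all`; treat that as an assumption to verify against your Spark version.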