How to validate multiple email addresses using regexp in PySpark

Question

I have a dataframe column in which each field can hold multiple email addresses. I have to validate that every email address contains @ and .com, and that the addresses are separated by a ; delimiter.

a
-----------------------------------------------
sample@email.com;sample2@email.com
sample
sample@email.com
sample2@email.com;test2@email.,sample@email.com

Expected output :

a                                                 a_new
---------------------------------------------------------
sample@email.com;sample2@email.com                Valid
sample                                            Invalid
sample@email.com                                  Valid
sample2@email.com;test2@email.,sample@email.com   Invalid

The second and fourth records are invalid. The second holds a single value with neither @ nor .com. In the fourth, test2@email. is missing com and is followed by a , instead of the ; delimiter.

I was able to write the check for a single email address, but I am not sure how to test a field that contains multiple email addresses.

Show us the code you're using right now? One option for validating multiple addresses would be to split on `;` and then validate each of the resulting items. — larsks, Jan 27 '21 at 22:30
@Santosh, you posted a brilliant question `pyspark/hive count using window function` but deleted. I have a solution. Let me know if you still need help and will post the answer — wwnde, Feb 01 '22 at 12:59

score 0 · Answer 1 · answered Jan 27 '21 at 23:01

For a complex email validation regex, you can see this post.

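But if you only want to verify that each email has the form anything@anything.com, a simple regex is enough. Note that .+ also matches ; and , so a pattern like `.+@.+\.com` would accept a whole list, malformed entries included, as a single address. Exclude the delimiters with a character class instead: `[^@;,\s]+@[^@;,\s]+\.com`. To check that the field is a list of such emails separated by ; use `^[^@;,\s]+@[^@;,\s]+\.com(;[^@;,\s]+@[^@;,\s]+\.com)*$` with the function rlike:

from pyspark.sql import functions as F

data = [
    ("sample@email.com;sample2@email.com",),
    ("sample",),
    ("sample@email.com",),
    ("sample2@email.com;test2@email.,sample@email.com ",)
]
df = spark.createDataFrame(data, ["a"])

# One address: no '@', ';', ',' or whitespace on either side of the '@'
email = r"[^@;,\s]+@[^@;,\s]+\.com"
# Whole field: one address, then zero or more ';'-prefixed addresses
pattern = f"^{email}(;{email})*$"

df1 = df.withColumn("a_new",
                    F.when(
                        F.col("a").rlike(pattern),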
                        "Valid"
                    ).otherwise("Invalid")
                  )

df1.show(truncate=False)

#+------------------------------------------------+-------+
#|a                                               |a_new  |
#+------------------------------------------------+-------+
#|sample@email.com;sample2@email.com              |Valid  |
#|sample                                          |Invalid|
#|sample@email.com                                |Valid  |
#|sample2@email.com;test2@email.,sample@email.com |Invalid|
#+------------------------------------------------+-------+
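The split-then-validate idea suggested in the comments can also be checked outside a Spark session. The following is a minimal sketch in plain Python, assuming an address is valid when it has the form local@domain.com with no @, ; , or whitespace inside either part. The names EMAIL and classify are illustrative and not from the original post, and Python's re module stands in for the Java regex engine that Spark's rlike uses, so treat it as a quick way to test the logic rather than a drop-in replacement.

```python
import re

# Assumed rule for a single address: local@domain.com, where neither part
# contains '@', ';', ',' or whitespace.
EMAIL = re.compile(r"[^@;,\s]+@[^@;,\s]+\.com")


def classify(field: str) -> str:
    """Return "Valid" only if every ';'-separated item is a well-formed address."""
    items = field.split(";")
    # fullmatch requires the whole item to match, so empty items
    # (for example from a trailing ';') are rejected as well.
    return "Valid" if all(EMAIL.fullmatch(item) for item in items) else "Invalid"


samples = [
    "sample@email.com;sample2@email.com",
    "sample",
    "sample@email.com",
    "sample2@email.com;test2@email.,sample@email.com",
]

for value in samples:
    print(f"{value}: {classify(value)}")
```

Splitting first keeps the per-address pattern short and makes it easy to report which item failed, at the cost of one extra step compared with a single anchored regex.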
