I have a DataFrame column in which each field can hold multiple email addresses. I need to validate that every address contains @ and .com and that the addresses are separated by a ; delimiter.

a
-----------------------------------------------
sample@email.com;sample2@email.com
sample
sample@email.com
sample2@email.com;test2@email.,sample@email.com

Expected output :

a                                                 a_new
---------------------------------------------------------
sample@email.com;sample2@email.com                Valid
sample                                            Invalid
sample@email.com                                  Valid
sample2@email.com;test2@email.,sample@email.com   Invalid

The second and fourth records are invalid. In the second, the single email address is missing both @ and .com. In the fourth, test2@email. is missing com and is followed by a different delimiter (,).

I was able to validate a single email address, but I am not sure how to test a field that contains multiple email addresses.

blackbishop
Santosh
  • Show us the code you're using right now? One option for validating multiple addresses would be to split on `;` and then validate each of the resulting items. – larsks Jan 27 '21 at 22:30
  • @Santosh, you posted a brilliant question `pyspark/hive count using window function` but deleted. I have a solution. Let me know if you still need help and will post the answer – wwnde Feb 01 '22 at 12:59

1 Answer

For a complex email-validation regex, you can see this post.

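But if you only want to verify that each email has the form anything@anything.com, you can use a simple regex such as [^@;,\s]+@[^@;,\s]+\.com. The character class excludes @, the delimiters ; and ,, and whitespace, so one match cannot run across several addresses (a plain .+ would also match ; and ,). To check for a list of such emails separated by ;, use ^[^@;,\s]+@[^@;,\s]+\.com(;[^@;,\s]+@[^@;,\s]+\.com)*$ with the function rlike:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Reuse the active session (e.g. in the pyspark shell) or create one
spark = SparkSession.builder.getOrCreate()

data = [
    ("sample@email.com;sample2@email.com",),
    ("sample",),
    ("sample@email.com",),
    ("sample2@email.com;test2@email.,sample@email.com",)
]
df = spark.createDataFrame(data, ["a"])

# One address, optionally followed by more addresses, each preceded by ";"
email = "[^@;,\\s]+@[^@;,\\s]+\\.com"
pattern = "^" + email + "(;" + email + ")*$"

df1 = df.withColumn("a_new",
                    F.when(
                        F.col("a").rlike(pattern),
                        "Valid"
                    ).otherwise("Invalid")
                  )

df1.show(truncate=False)

#+-----------------------------------------------+-------+
#|a                                              |a_new  |
#+-----------------------------------------------+-------+
#|sample@email.com;sample2@email.com             |Valid  |
#|sample                                         |Invalid|
#|sample@email.com                               |Valid  |
#|sample2@email.com;test2@email.,sample@email.com|Invalid|
#+-----------------------------------------------+-------+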
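
As suggested in the comments, another option is to split on ; and validate each item separately. Here is a minimal sketch of that idea in plain Python, outside Spark. The names EMAIL_RE and is_valid_email_list are made up for illustration, and the pattern is only the simple anything@anything.com check, not full email validation:

```python
import re

# Hypothetical pattern: one simple address with no "@", ";", "," or whitespace
# inside the local part or the domain, ending in ".com".
EMAIL_RE = re.compile(r"[^@;,\s]+@[^@;,\s]+\.com")


def is_valid_email_list(value, delimiter=";"):
    """Return True only if every delimiter-separated item is a simple .com address."""
    parts = value.split(delimiter)
    # fullmatch requires the whole item to match, so empty items
    # (e.g. from a trailing ";") are rejected as well.
    return all(EMAIL_RE.fullmatch(part) is not None for part in parts)


samples = [
    "sample@email.com;sample2@email.com",
    "sample",
    "sample@email.com",
    "sample2@email.com;test2@email.,sample@email.com",
]

for s in samples:
    print(s, "->", "Valid" if is_valid_email_list(s) else "Invalid")
```

On Spark 3.1+, the same split-then-check idea can probably be written with F.split and F.forall instead of a Python UDF.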
blackbishop