I have created a Spark dataset from a CSV file.
The schema is:
|-- FirstName: string (nullable = true)
|-- LastName: string (nullable = true)
|-- Email: string (nullable = true)
|-- Phone: string (nullable = true)
I am performing deduplication on the email field:
Dataset<Row> customer = spark.read().option("header", "true").option("charset", "UTF8")
        .option("delimiter", ",").csv(path);
// Selecting only the Email column returns the distinct email values, not the full rows
Dataset<Row> distinct = customer.select("Email").distinct();
I would like to create an output CSV file containing only the rows with distinct email IDs.
How can I query the dataset to retrieve the full records (all columns) with distinct emails?
Sample Input:
John David john.david@abc.com 2222
John Smith john.smith@abc.com 4444
John D john.david@abc.com 2222
Sample Output:
John David john.david@abc.com 2222
John Smith john.smith@abc.com 4444
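A minimal sketch of the kind of result I am after, assuming Spark's `Dataset.dropDuplicates` on the `Email` column is a suitable tool here. The class name `DistinctEmailSketch`, the helper `dedupeByEmail`, and the input/output paths are hypothetical and only for illustration. As far as I understand, `dropDuplicates` keeps one row per email but does not guarantee which of the duplicates survives, so the first-seen row in the sample output above may not always be the one retained.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DistinctEmailSketch {

    // Keeps one full row (all columns) per distinct Email value.
    // Which of the duplicate rows is kept is not guaranteed.
    static Dataset<Row> dedupeByEmail(Dataset<Row> customer) {
        return customer.dropDuplicates("Email");
    }

    public static void main(String[] args) {
        // Hypothetical paths, passed on the command line.
        String inputPath = args[0];
        String outputPath = args[1];

        SparkSession spark = SparkSession.builder()
                .appName("DistinctEmailSketch")
                .master("local[*]")
                .getOrCreate();

        // Same read options as in the question.
        Dataset<Row> customer = spark.read()
                .option("header", "true")
                .option("charset", "UTF8")
                .option("delimiter", ",")
                .csv(inputPath);

        // Spark writes a directory of part files at outputPath, not a single file.
        dedupeByEmail(customer)
                .write()
                .option("header", "true")
                .csv(outputPath);

        spark.stop();
    }
}
```

If a single output file is needed, calling `coalesce(1)` before `write()` is one option, at the cost of funnelling all rows through one task.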
Thanks in advance