
I am trying to map the values from a CSV file into an RDD, but I get the following error because some of the fields are empty.

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.NumberFormatException: empty String

The following is the code I am using:

// Load and parse the data
val data = sc.textFile("data.csv")

val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble))).cache()

Is there any way to check whether a field is null or empty? I thought of doing it with a try/catch, but that doesn't seem to work:

val parsedData = data.map(s => {
  try {
    val vector = Vectors.dense(s.split(',').map(_.toDouble))
  } catch {
    case e: NumberFormatException => println("Nulls somewhere")
  }
  vector
})
  • The spark-csv package can be used to read the CSV data; refer to https://stackoverflow.com/questions/29704333/spark-load-csv-file-as-dataframe. If you want the underlying RDD, call `rdd()` on the DataFrame object (see the sketch below). – shriyog Jan 06 '19 at 16:19
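
For reference, a minimal sketch of the DataFrame route suggested in the comment, assuming Spark 2.x with an existing `SparkSession` named `spark` (on Spark 1.x you would need the external spark-csv package instead):

// Read the CSV with the built-in reader; empty fields become nulls
val df = spark.read
  .option("inferSchema", "true")
  .csv("data.csv")

// Drop rows that contain nulls, then convert each Row to a dense Vector
val vectors = df.na.drop().rdd
  .map(row => Vectors.dense(row.toSeq.map(_.toString.toDouble).toArray))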

1 Answer


You can filter out items that are empty; just add a `filter` step to your stream:

val parsedData = data.map(s => Vectors.dense(s.split(',').filter(!_.isEmpty).map(_.toDouble))).filter(_.size != 0)

This way, any empty line results in an empty Vector, which the trailing `filter` then removes.
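
For example, a quick sketch of how the two filters behave; the input values here are made up for illustration:

val sample = sc.parallelize(Seq("1.0,2.0", "3.0,,4.0", ""))
val vectors = sample
  .map(s => Vectors.dense(s.split(',').filter(!_.isEmpty).map(_.toDouble)))
  .filter(_.size != 0)

// "1.0,2.0"  -> [1.0,2.0]  (kept as-is)
// "3.0,,4.0" -> [3.0,4.0]  (empty field dropped by the inner filter)
// ""         -> empty Vector, removed by the outer filter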

  • Thank you for your answer. Since I'm new to Scala and not accustomed to the syntax/expressions, can you write the code for how to filter it further? I want to filter the empty Vectors completely out of parsedData, since I want to use it for the KMeans algorithm (see the sketch after these comments). Thanks again my friend – Alastor Jan 06 '19 at 18:23
  • Sure, just add `filter(_.size != 0)`, which means that only vectors with size at least 1 will get through. I'll edit my answer. – Andronicus Jan 06 '19 at 18:26
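
Since the stated goal is KMeans, here is a minimal sketch of feeding the cleaned RDD into MLlib's KMeans; the values of `k` and `numIterations` below are arbitrary placeholders:

import org.apache.spark.mllib.clustering.KMeans

// parsedData is the filtered RDD[Vector] from the answer above.
// Caching helps because KMeans makes multiple passes over the data.
parsedData.cache()

val k = 2              // number of clusters (placeholder)
val numIterations = 20 // maximum iterations (placeholder)
val model = KMeans.train(parsedData, k, numIterations)

// Inspect the learned cluster centers
model.clusterCenters.foreach(println)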