
I am trying to read data from a CSV file using Scala and Spark, but the column values are null.

I tried to read data from the CSV file and also provided a schema so the data can be queried easily.

import org.apache.spark.sql.types._

private val myData = sparkSession.read.schema(createDataSchema).csv("data/myData.csv")

def createDataSchema = {
    val schema = StructType(
      Array(
        StructField("data_index", StringType, nullable = false),
        StructField("property_a", IntegerType, nullable = false),
        StructField("property_b", IntegerType, nullable = false),
        //some other columns
     )
   )

   schema
}
Querying data:

val myProperty= accidentData.select($"property_b")
myProperty.collect()

I expect the data to be returned as a list of values, but it is returned as a list containing only null values. Why?

When I print the schema then nullable is set to true instead of false.

I am using Scala 2.12.9 and Spark 2.4.3.

eisem
  • What does the csv contain? – Rakshith Sep 11 '19 at 15:15
  • 1
    Your dataframe is `myData` and querying to `accidentData`. – Lamanus Sep 11 '19 at 15:20
  • it should be myData because I would like to anonymize the variable names. So it should be `val myProperty = myData.select($"property_b")`. The original csv contains data about accidents in the UK taken from [kaggle](https://www.kaggle.com/daveianhickey/2000-16-traffic-flow-england-scotland-wales) – eisem Sep 11 '19 at 15:26
  • can you add a screenshot of your csv ? – firsni Sep 11 '19 at 15:27
  • The csv file is a huge one with about 33 columns and over 500,000 rows. – eisem Sep 11 '19 at 15:30

1 Answer


When loading data from a CSV file, even though the schema is provided with nullable = false, Spark still overwrites it as nullable = true so that a NullPointerException can be avoided during the load.

As an example, assume the CSV file has two rows, with the second row containing an empty (null) column value.

CSV:
a,1,2
b,,2

If nullable = false were kept, a NullPointerException would be thrown when an action is called on the DataFrame, because there is an empty/null value to load and no default value to fall back on. To avoid this, Spark overwrites the field as nullable = true.

However, this can be handled by replacing all nulls with a default value and then re-applying the schema.

import org.apache.spark.sql.functions.{col, when}

val df = spark.read.schema(schema).csv("data/myData.csv")
// replace nulls in property_a with a default value (0)
val dfWithDefault = df.withColumn("property_a", when(col("property_a").isNull, 0).otherwise(col("property_a")))
// re-create the DataFrame from the RDD with the original non-nullable schema
val dfNullableFalse = spark.sqlContext.createDataFrame(dfWithDefault.rdd, schema)
dfNullableFalse.show(10)

df.printSchema()
root
|-- data_index: string (nullable = true)
|-- property_a: integer (nullable = true)
|-- property_b: integer (nullable = true)

dfNullableFalse.printSchema()
root
|-- data_index: string (nullable = false)
|-- property_a: integer (nullable = false)
|-- property_b: integer (nullable = false)
hagarwal
  • When I run this I receive an execution exception: SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 – eisem Sep 12 '19 at 11:49
  • What is the spark version? Are you running locally or on cluster? If on cluster what is memory per executor & number of executor? Input CSV size? – hagarwal Sep 12 '19 at 11:58
  • I have Spark version 2.4.3 and I am running it locally on my computer. The size of the CSV file is 156 MB. Also I am using Scala 2.12.9 – eisem Sep 12 '19 at 12:06
  • I have tried some other csv files to investigate the null issue. Smaller files can be read without any problems; only the data from huge files are null. I tested a file with 1 column and 500,000 rows and another file with 3 columns and 10 rows. – eisem Sep 13 '19 at 15:39
  • 2
    I found a solution: to read large csv files I had to set some Options: `sparkSession.read .option("header","true") .option("inferSchema","true") .csv("myCsvFile.csv")` I found the solution [here](https://stackoverflow.com/questions/41410209/how-to-load-large-csv-with-many-fields-to-spark). – eisem Sep 16 '19 at 07:08
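The fix from the comment above can be sketched as follows. This is only an illustration, not the asker's exact code: `sparkSession` and the file path are assumed from the question, and `inferSchema` replaces the explicit schema (so the inferred column types should be checked with `printSchema()` afterwards).

```scala
// Sketch of the workaround from the comments: for a large CSV,
// read the header and let Spark infer the schema instead of
// supplying a hand-written one up front.
val myData = sparkSession.read
  .option("header", "true")      // first line holds the column names
  .option("inferSchema", "true") // Spark scans the file to pick column types
  .csv("data/myData.csv")

myData.printSchema() // verify the inferred types before querying

val myProperty = myData.select($"property_b")
myProperty.collect()
```

Note that `inferSchema` requires an extra pass over the file, which adds some read time for a 156 MB CSV, but it avoids the mismatch between a fixed schema and the actual file contents that produced the null columns.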