
I am writing a Spark project in Scala in which I need to make some calculations on "demo" datasets. I am using the Databricks platform.

I need to pass the 2nd column of my Dataframe (trainingCoordDataFrame) into a list. The type of the list must be List[Int].

The dataframe is as shown below:

> +---+---+---+---+
> |_c0|_c1|_c2|_c3|
> +---+---+---+---+
> |1  |0  |0  |a  |
> |11 |9  |1  |a  |
> |12 |2  |7  |c  |
> |13 |2  |9  |c  |
> |14 |2  |4  |b  |
> |15 |1  |3  |c  |
> |16 |4  |6  |c  |
> |17 |3  |5  |c  |
> |18 |5  |3  |a  |
> |2  |0  |1  |a  |
> |20 |8  |9  |c  |
> |3  |1  |0  |b  |
> |4  |3  |4  |b  |
> |5  |8  |7  |b  |
> |6  |4  |9  |b  |
> |7  |2  |5  |a  |
> |8  |1  |9  |a  |
> |9  |3  |6  |a  |
> +---+---+---+---+

I am trying to create the list I want using the following command:

val trainingCoordList = trainingCoordDataFrame.select("_c1").collect().map(each => (each.getAs[Int]("_c1"))).toList

The error message I get is this:

java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
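
For reference, the dataframe's declared column type can be checked directly; a diagnostic sketch using the dataframe above:

// Prints StringType: without an explicit schema, every CSV column is read as a string.
println(trainingCoordDataFrame.schema("_c1").dataType)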

Note that the procedure is:

1) Upload the dataset from my local PC to Databricks (so no standard dataset can be used).

val mainDataFrame = spark.read.format("csv").option("header", "false").load("FileStore/tables/First_Spacial_Dataset_ByAris.csv")
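
Since no schema is given here, every column comes back as string. If you would rather not declare a schema by hand, Spark's CSV reader can also sample the file and infer the types (a sketch; inferredDataFrame is a name introduced for illustration, and the inference pass makes the read slower):

val inferredDataFrame = spark.read.format("csv")
  .option("header", "false")
  .option("inferSchema", "true") // Spark samples the data; _c0.._c2 come back as int
  .load("FileStore/tables/First_Spacial_Dataset_ByAris.csv")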

2) Create the dataframe. (Step one: split the main dataframe randomly. Step two: remove the unnecessary columns.)

val Array(trainingDataFrame,testingDataFrame) = mainDataFrame.randomSplit(Array(0.8,0.2)) //step one
val trainingCoordDataFrame = trainingDataFrame.drop("_c0", "_c3") //step two
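
Since randomSplit produces a different split on every run, it may help to pass a seed for reproducibility (the seed value below is arbitrary):

val Array(trainingDataFrame, testingDataFrame) = mainDataFrame.randomSplit(Array(0.8, 0.2), seed = 42L) // reproducible variant of step one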

3) Create the list. <- This is the command that fails.

What is the correct way to reach the result I want?

Aris Kantas
    Check this post: https://stackoverflow.com/questions/29383107/how-to-change-column-types-in-spark-sqls-dataframe – Dominik Wosiński May 06 '19 at 18:26
  • Looks like column "_c1" is of type String; casting to Integer is required, something like: trainingCoordDataFrame.select($"_c1".cast(IntegerType)) – pasha701 May 06 '19 at 19:20

2 Answers


I think there are several ways to deal with this problem.

A) Define a schema for your CSV:

For example:

  import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

  val customSchema = StructType(Array(
    StructField("_c0", IntegerType),
    StructField("_c1", IntegerType),
    StructField("_c2", IntegerType),
    StructField("_c3", StringType)))

When you read the CSV, add the schema we created earlier using the schema method:

val mainDataFrame = spark.read.format("csv").option("header", "false").schema(customSchema).load("FileStore/tables/First_Spacial_Dataset_ByAris.csv")

Now if we look at the output of mainDataFrame.printSchema(), we'll see that the columns are typed according to your use case:

root
  |-- _c0: integer (nullable = true)
  |-- _c1: integer (nullable = true)
  |-- _c2: integer (nullable = true)
  |-- _c3: string (nullable = true)

This means your original getAs[Int] command now runs without an error. Equivalently, to pull the 2nd column into a List[Int]:

trainingCoordDataFrame.select("_c1").map(r => r.getInt(0)).collect.toList

B) Cast the entire column to Int

Refer to the column itself instead of the column name, and cast the column to IntegerType. Now that the column type is Int, you can again use getInt where it failed earlier:

import org.apache.spark.sql.types.IntegerType
import spark.implicits._ // provides the $ column syntax (pre-imported in Databricks notebooks)

trainingCoordDataFrame.select($"_c1".cast(IntegerType)).map(r => r.getInt(0)).collect.toList

C) Cast each value individually

Use map to retrieve each individual value as a String and then convert it to an Int:

trainingCoordDataFrame.select("_c1").map(r => r.getString(0).toInt).collect.toList
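
One caveat for options B and C: if the column ever holds a value that does not parse as an integer, .toInt throws a NumberFormatException (and the cast in option B silently yields null). A defensive variant of option C, as a sketch using scala.util.Try (assumes spark.implicits._ is in scope, which Databricks notebooks provide automatically):

import scala.util.Try

// Malformed values are dropped instead of failing the job.
trainingCoordDataFrame.select("_c1")
  .map(r => Try(r.getString(0).toInt).toOption)
  .collect()
  .flatten
  .toList
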
EnvyChan

The column's values are of type String, so read the column as a String and use Scala's String.toInt method. A cast is definitely wrong here.

val trainingCoordList = trainingCoordDataFrame.select("_c1").collect().map(each => each.getAs[String]("_c1").toInt).toList

Alternatively, use the Dataset API with a custom schema, e.g. with tuples, as sketched below:
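
A minimal sketch of that tuple-based approach, reusing the custom schema idea from the answer above (coordDataset is a name introduced here; this reads the whole file rather than the training split, and assumes no cell is empty, since the tuple fields are non-nullable Ints):

import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}
import spark.implicits._ // pre-imported in Databricks notebooks

val schema = StructType(Array(
  StructField("_c0", IntegerType),
  StructField("_c1", IntegerType),
  StructField("_c2", IntegerType),
  StructField("_c3", StringType)))

val coordDataset = spark.read.schema(schema)
  .csv("FileStore/tables/First_Spacial_Dataset_ByAris.csv")
  .as[(Int, Int, Int, String)] // tuple field order mirrors the CSV columns

val trainingCoordList: List[Int] = coordDataset.map(_._2).collect().toList // the 2nd column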

abanjan