
I have a CSV file that contains only data, without column names. I want to create a DataFrame in Spark from this CSV file and define the schema (column names and datatypes) for it. My code is as below:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

val employee = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "false")
.option("inferSchema", "true")
.load("csv filename")

What do I need to add to this code to build the schema for my DataFrame?

David Guo

2 Answers


When no schema is provided, Spark names the columns _c0, _c1, _c2, and so on. You have to provide the schema while reading the CSV. Please have a look at this link; it should help you resolve your issue.
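A minimal sketch of passing an explicit schema to the reader, adapting the question's code. The column names and types here (id, name, salary) are placeholders, not taken from the asker's data:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._

val sqlContext = new SQLContext(sc)

// Hypothetical columns -- replace with names/types matching your file
val schema = StructType(
  StructField("id", IntegerType, true) ::
  StructField("name", StringType, true) ::
  StructField("salary", DoubleType, true) :: Nil)

val employee = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "false")
  .schema(schema)          // explicit schema, so inferSchema is not needed
  .load("csv filename")
```

With `.schema(...)` supplied, Spark skips the extra pass over the file that `inferSchema` would otherwise require.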

Apurba Pandey

You have to supply the column names yourself if you want to provide a schema. The only thing Spark can infer dynamically is the datatypes; there is no way for it to invent column names that make sense, so those have to be fixed in your code.

You just need a sequence of StructField (or a similar collection) passed to the StructType constructor.

   import org.apache.spark.sql.types._

   val yourSchema =
     StructType(
         StructField("colA", IntegerType, true) ::
         StructField("colB", LongType, false) ::
         StructField("colC", BooleanType, false) :: Nil)
uh_big_mike_boi