
Imagine a CSV as follows:

a,b,c,d
1,1,0,0
0,1,1,0
...

I want to automatically obtain a DataFrame with 4 columns a, b, c, d.

A manual technique can be:

import spark.implicits._   // needed for toDF on an RDD

val rdd  = sc.textFile(path).map(_.split(","))
val cols = (0 until rdd.first.size).map(_.toString).toArray
val df   = rdd.map { case Array(a, b, c, d) => (a, b, c, d) }.toDF(cols: _*)

The problem with this technique is that I have to spell out the columns a, b, c, d manually, which becomes impractical with hundreds of features or more.

I imagine there is a more convenient method, probably via SparkSession, but I don't want to have to specify any schema.

KyBe

2 Answers


Spark can automatically infer the schema for you when reading a data file. If your CSV file has a header row, you can use

val df = spark.read.option("header", "true").csv(path)

Given your example, it'll result in (using df.show()):

+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  1|  1|  0|  0|
|  0|  1|  1|  0|
+---+---+---+---+
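
Note that with only the header option every column is read as a string. If you want the column types inferred as well (so a, b, c and d come back as integers for this sample), you can additionally set the inferSchema option; a minimal sketch, assuming Spark 2.x (the val name typedDf is just for illustration):

// Same as above, but let Spark sample the file to infer column types.
val typedDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(path)

typedDf.printSchema()
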
Midiparse

You can use Row and schema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// the first line of the file supplies the column names
val header = rdd.first

spark.createDataFrame(
  rdd.map(row => Row.fromSeq(row.take(header.size))),
  StructType(header.map(StructField(_, StringType)))   // one string field per column
)
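
Note that rdd (as built in the question) still contains the header line as an ordinary row, so it also ends up in the DataFrame as data. A minimal sketch of one way to drop it first, assuming the header is the very first line of the file:

// Drop the header row before building the DataFrame so that
// ("a", "b", "c", "d") does not show up as a data row.
val data = rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}

spark.createDataFrame(
  data.map(row => Row.fromSeq(row.take(header.size))),
  StructType(header.map(StructField(_, StringType)))
)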

but here you should really just use the Spark CSV reader.

but I don't want to have to specify any schema.

There is really nothing you can do about that. DataFrames require a schema. It can be provided either explicitly as a DataType or implicitly by reflection, and with an unknown number of columns you'd need a lot of metaprogramming magic to generate the required case classes at runtime.
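
For comparison, the reflection route looks roughly like this for the four-column example (a minimal sketch; the CsvRow case class is hypothetical, and its arity is fixed at compile time, which is exactly why this approach does not scale to hundreds of columns):

// Reflection-based schema: column names a, b, c, d come from the case class fields.
// As in the question's own snippet, the header line is mapped like any other row here.
case class CsvRow(a: String, b: String, c: String, d: String)

import spark.implicits._

val reflectedDf = rdd
  .map { case Array(a, b, c, d) => CsvRow(a, b, c, d) }
  .toDF()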

Alper t. Turker