I have a .dat file that uses (\u0002\n) as the row delimiter and (\u0001) as the column delimiter. With the approach below, I get only 1 record in the Spark DataFrame.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
// escapeJava and unescapeJava are presumably from Apache Commons Lang's StringEscapeUtils

// Row delimiter, defined as shown in the comments below
val rowDelim = escapeJava("\u0002\\n")
sc.hadoopConfiguration.set("textinputformat.record.delimiter", unescapeJava(rowDelim))

val header = Seq("col0", "col1", "col2")
val schema = StructType(header.map(name => StructField(name, StringType)))

// Load data as RDD
val dataFileTypVal = escapeJava("\u0001")
val datafile = sc.textFile("some dat file path")

// Convert to Row RDD
val rdd1 = datafile.map(_.split(unescapeJava(dataFileTypVal))).map(arr => Row.fromSeq(arr))
val rdd2 = datafile.map(_.split(unescapeJava(dataFileTypVal)).to[List]).map(arr => Row.fromSeq(arr))

// Create DataFrame from Row RDD and schema
val df1 = sqlContext.createDataFrame(rdd1, schema)
val df2 = sqlContext.createDataFrame(rdd2, schema)

But df1.show() returns only the first row:

// df1 and df2 both return only 1 row

+----+----+-----+
|col0|col1| col2|
+----+----+-----+
| A1 | B1 | C1  |
+----+----+-----+

But my file has 3 rows, and I can see all 3 of them when I print the RDD:

rdd1.collect().foreach(println)
[A1,B1,C1
 A2,B2,C2
 A3,B3,C3
]

How do I get all records from the .dat file into a DataFrame?

Muru
  • What does the `unescapeJava` function do, and what is `dataFileTypVal`? – Ramesh Maharjan Mar 06 '18 at 05:50
  • Possible duplicate of [Spark: Reading files using different delimiter than new line](https://stackoverflow.com/questions/25259425/spark-reading-files-using-different-delimiter-than-new-line) – Xavier Guihot Mar 06 '18 at 06:48
  • scala> val rowDelim = escapeJava("\u0002\\n"); rowDelim: String = \u0002\\n scala> val dataFileTypVal = escapeJava("\u0001") dataFileTypVal: String = \u0001 – Muru Mar 06 '18 at 18:54
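
The REPL output in the last comment suggests one possible cause: `"\u0002\\n"` is U+0002 followed by a literal backslash and the letter `n`, not U+0002 followed by a line feed, and the `escapeJava`/`unescapeJava` round trip returns that same string. If so, the configured record delimiter never occurs in the file and the whole file is read as a single record. The following is a minimal sketch in plain Scala (no Spark), assuming the file content looks like the hypothetical `content` string below; the names `fromQuestion`, `intended`, `content`, and `records` are made up for illustration.

```scala
import java.util.regex.Pattern

// Delimiter as built in the question: U+0002, backslash, 'n' (3 characters).
val fromQuestion = "\u0002\\n"
// Delimiter as described in the question text: U+0002, line feed (2 characters).
val intended = "\u0002\n"

// Hypothetical file content (assumption): three records, each ending with the row delimiter.
val content =
  "A1\u0001B1\u0001C1\u0002\n" +
  "A2\u0001B2\u0001C2\u0002\n" +
  "A3\u0001B3\u0001C3\u0002\n"

// Split text on a literal delimiter; Pattern.quote disables regex interpretation.
def records(text: String, delim: String): Array[String] =
  text.split(Pattern.quote(delim)).filter(_.nonEmpty)

val wrong = records(content, fromQuestion) // delimiter never matches
val right = records(content, intended)     // one element per record

// Split each record into columns on U+0001.
val rows = right.map(_.split("\u0001").toList)

println(s"records with the question's delimiter: ${wrong.length}")
println(s"records with the intended delimiter: ${right.length}")
rows.foreach(println)
```

If this is indeed the cause, passing the two-character string `"\u0002\n"` to `textinputformat.record.delimiter` before calling `sc.textFile` would be the thing to try; this has not been verified against the actual file.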
