I have a .dat file that uses \u0002\n as the row delimiter and \u0001 as the column delimiter. When I load it with the approach below, I get only one record in the Spark DataFrame.
import org.apache.commons.lang3.StringEscapeUtils.{escapeJava, unescapeJava}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Escaped form of the \u0002\n row delimiter; unescaped before use
val rowDelim = "\\u0002\\n"
sc.hadoopConfiguration.set("textinputformat.record.delimiter", unescapeJava(rowDelim))

val header = Seq("col0", "col1", "col2")
val schema = StructType(header.map(name => StructField(name, StringType)))

// Load data as RDD
val dataFileTypVal = escapeJava("\u0001")
val datafile = sc.textFile("some dat file path")

// Convert to Row RDD, splitting each record on the column delimiter
val rdd1 = datafile.map(_.split(unescapeJava(dataFileTypVal))).map(arr => Row.fromSeq(arr))
val rdd2 = datafile.map(_.split(unescapeJava(dataFileTypVal)).to[List]).map(arr => Row.fromSeq(arr))

// Create DataFrame from Row RDD and schema
val df1 = sqlContext.createDataFrame(rdd1, schema)
val df2 = sqlContext.createDataFrame(rdd2, schema)
But df1.show() returns only the first row (df2 behaves the same):
+----+----+----+
|col0|col1|col2|
+----+----+----+
|  A1|  B1|  C1|
+----+----+----+
Yet my file has 3 rows, and I can see all of the data when I print the RDD:
rdd1.collect().foreach(println)
[A1,B1,C1
A2,B2,C2
A3,B3,C3
]
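For what it is worth, the output above sits inside a single pair of brackets, so it may actually be one Row containing embedded newlines rather than three separate Rows. A quick record count on the same datafile as above would confirm whether the custom record delimiter took effect (I would expect 3):

// Should print 3 if textinputformat.record.delimiter split the file into rows
println(datafile.count())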
How do I get all of the records from the .dat file into the DataFrame?
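For reference, a sketch of the alternative I am considering: passing the delimiter through a per-read Hadoop Configuration with newAPIHadoopFile instead of relying on sc.textFile (untested; the variable names are mine):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Copy the context configuration and set the record delimiter explicitly
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\u0002\n")

// newAPIHadoopFile yields (byte offset, record) pairs; keep only the text
val records = sc
  .newAPIHadoopFile("some dat file path", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString }

// Split on the column delimiter; the -1 limit keeps trailing empty columns
val rows = records.map(line => Row.fromSeq(line.split("\u0001", -1)))
val df = sqlContext.createDataFrame(rows, schema)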