
I am using Spark 2.0.2. How can I specify the Hadoop configuration item textinputformat.record.delimiter for the TextInputFormat class when reading a CSV file into a Dataset?

In Java I can write: spark.read().csv(<path>); However, there doesn't seem to be a way to provide a Hadoop configuration specific to that read.

It is possible to set the item using spark.sparkContext().hadoopConfiguration(), but that configuration is global to the whole Spark application.

Thanks,

mrsrinivas
Paul S
  • In Spark 2.1.0 it is no longer possible to use the spark-csv package at all. If the text file is in UTF-8 (default) then the Hadoop class TextInputFormat is not used and an internal Spark one is. I have changed to using the newAPIHadoopFile method and parsing the CSV file in a mapPartitions method. – Paul S Feb 08 '17 at 15:48
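The workaround described in the comment above hinges on Hadoop's ability to split input on an arbitrary delimiter string instead of newline. As a minimal illustration of what setting textinputformat.record.delimiter to "EOL" does to the raw text (plain Java, no Spark or Hadoop required; the class and method names here are hypothetical stand-ins, not part of either API):

```java
import java.util.ArrayList;
import java.util.List;

public class RecordSplitter {
    // Split raw text into records on an arbitrary delimiter string,
    // mimicking what Hadoop's TextInputFormat does when
    // textinputformat.record.delimiter is set.
    static List<String> splitRecords(String text, String delimiter) {
        List<String> records = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = text.indexOf(delimiter, start)) >= 0) {
            records.add(text.substring(start, idx));
            start = idx + delimiter.length();
        }
        // Keep a trailing record that has no closing delimiter.
        if (start < text.length()) {
            records.add(text.substring(start));
        }
        return records;
    }

    public static void main(String[] args) {
        // One header row and one data row whose quoted field contains an
        // embedded newline; rows are terminated by the custom delimiter "EOL".
        String data = "id,comment\nEOL1,\"line one\nline two\"EOL";
        // Yields two records, each of which may contain embedded \n characters.
        System.out.println(RecordSplitter.splitRecords(data, "EOL"));
    }
}
```

With newline as the delimiter the quoted field would be torn across records; with "EOL" each CSV row survives intact and can then be parsed inside mapPartitions.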

2 Answers


You cannot. The Data Source API uses its own configuration which, as of 2.0, is not even compatible with the Hadoop configuration.

If you want to use a custom input format or other Hadoop configuration, use SparkContext.hadoopFile, SparkContext.newAPIHadoopRDD, or related methods.

  • Thanks. I know I can specify a specific Hadoop configuration with the newAPIHadoopFile method. However this bypasses the CSV support and requires conversion to a Dataset. – Paul S Dec 09 '16 at 15:51

The column delimiter can be set using option() in Spark 2.0:

var df = spark.read.option("header", "true").option("delimiter", "\t").csv("/hdfs/file/location")
  • delimiter or sep specifies the character that delimits columns. I need to specify the string (not character) that delimits lines. By default Hadoop TextInputFormat uses newline (\n). If the textinputformat.record.delimiter Hadoop configuration item is set to say "EOL" then input records will be delimited by the characters EOL and not newline. I have several CSV files to load. Some have embedded newline characters in quoted strings. To have them read correctly I need to specify an alternate record delimiter for Hadoop TextInputFormat. Without this I cannot use Spark CSV input source. – Paul S Jan 12 '17 at 17:23
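To see concretely why the column-delimiter option does not help here: splitting the comment's example data on newline tears a quoted field apart, while splitting on a custom record delimiter keeps each row whole. A small sketch in plain Java (the "EOL" delimiter and sample data are illustrative; String.split stands in for Hadoop's record-splitting behaviour):

```java
import java.util.Arrays;
import java.util.List;

public class DelimiterDemo {
    public static void main(String[] args) {
        // A header row and two data rows; the first data row's quoted field
        // contains an embedded newline. Rows end with the custom string "EOL".
        String raw = "id,commentEOLa,\"x\ny\"EOLb,zEOL";

        // Splitting on newline (Hadoop's default record delimiter) cuts the
        // quoted field in half, producing 2 broken chunks instead of 3 rows.
        List<String> byNewline = Arrays.asList(raw.split("\n"));
        System.out.println(byNewline.size() + " chunks when split on \\n");

        // Splitting on "EOL" keeps all 3 CSV rows intact, embedded newline
        // and all, which is what textinputformat.record.delimiter enables.
        List<String> byEol = Arrays.asList(raw.split("EOL"));
        System.out.println(byEol.size() + " records when split on EOL");
    }
}
```

This is why the asker needs a record-level delimiter (Hadoop's textinputformat.record.delimiter) rather than the column-level delimiter/sep option exposed by spark.read.csv.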