I want to read multiple CSV files in Spark, but the header is present only in the first file, like:

file 1:

id, name
1, A
2, B
3, C

file 2:

4, D
5, E
6, F

PS: I want to use the Java APIs to do this.

Sandeep

3 Answers


You can use header=true and inferSchema=true to get the correct data types from the file that has the header. Then read that schema into a StructType in Java and use it to import the other CSV files without headers. This was tested on Spark 2.3.2.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
        .appName("SimpleApp")
        .master("local")
        .getOrCreate();

    // Read the file that has the header; inferSchema derives the column types
    Dataset<Row> csvData = spark.read()
        .format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load("C:\\MyData\\numData.csv");

    csvData.printSchema();
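
To complete the approach described above, here's a minimal sketch of the second step, continuing from the snippet: pull the inferred schema into a StructType and apply it when loading the header-less files (the wildcard path is a placeholder):

    import org.apache.spark.sql.types.StructType;

    // Reuse the schema inferred from the file with the header
    StructType schema = csvData.schema();

    // Placeholder wildcard path for the remaining header-less files;
    // header defaults to false, so each row is treated as data
    Dataset<Row> otherData = spark.read()
        .format("csv")
        .schema(schema)
        .load("C:\\MyData\\otherData*.csv");

    otherData.show();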
Eranga Atugoda

You'll want to do something like the following.

Scala solution:

val sqlContext = new SQLContext(sc)

val file1DF = sqlContext
  .read
  .format("csv")
  .option("header", "true")
  .load("file1.csv")

val schema = file1DF.schema

val file2DF = sqlContext
  .read
  .format("csv")
  .schema(schema)
  .load("file2.csv")

Java will be similar, except you'll want to use a StructType for the schema.

DataFrame file1DF = sqlContext.read()....;
StructType schema = file1DF.schema();
DataFrame file2DF = sqlContext.read()....schema(schema)....;
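
Filling in the elided calls, here is a minimal sketch of the same approach in Java, with placeholder file names (on Spark 2.x the Java API returns a Dataset<Row> rather than a DataFrame):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

SparkSession spark = SparkSession.builder()
    .appName("SchemaReuse") // hypothetical app name
    .master("local")
    .getOrCreate();

// Read the file that has the header
Dataset<Row> file1DF = spark.read()
    .format("csv")
    .option("header", "true")
    .load("file1.csv");

// Capture its schema and apply it to the header-less file
StructType schema = file1DF.schema();
Dataset<Row> file2DF = spark.read()
    .format("csv")
    .schema(schema)
    .load("file2.csv");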
Richard Tran
  • That would surely work, but I'm looking for a generic solution that works for all files in a path. Also, I don't know the number of files in the path, so it won't be possible to read each file individually. – Sandeep Apr 09 '18 at 09:14

Surely you would know the name of the one file that has the header, even if not all of them. In that case, extending the existing answer, and assuming the file with the header is named h.csv:

val sqlContext = new SQLContext(sc)

val file1DF = sqlContext
  .read
  .format("csv")
  .option("header", "true")
  .load("<path to folder>/h.csv")

val schema = file1DF.schema

val file2DF = sqlContext
  .read
  .format("csv")
  .schema(schema)
  .load("<path to folder>/{[^h],h[^.]}*.csv")

Now, the glob functionality provided is not very powerful, and I don't think the public Spark DataFrameReader.load API gives us any stronger way to exclude specific names.

More links about the glob patterns this API accepts are in this answer.

The internal APIs from Hadoop, though, do seem to allow such filtering via PathFilter (see the relevant book section linked in the answer above); a sketch follows.
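
A minimal sketch of that idea in Java, assuming the header file is named h.csv: Hadoop's FileSystem.globStatus takes a PathFilter, so we can list every CSV except the header file and pass the matching paths straight to Spark (the folder path is a placeholder):

import java.util.Arrays;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

SparkSession spark = SparkSession.builder()
    .appName("HeaderlessCsv") // hypothetical app name
    .master("local")
    .getOrCreate();

// Schema taken from the header file, as in the answers above
Dataset<Row> headerDF = spark.read()
    .format("csv")
    .option("header", "true")
    .load("<path to folder>/h.csv");
StructType schema = headerDF.schema();

// List every CSV in the folder except h.csv using a PathFilter
FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
FileStatus[] statuses = fs.globStatus(
    new Path("<path to folder>/*.csv"),
    path -> !path.getName().equals("h.csv"));
String[] paths = Arrays.stream(statuses)
    .map(s -> s.getPath().toString())
    .toArray(String[]::new);

// csv(String... paths) reads all the matched files with the given schema
Dataset<Row> dataDF = spark.read().schema(schema).csv(paths);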

For your case, you may want to give the file with the header a simple name, e.g. h.csv, so that the glob in the second read above, {[^h],h[^.]}*.csv, includes all files that either don't start with h, or start with h but whose second character is not a dot.

sujit