I want to read multiple CSV files with Spark, but the header is present only in the first file, like:
file 1:
id, name
1, A
2, B
3, C
file 2:
4, D
5, E
6, F
PS: I want to use Java APIs to do so.
You can use header=true and inferSchema=true to get the correct data types from the file that has headers. Then get this schema into a StructType in Java and use it to import the other CSV files without headers. This was tested in Spark version 2.3.2.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("SimpleApp")
        .master("local")
        .getOrCreate();

// Read the file that has the header, inferring column names and types
Dataset<Row> csvData = spark.read()
        .format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load("C:\\MyData\\numData.csv");
csvData.printSchema();
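To complete the approach, here is a minimal sketch of the second step; numData2.csv is a placeholder path for your header-less files:

import org.apache.spark.sql.types.StructType;

// Capture the schema inferred from the header file
StructType schema = csvData.schema();

// Reuse that schema for the header-less files (header defaults to false)
Dataset<Row> moreData = spark.read()
        .format("csv")
        .schema(schema)
        .load("C:\\MyData\\numData2.csv");

// Combine both into a single Dataset
Dataset<Row> allData = csvData.union(moreData);
allData.show();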
You'll want to do something like the following.
Scala solution:
val sqlContext = new SQLContext(sc)
val file1DF = sqlContext
.read
.format("csv")
.option("header", "true")
.load("file1.csv")
val schema = file1DF.schema
val file2DF = sqlContext
.read
.format("csv")
.schema(schema)
.load("file2.csv")
Java will be similar, except you'll want to use a StructType for the schema.
DataFrame file1DF = sqlContext.read()....;
StructType schema = file1DF.schema();
DataFrame file2DF = sqlContext.read()....schema(schema)....;
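Spelled out on the Spark 2.x Java API (where DataFrame is Dataset<Row>), a sketch of the same idea; file1.csv and file2.csv are placeholder paths:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

SparkSession spark = SparkSession.builder()
        .appName("SimpleApp")
        .master("local")
        .getOrCreate();

// Read the file that has the header
Dataset<Row> file1DF = spark.read()
        .format("csv")
        .option("header", "true")
        .load("file1.csv");

// Apply its schema to the header-less file
StructType schema = file1DF.schema();
Dataset<Row> file2DF = spark.read()
        .format("csv")
        .schema(schema)
        .load("file2.csv");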
Surely you would know the name of the one file that has the header, if not all of them.
In that case, extending the existing answer, and assuming the file with the header is named h.csv:
val sqlContext = new SQLContext(sc)
val file1DF = sqlContext
.read
.format("csv")
.option("header", "true")
.load("<path to folder>/h.csv")
val schema = file1DF.schema
val file2DF = sqlContext
.read
.format("csv")
.schema(schema)
.load("<path to folder>/{[^h],h[^.]}*.csv")
Now, the glob functionality provided is not very powerful, and I don't think the Spark DataFrameReader.load public API gives us any way to exclude specific file names.
More links about the glob patterns this API accepts are in this answer.
The internal APIs from Hadoop, though, do seem to allow for filtering (see the relevant book section linked in the answer above).
For your case, you may want to give the file with the header a simple name, e.g. h.csv, so that the glob in the second read above, {[^h],h[^.]}*.csv, picks up all files that either don't start with h, or start with h but whose second character is not a dot.
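In Java, the same glob-based exclusion might look like the sketch below, assuming a SparkSession named spark as in the earlier answer (<path to folder> is a placeholder):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;

// Read the known header file and capture its schema
Dataset<Row> headerDF = spark.read()
        .format("csv")
        .option("header", "true")
        .load("<path to folder>/h.csv");
StructType schema = headerDF.schema();

// The glob excludes h.csv so its header row isn't re-read as data
Dataset<Row> restDF = spark.read()
        .format("csv")
        .schema(schema)
        .load("<path to folder>/{[^h],h[^.]}*.csv");

// Combine the header file's rows with the rest
Dataset<Row> allDF = headerDF.union(restDF);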