
I have a Spark job that reads data from an external Hive table, does some transformations, and saves the result into another internal Hive table:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

val sparkConf = new SparkConf().setAppName("Bulk Merge Daily Load Job")
val sparkContext = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sparkContext)

// Data Ingestion
val my_df = sqlContext.sql("select * from test")

// Transformation 
...
...

// Save Data into Hive
my_df.write.format("orc")
.option("orc.compress","SNAPPY")
.mode(SaveMode.Overwrite)
.saveAsTable("my_internal_table")

The external table is created with this tblproperties line:

tblproperties ("skip.header.line.count"="1");

My problem is that the rows of my_internal_table contain an additional line holding the column names.

I guess this is related to this issue:

I am using Spark 1.6.0.

Can you help me with this:

  • Is this bug still occurring in 1.6.0?
  • Is there any simple way to avoid this?

PS: I am processing large files (> 10 GB).

Thanks in advance for your response.

Nabil

3 Answers


I ran into the same issue, but if you save the data as an ORC table it should work. Just create a new table with the same schema as your original one, but with the format set to ORC, then backfill the data from the original table into the ORC one.

When you read the ORC table from Spark, it should not bring in the header row.
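
A rough sketch of that approach, run from the Hive CLI or beeline rather than through Spark (so that skip.header.line.count is honored during the backfill); the table name test_orc and its column list are just placeholders:

-- New table with the same schema as the original, stored as ORC
CREATE TABLE test_orc (id STRING, name STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

-- Backfill from the original external table; Hive drops the header line here
INSERT OVERWRITE TABLE test_orc SELECT * FROM test;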

Hope that helps!

suba1

I have a workaround for this limitation. It is a little expensive, but it can be embedded in the job.

scala> val dfData1 = spark.sql("select * from db.mytable")
scala> dfData1.show()
+---+----+
| id|name|
+---+----+
| id|name|
|  1| Sam|
|  2| Pam|
| id|name|
|  3| Jim|
+---+----+
scala> val fileHeader = dfData1.head.getString(0)
fileHeader: String = id

scala> val dfRealData = dfData1.filter(col(dfData1.columns(0)) =!= fileHeader)
dfRealData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string, name: string]

scala> dfRealData.show()
+---+----+
| id|name|
+---+----+
|  1| Sam|
|  2| Pam|
|  3| Jim|
+---+----+

Hope it helps somebody.

KiranM

Possible workarounds:

  • connect to HiveServer2 via JDBC and let Hive do the de-serialization -- could be messy to get the CLASSPATH right, could be hell to deal with the data volume w/o parallelization out of the box...
  • read the CSV directly, using the "old school" CSV parser, and find a way to skip the header (cf. that question for example) -- old school, yeah
  • read the CSV directly, using the Spark-CSV plugin with option "header" -- clearly that's the way to go (but mind the dependency on Apache commons-csv.jar, which should be made explicit in the documentation...); see the sketch after this list
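
For the Spark-CSV route, a minimal sketch against Spark 1.6; the HDFS path is a placeholder, and the spark-csv package must be on the classpath (e.g. --packages com.databricks:spark-csv_2.10:1.5.0):

import org.apache.spark.sql.SaveMode

// Hypothetical location of the files behind the external table
val csvDf = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // first line is treated as column names, not data
  .option("inferSchema", "true")
  .load("hdfs:///path/to/external/table/")

csvDf.write.format("orc")
  .option("orc.compress", "SNAPPY")
  .mode(SaveMode.Overwrite)
  .saveAsTable("my_internal_table")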

Note that when hitting the CSV file(s) directly, you lose the benefit of having the Hive Metastore hide the actual HDFS directory (or directories, in case the table is partitioned) -- so you are back to hard-coding paths in your code.

Or, if you want a more portable solution, you may use the Hive Java API to connect to the MetaStore, navigate to the StorageDescriptor for the table you want (and possibly its partitions), get the (list of) HDFS directory(ies), then hit the files therein. Should you choose to enter that Quest, young paladin...?   B-)
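
If you do, here is a rough sketch of the MetaStore lookup, assuming hive-site.xml is on the classpath (database and table names are placeholders):

import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

val metastore = new HiveMetaStoreClient(new HiveConf())
val table = metastore.getTable("default", "test")   // db and table names are assumptions
val location = table.getSd.getLocation              // HDFS directory behind the table
metastore.close()

// Feed `location` to the CSV reader instead of a hard-coded path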

Samson Scharfrichter