
I have a CSV file with the contents below, which has a header in the first line.

id,name
1234,Rodney
8984,catherine

Now I was able to create a table in Hive that skips the header and reads the data appropriately. Table in Hive:

CREATE EXTERNAL TABLE table_id(
  `tmp_id` string, 
  `tmp_name` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
WITH SERDEPROPERTIES ( 
  'field.delim'=',', 
  'serialization.format'=',') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION  's3://some-testing/test/data/'
TBLPROPERTIES ("skip.header.line.count"="1");

Results in Hive

select * from table_id;
OK
1234    Rodney
8984    catherine
Time taken: 1.219 seconds, Fetched: 2 row(s)

But when I use the same table in PySpark (I ran the same query), the header row from the file shows up in the results, as below.

>>> spark.sql("select * from table_id").show(10,False)
+------+---------+                                                              
|tmp_id|tmp_name |
+------+---------+
|id    |name     |
|1234  |Rodney   |
|8984  |catherine|
+------+---------+

Now, how can I keep the header row from showing up in the results in PySpark? I'm aware that we can read the CSV file and add .option("header", True) to achieve this, but I want to know if there's a way to do something similar in PySpark while querying tables.

Can someone suggest a way? Thanks in advance!

Bjay

2 Answers


You can use the two kinds of properties below, SerDe properties and table properties; with both set, you will be able to access the table from Hive and from Spark, skipping the header in both environments.

CREATE EXTERNAL TABLE `student_test_score_1`(
  student string,
  age string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'delimiter'=',',
  'field.delim'=',',
  'header'='true',
  'skip.header.line.count'='1',
  'path'='hdfs:<path>')
LOCATION
  'hdfs:<path>'
TBLPROPERTIES (
  'spark.sql.sources.provider'='CSV');
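
If both sets of properties are picked up, the same query should now skip the header in PySpark as well. A quick check (table name taken from the DDL above):

>>> spark.sql("select * from student_test_score_1").show(10, False)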
Benjamin Breton

This is a known issue, SPARK-11374, and it was closed as Won't Fix.

In the query, you can add a where clause to select all records except the header values 'id' and 'name'.

spark.sql("select * from table_id where tmp_id <> 'id' and tmp_name <> 'name'").show(10,False)
# or
spark.sql("select * from table_id where tmp_id != 'id' and tmp_name != 'name'").show(10,False)

Another way would be reading the files from HDFS directly with .option("header","true"), as sketched below.
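
For example, a minimal sketch of that direct read, using the S3 path from the question (the same option applies to an HDFS path):

# Read the files directly; the CSV reader drops the header row itself.
df = spark.read.option("header", "true").csv("s3://some-testing/test/data/")
df.show(10, False)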

notNull
  • Thanks for the response, Shu. Well, I can definitely add an additional filter to the query, but I want to avoid extra filters [where tmp_id <> 'id' and tmp_name <> 'name'] in the query. – Bjay Jul 07 '20 at 22:10
  • @Bjay, another way I can think of is to read from the Hive table, then `convert to rdd`, filter out the rows, then `convert back to dataframe` (a sketch follows these comments). But here we are doing a lot of conversions (not recommended). https://stackoverflow.com/questions/27854919/how-do-i-skip-a-header-from-csv-files-in-spark – notNull Jul 07 '20 at 22:30
  • Yeah, too many conversions in this approach. With a huge volume of real-time data (~18TB in my case), I guess it's going to consume quite a lot of time and computation. The filters mentioned above might be a better fit. – Bjay Jul 07 '20 at 23:21
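
For completeness, a rough sketch of the RDD round-trip mentioned in the comments above (not recommended at ~18TB, since every row is shipped out of and back into the JVM):

# Query the table, drop to an RDD, filter out the header row by value,
# and rebuild a DataFrame with the original schema.
df = spark.sql("select * from table_id")
cleaned = (df.rdd
             .filter(lambda row: list(row) != ["id", "name"])  # drop the header line
             .toDF(df.schema))
cleaned.show(10, False)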