19

Below is the Hive table I have created:

CREATE EXTERNAL TABLE Activity (
  column1 type,
  column2 type
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/exttable/';

In my HDFS location /exttable, I have a lot of CSV files, and each CSV file also contains a header row. When I run select queries, the result contains the header rows as well.

Is there any way in Hive to ignore the header row (the first line of each file)?

vsminkov
  • 1
    Possible duplicate of [this question](http://stackoverflow.com/questions/15751999/hive-external-table-skip-first-row/15753145#15753145). Basically same answer there as here with some more suggestions. @kgu87's solution works too. – Daniel Koverman May 10 '13 at 14:48
  • Does this answer your question? [How to skip CSV header in Hive External Table?](https://stackoverflow.com/questions/15751999/how-to-skip-csv-header-in-hive-external-table) – Jacek Laskowski Jun 29 '21 at 08:39

6 Answers

27

You can skip the header row as of Hive 0.13.0 by adding this table property:


tblproperties ("skip.header.line.count"="1");
tariqz
    Note that this can be done to an existing table as well using ALTER TABLE. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTableProperties – Mike Nov 08 '16 at 15:56
25

If you are using Hive version 0.13.0 or higher, you can specify "skip.header.line.count"="1" in your table properties to skip the header row when the table is read.

For detailed information on the patch see: https://issues.apache.org/jira/browse/HIVE-5795

Climbs_lika_Spyder
Varun Gupta
7

Let's say you want to load a CSV file like the one below, located at /home/test/que.csv:

1,TAP (PORTUGAL),AIRLINE
2,ANSA INTERNATIONAL,AUTO RENTAL
3,CARLTON HOTELS,HOTEL-MOTEL

Now, we need to create a location in HDFS that holds this data.

hadoop fs -put /home/test/que.csv /user/mcc

The next step is to create a table. There are two types to choose from. Refer to this for choosing one.

Example for an external table:

create external table industry_ 
(
MCC string ,
MCC_Name string,
MCC_Group string
)       
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/mcc/'
tblproperties ("skip.header.line.count"="1");

Note: When the table is accessed via Spark SQL, the header row of the CSV is shown as a data row. Tested on Spark 2.4.

s_mj
3

There is not in Hive versions before 0.13.0. However, you can pre-process your files to drop the first row before loading them into HDFS:

tail -n +2 withfirstrow.csv > withoutfirstrow.csv
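
Since the question mentions many CSV files, the same `tail` command can be applied in a loop. The following is a minimal sketch, not a tested production script; the function name `strip_headers` and both directory arguments are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# strip_headers SRC_DIR DEST_DIR
# Copies every *.csv file in SRC_DIR to DEST_DIR without its first line.
# Both directory names are placeholders chosen for this sketch.
strip_headers() {
    src_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    for f in "$src_dir"/*.csv; do
        # Skip the unexpanded pattern when the directory has no CSV files.
        [ -e "$f" ] || continue
        # tail -n +2 prints from line 2 onward, which drops the header.
        tail -n +2 "$f" > "$dest_dir/$(basename "$f")" || return 1
    done
}
```

After a call such as `strip_headers ./raw ./clean`, the cleaned files could then be uploaded with `hadoop fs -put ./clean/*.csv /exttable/`, assuming the local paths match your setup.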

Alternatively, you can add a WHERE clause to your Hive queries to filter out the header row.

kgu87
  • How do you write a where clause in HIVE to ignore the first row? – KalEl Jan 05 '15 at 15:40
  • 1
    If the data contains header row, then you can use where col <> 'header column name' in HQL, assuming the data itself does not contain that value. A better option for Hive 0.13.0 or higher would be to use skip.header.line.count when defining the table. – kgu87 Jan 07 '15 at 13:49
  • I used the Hive table property `tblproperties ("skip.header.line.count"="1");`, but I can still see the header in the files at the table's HDFS location. How can I keep the header line from being written to HDFS? – Neethu Lalitha Apr 03 '15 at 15:18
0

If your Hive version doesn't support tblproperties ("skip.header.line.count"="1"), you can use the Unix command below to drop the first line (the column header) before putting the file in HDFS.

 sed -n '2,$p' File_with_header.csv > File_with_No_header.csv
sras
-2

To remove the header from the CSV file in place, use:

sed -i 1d filename.csv 
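
One caveat with deleting line 1 in place is that running the command a second time removes a data row. A guarded variant might look like the sketch below, which deletes the first line only when it matches a header string supplied by the caller. The function name `drop_header_if_present` and its arguments are hypothetical, and the sketch assumes you know the exact header text.

```shell
#!/bin/sh
# drop_header_if_present FILE HEADER
# Removes the first line of FILE only when it equals HEADER exactly.
# HEADER is an assumption of this sketch: the caller must know the header text.
drop_header_if_present() {
    file=$1
    expected=$2
    first_line=$(head -n 1 "$file") || return 1
    if [ "$first_line" = "$expected" ]; then
        tmp=$(mktemp) || return 1
        # Write everything from line 2 onward to a temp file, then copy it back
        # so that the original file keeps its permissions.
        tail -n +2 "$file" > "$tmp" && cat "$tmp" > "$file"
        status=$?
        rm -f "$tmp"
        return $status
    fi
}
```

For example, `drop_header_if_present filename.csv "column1,column2"` leaves the file untouched if the header has already been removed.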
Buddy