Aws Athena - Create external table skipping first row

Question

I'm trying to create an external table on csv files with Aws Athena with the code below but the line TBLPROPERTIES ("skip.header.line.count"="1") doesn't work: it doesn't skip the first line (header) of the csv file.

CREATE EXTERNAL TABLE mytable
(
  colA string,
  colB int
  )

 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
 WITH SERDEPROPERTIES (
   'separatorChar' = ',',
   'quoteChar' = '\"',
   'escapeChar' = '\\'
   )
STORED AS TEXTFILE
LOCATION 's3://mybucket/mylocation/'
TBLPROPERTIES (
  "skip.header.line.count"="1")

Any advise?

exist a way to add header to TEXTFILE in create table as cte ? — Cristián Vargas Acevedo, Jun 14 '22 at 18:44

score 54 · Accepted Answer · edited Nov 14 '17 at 09:09

54

Just tried the "skip.header.line.count"="1" and seems to be working fine now.

edited Nov 14 '17 at 09:09

Amit Joshi

15,448
21
77
141

answered Nov 14 '17 at 08:16

Filippo Loddo

966
9
14

Actually it doesn't work anymore. Not sure when this will be really fixed. – Filippo Loddo Dec 14 '17 at 15:19
6

TBLPROPERTIES ('skip.header.line.count'='1') .. worked fine for me – Hussein Al-Far Feb 04 '18 at 12:41
4

it starts to work since 2018-01-19. see https://docs.aws.amazon.com/athena/latest/ug/release-note-2018-01-19.html – shawnzhu Feb 07 '18 at 19:09

score 7 · Answer 2 · answered Aug 27 '19 at 02:23

On the AWS Console you can specify it as Serde parameters key-value keypair

While if you apply your infrastructure as code with terraform you can use ser_de_info parameter - "skip.header.line.count" = 1. Example bellow

resource "aws_glue_catalog_table" "banana_datalake_table" {
  name          = "mapping"
  database_name = "banana_datalake"
  table_type    = "EXTERNAL_TABLE"
    owner = "owner"
    storage_descriptor {
        location        = "s3://banana_bucket/"
        input_format    = "org.apache.hadoop.mapred.TextInputFormat"
        output_format   = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
        compressed  = "false"
        number_of_buckets = -1    
        ser_de_info {
            name    = "SerDeCsv"
            serialization_library = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
            parameters {
                "field.delim" = ","
                "skip.header.line.count" = 1    # Skip file headers
            }
        }
        columns {
            name    = "column_1"
            type    = "string"
        }
        columns {
            name    = "column_2"
            type    = "string"
        }
        columns {
            name    = "column_3"
            type    = "string"
        }
    }
}

score 5 · Answer 3 · answered Oct 10 '17 at 21:34

5

This is a feature that has not yet been implemented. See Abhishek@AWS' response here:

"We are working on it and will report back as soon as we have an outcome. Sorry for this again. This ended up taking longer than what we anticipated."

My workaround has been to preprocess the data before creating the table:

download the csv file from S3
strip the header using bash sed -e 1d -e 's/\"//g' file.csv > file-2.csv
upload the results to its own folder on S3
create the table

answered Oct 10 '17 at 21:34

Luba Gloukhova

116
2

3

This feature has been available on AWS Athena since 2018-01-19. see https://docs.aws.amazon.com/athena/latest/ug/release-note-2018-01-19.html > Support for ignoring headers. You can use the skip.header.line.count property when defining tables, to allow Athena to ignore headers. – shawnzhu Feb 07 '18 at 19:08
This answer is no longer correct and should be unaccepted as the correct one. skip.header.line.count does work. – Nathan Feger Aug 11 '18 at 00:16

score 5 · Answer 4 · edited Jan 25 '19 at 17:04

5

I recently tried:

TBLPROPERTIES ('skip.header.line.count'='1')

And it works fine now. This issue arose when I had the column header as a string (timestamp) and the records where actual timestamps. My queries would bomb as it would scan the table and find a string instead of timestamp.

Something like this:

ts
2015-06-14 14:45:19.537
2015-06-14 14:50:20.546

edited Jan 25 '19 at 17:04

סטנלי גרונן

2,917
23
46
68

answered Jan 25 '19 at 15:43

Raj

93
1
11

score 1 · Answer 5 · answered Feb 13 '19 at 06:55

When this question was asked there was no support for skipping headers, and when it was later introduced it was only for the OpenCSVSerDe, not for LazySimpleSerDe, which is what you get when you specify ROW FORMAT DELIMITED FIELDS …. I think this is what has caused some confusion about whether or not it works in the answers to this question.

Aws Athena - Create external table skipping first row

5 Answers5

Linked