
I have created a table in AWS Athena like this:

CREATE EXTERNAL TABLE IF NOT EXISTS default.test_line_breaks (
  col1 string, 
  col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
 'separatorChar' = ',',
 'quoteChar' = '\"',
 'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://bucket/test/';

In the bucket I put a simple CSV file with the following content:

rec1 col1,rec1 col2
rec2 col1,"rec2, col2"
rec3 col1,"rec3
col2"

When I run the data preview query SELECT * FROM "default"."test_line_breaks" limit 10; Athena returns the following response:

(screenshot: Athena query response)

How should I set ROW FORMAT to properly handle line breaks within the field values, so that rec3\ncol2 appears in col2?

John Rotenstein
antklim

1 Answer


The problem here is that the OpenCSV serializer-deserializer "does not support embedded line breaks in CSV files." See this documentation from AWS.

However, it might be possible to use RegexSerDe. Just remember that this deserializer takes Java-flavored regex, so be sure to use an online regex tool that supports that syntax when debugging.

Edit: I'm still working on the syntax for dealing with the embedded line feed \n. However, here is a sample that handles two columns with optional quotes. The regex "*([^"]*)"*,"*([^"]*)"* worked on your line with the embedded line break when tested on its own. However, I think the Presto engine is only feeding the SerDe rec3 col1,"rec3 (the first physical line); see the quick check after the DDL. I'm continuing to work on it.

CREATE EXTERNAL TABLE IF NOT EXISTS default.test_line_breaks (
  col1 string, 
  col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = '"*([^"]*)"*,"*([^"]*)"*'
)
STORED AS TEXTFILE
LOCATION 's3://.../47936191';
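
As a quick sanity check, here is a minimal sketch using java.util.regex (the same flavor RegexSerDe uses) that runs the pattern against both the full record and the first physical line on its own. The class and method names are mine, for illustration only; it exercises the regex itself, not Athena.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSerDeCheck {
    public static void main(String[] args) {
        // Same pattern as in input.regex above
        Pattern p = Pattern.compile("\"*([^\"]*)\"*,\"*([^\"]*)\"*");

        // Full record with the embedded line break: this matches, because
        // the character class [^"] also matches \n.
        check(p, "rec3 col1,\"rec3\ncol2\"");

        // What the engine appears to feed the SerDe: only the first physical
        // line. It still matches, but col2 is truncated to "rec3".
        check(p, "rec3 col1,\"rec3");
    }

    static void check(Pattern p, String record) {
        Matcher m = p.matcher(record);
        if (m.matches()) {
            System.out.println("col1=[" + m.group(1) + "] col2=[" + m.group(2) + "]");
        } else {
            System.out.println("no match: " + record);
        }
    }
}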
Zerodf
  • Hi @Zerodf, thanks for your help and explanations. I think I would pre-process the file prior to uploading it to Athena. – antklim Dec 27 '17 at 00:03
  • @antklim, that might be a good idea. Often when working on these kinds of "big data" problems, I have found it best practice to fix the upstream tasks to ensure consistent output, rather than compensate for inconsistent output downstream. – Zerodf Dec 27 '17 at 02:58
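
For reference, a minimal pre-processing sketch along the lines antklim describes might look like the following. This is an illustration under stated assumptions, not part of the original answer: the file names are hypothetical, the quote handling is naive (simple toggling, which happens to treat doubled "" quotes correctly for this purpose, but not backslash-escaped quotes), and embedded line breaks are rewritten as a literal \n placeholder so each record fits on one physical line before upload.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class FlattenCsvNewlines {
    public static void main(String[] args) throws IOException {
        // Hypothetical input/output file names; adjust to your data.
        try (Reader in = new InputStreamReader(
                 new FileInputStream("input.csv"), StandardCharsets.UTF_8);
             Writer out = new OutputStreamWriter(
                 new FileOutputStream("output.csv"), StandardCharsets.UTF_8)) {
            boolean inQuotes = false;
            int c;
            while ((c = in.read()) != -1) {
                if (c == '"') {
                    inQuotes = !inQuotes; // naive toggle; "" flips twice and nets out
                    out.write(c);
                } else if (inQuotes && c == '\n') {
                    out.write("\\n"); // placeholder for the embedded line break
                } else if (inQuotes && c == '\r') {
                    // drop \r inside quoted fields so \r\n becomes a single \n placeholder
                } else {
                    out.write(c);
                }
            }
        }
    }
}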