1

With aws athena services, I try to import csv file including new line data

Importing data uses hive serde format.

If data is like this, (each data is enclosed in double quotes. "")

"DataA"|"DataB"|"DataC"
"Data1"|"Data2
with new line"|"Data3"
"Data가"|"Data2나"|"Data나"

then how to write regular expressions to below table DDL?

CREATE EXTERNAL TABLE ssdm_schema.ABCTable_regex (  
  Data_A VARCHAR(100)  
, Data_B VARCHAR(100)  
, Data_C VARCHAR(100)  
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'  
WITH SERDEPROPERTIES (  
"input.regex" = '?????????'  
) 

I'm asking to this question referring to the following answer.

How to handle embed line breaks in AWS Athena

Thank you

dtolnay
  • 9,621
  • 5
  • 41
  • 62
  • I reproduced the problem at https://regex101.com/r/bYF1Zm/1 using regex101 or a similar tool helps others see what you have already tried. – Freiheit Feb 04 '20 at 02:01
  • Almost there, https://regex101.com/r/bYF1Zm/2 Just need to handle the unicode next. – Freiheit Feb 04 '20 at 02:04

1 Answers1

0

Solved it. https://regex101.com/r/bYF1Zm/3

"([\w\s\n\r\X]+?)"\|"([\X\w\s\n\r]+?)"\|"([\X\w\s\n\r]+?)" with the global and unicode flags set.

There were three things making this tricky:

  1. The line break in the middle of the data on line 2
  2. The white space in the middle of the data on line 2
  3. The unicode

This regex can probably be more succinct because the matching pattern repeats.

Freiheit
  • 8,408
  • 6
  • 59
  • 101