regex for access log in hive serde with newline

Question

With aws athena services, I try to import csv file including new line data

Importing data uses hive serde format.

If data is like this, (each data is enclosed in double quotes. "")

"DataA"|"DataB"|"DataC"
"Data1"|"Data2
with new line"|"Data3"
"Data가"|"Data2나"|"Data나"

then how to write regular expressions to below table DDL?

CREATE EXTERNAL TABLE ssdm_schema.ABCTable_regex (  
  Data_A VARCHAR(100)  
, Data_B VARCHAR(100)  
, Data_C VARCHAR(100)  
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'  
WITH SERDEPROPERTIES (  
"input.regex" = '?????????'  
)

I'm asking to this question referring to the following answer.

How to handle embed line breaks in AWS Athena

Thank you

I reproduced the problem at https://regex101.com/r/bYF1Zm/1 using regex101 or a similar tool helps others see what you have already tried. — Freiheit, Feb 04 '20 at 02:01
Almost there, https://regex101.com/r/bYF1Zm/2 Just need to handle the unicode next. — Freiheit, Feb 04 '20 at 02:04

score 0 · Answer 1 · answered Feb 04 '20 at 02:09

Solved it. https://regex101.com/r/bYF1Zm/3

"([\w\s\n\r\X]+?)"\|"([\X\w\s\n\r]+?)"\|"([\X\w\s\n\r]+?)" with the global and unicode flags set.

There were three things making this tricky:

The line break in the middle of the data on line 2
The white space in the middle of the data on line 2
The unicode

This regex can probably be more succinct because the matching pattern repeats.

regex for access log in hive serde with newline

1 Answers1