I am using the regex below (tab-separated) to parse the given data, which is also tab-separated.

Hive table creation syntax:

 create table akmlogreg(logdate string, time string, clientip string, method string, uri string, status string, bytes string, TimeTakenMS string, referer string, useragent string, cs_Cookie string) ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" ="([0-9-]+)   ([^\t]*)    ([^\t]*)    ([^\t]*)    ([^\t]*)    ([^\t]*)    ([^\t]*)    ([^\t]*)    (\".*\"|[^ ]*)  (\".*\"|[^ ]*)  ([^\r\n]+)",
"output.format.string"="%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s");

With this regex I want comment lines (those starting with #) to be dropped and each log line to be matched as a single row. However, this statement gives an error when I try to create the table in Hive. I used a tab-separated regex because my log data is also tab-separated. Can anyone suggest a better approach or solution for parsing this kind of tab-separated data with a regex?

Exception:

FAILED: Error in metadata: java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 10
([0-9-]+)]+)
          ^
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

Data:

#Version: 1.0
#Fields: date time cs-ip cs-method cs-uri sc-status sc-bytes time-taken cs(Referer) cs(User-Agent) cs(Cookie)
2013-07-02  00:00:00    242.242.242.242 GET /9699/14916.jpg 200 6783    0   "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.23 Safari/534.10"    "-"
2013-07-02  00:00:00    242.242.242.242 GET /169875/2006-2010-679336-640x428.JPG    200 78221   355 "-" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.52 Safari/537.36" "-"
2013-07-02  00:00:00    242.242.242.242 GET /169875/2006-2010-679339-640x428.JPG    200 86791   238 "-" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.52 Safari/537.36" "-"
Naresh

1 Answer

Try this:

^([0-9-]+)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t(\".*?\"|[^ ]*)\t(\".*?\"|[^ ]*)\t([^\r\n]+)$
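
As a quick sanity check outside Hive, the pattern can be tried against the first sample row. This is only a sketch: Python's `re` module stands in for `java.util.regex` (the engine named in the exception above), on the assumption that both treat this particular pattern the same way. The helper name `parse_line` and the rebuilt sample line are illustrative and not part of the original post.

```python
import re

# The pattern from the answer: every field separator is an explicit \t,
# and the match is anchored at both ends of the line.
LOG_PATTERN = re.compile(
    r'^([0-9-]+)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)'
    r'\t([^\t]*)\t([^\t]*)\t(\".*?\"|[^ ]*)\t(\".*?\"|[^ ]*)\t([^\r\n]+)$'
)


def parse_line(line):
    """Return the 11 captured fields as a tuple, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groups() if match else None


# First data row from the question, rebuilt with real tab characters
# (assumption: the original log uses exactly one tab between fields).
sample_fields = [
    "2013-07-02", "00:00:00", "242.242.242.242", "GET", "/9699/14916.jpg",
    "200", "6783", "0", '"-"',
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.10 '
    '(KHTML, like Gecko) Chrome/8.0.552.23 Safari/534.10"',
    '"-"',
]
sample_line = "\t".join(sample_fields)

if __name__ == "__main__":
    print(parse_line(sample_line))      # the 11 fields
    print(parse_line("#Version: 1.0"))  # header lines do not match
```

Lines starting with # fail at the first group, so they never match. As far as I know, Hive's RegexSerDe does not drop such lines but returns a row with NULL in every column, so a filter such as `WHERE logdate IS NOT NULL` would probably still be needed to hide them.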

Stephan
  • Thanks Alex. It worked. By the way, do you have any idea why Hive doesn't accept a tab-separated regex? – Naresh Jul 08 '13 at 09:53
  • I can't say for Hive... but as a general rule of thumb, when writing regular expressions, I always explicitly specify the characters I want to match, whatever regexp flavour is in use. – Stephan Jul 08 '13 at 10:30