
I need to construct a Hive SerDe RegEx for pipe delimited data.

Sample data:

CEF:0|Microsoft|Microsoft Windows||Microsoft-Windows-Security-Auditing:434|An account was logged off.|Low| eventId=260 externalId=44 msg=Network: A user or computer logged on to this computer from the network. categorySignificance=/Informational categoryBehavior=/Access/Stop categoryDeviceGroup=/Operating System catdt=Operating System categoryOutcome=/Success categoryObject=/Host/Operating|Vista ad.EventIndex=-972 ad.WindowsParserFamily=Windows 2008 R2|2008|7|Vista ad.WindowsVersion=Windows Server

For this we need to split out the first seven columns on the pipe character and treat everything after the seventh pipe as a single column.

DDL: (CEF STRING, Vendor STRING, Product STRING, Version STRING, Signature STRING, Name STRING, Severity STRING, Extension STRING)

So the sample data should be mapped to columns as follows:

Col1: CEF:0
Col2: Microsoft
Col3: Microsoft Windows
Col4:
Col5: Microsoft-Windows-Security-Auditing:434
Col6: An account was logged off.
Col7: Low
Col8: eventId=260 externalId=44 msg=Network: A user or computer logged on to this computer from the network. categorySignificance=/Informational categoryBehavior=/Access/Stop categoryDeviceGroup=/Operating System catdt=Operating System categoryOutcome=/Success categoryObject=/Host/Operating|Vista ad.EventIndex=-972 ad.WindowsParserFamily=Windows 2008 R2|2008|7|Vista ad.WindowsVersion=Windows Server

What should be the input.regex?

Also, is it possible to use a Map data type for the columns in key=value format with this regex?

Sourabh Potnis

1 Answer


I have no experience with Hive, but judging from some examples, the following value for input.regex should work:

([^\\|]*)\\|([^\\|]*)\\|([^\\|]*)\\|([^\\|]*)\\|([^\\|]*)\\|([^\\|]*)\\|([^\\|]*)\\|(.*)
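As a quick sanity check outside Hive, the pattern can be tried in Python. This is only a sketch: it assumes the doubled backslashes above are string-literal escaping in the DDL, so the regex itself uses single backslashes, and it runs against a shortened copy of the sample line from the question. The names `CEF_PATTERN`, `sample` and `columns` are made up for this example.

```python
import re

# The pattern from the answer, with the doubled backslashes reduced to single
# ones (assumption: the doubling is only needed inside the Hive DDL string).
CEF_PATTERN = re.compile(
    r"([^\|]*)\|([^\|]*)\|([^\|]*)\|([^\|]*)\|([^\|]*)\|([^\|]*)\|([^\|]*)\|(.*)"
)

# Shortened version of the sample line from the question.
sample = (
    "CEF:0|Microsoft|Microsoft Windows||"
    "Microsoft-Windows-Security-Auditing:434|An account was logged off.|Low|"
    " eventId=260 externalId=44 categoryObject=/Host/Operating|Vista"
    " ad.WindowsParserFamily=Windows 2008 R2|2008|7|Vista"
)

match = CEF_PATTERN.fullmatch(sample)
columns = list(match.groups())

# Print the eight captured columns; pipes after the seventh stay in Col8.
for number, value in enumerate(columns, start=1):
    print(f"Col{number}: {value!r}")
```

Note that the eighth column keeps the leading space that follows `Low|` in the sample, so it may need trimming downstream.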

You might need to configure an output.format.string. Maybe the following links help:

Peter
  • Great answer, but a note for those reading it: the OP doesn't mention it, but CEF allows escaped pipes \| within the pipe-delimited section of the header (how smart is that?), and your regex doesn't seem to cater for that, e.g. Sep 19 08:26:10 host CEF:0|security|threatmanager|1.0|100|detected a \| in message|10|src=10.0.0.1 act=blocked a | dst=1.1.1.1 – Andre de Miranda Mar 23 '15 at 13:20
  • I also created a CEF parser in Java so that people no longer have to suffer parsing this format from hell. :-) https://github.com/fluenda/ParCEFone – Andre de Miranda Aug 02 '16 at 12:38
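One possible way to handle the escaped pipes mentioned in the comment is to let each header field match either an ordinary character or a backslash followed by any character. The sketch below is in Python and is not taken from the answer or from ParCEFone; `FIELD`, `ESCAPE_AWARE` and `NAIVE` are hypothetical names, and the syslog prefix from the comment's example is left out. In a Hive DDL string the backslashes would presumably need doubling, as in the answer's pattern.

```python
import re

# One header field: any run of characters that are neither a backslash nor a
# pipe, or a backslash followed by any single character (e.g. "\|").
FIELD = r"((?:[^\\|]|\\.)*)"

# Seven escape-aware header fields, then everything else as the extension.
ESCAPE_AWARE = re.compile(r"\|".join([FIELD] * 7) + r"\|(.*)")

# The answer's pattern, for comparison.
NAIVE = re.compile(r"\|".join([r"([^\|]*)"] * 7) + r"\|(.*)")

# Example from the comment, without the syslog prefix.
line = (
    r"CEF:0|security|threatmanager|1.0|100|detected a \| in message|10|"
    r"src=10.0.0.1 act=blocked a | dst=1.1.1.1"
)

aware = list(ESCAPE_AWARE.fullmatch(line).groups())
naive = list(NAIVE.fullmatch(line).groups())

print(aware[5])  # detected a \| in message
print(aware[7])  # src=10.0.0.1 act=blocked a | dst=1.1.1.1
print(naive[5])  # detected a \   (field cut short at the escaped pipe)
```

The escape sequence is kept as-is in the captured field; turning `\|` back into `|` would be a separate post-processing step.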