I am trying to split a string into groups using a regex in NiFi. Could anyone help me split the string below?

I have a string like this:

"abc","-9223371901096288826","/home/test/20170614","abc.com","Hello,Test","7462200","4622012","1296614","1029293","893529","a:ce:o:5:l:p:MMM dd HH:mm:ss","Logs","UTF8","<111>Jun 14 12:43:20 logs: Info: 1497462198.717 13073 1.22.333.44 TCP/200 168 TCP_CONNECT 1.22.33.44:443 ""GO\ABC.COM"" DIRECT/img.abc.com - test_abc_7-DefaultGroup-DefaultGroup-NONE-NONE-NONE-DefaultGroup <IW_adv,3.9,-,""-"",-,-,-,-,""-"",-,-,-,""-"",-,-,""-"",""-"",-,-,IW_adv,-,""-"",""-"",""Unknown"",""Unknown"",""-"",""-"",0.10,0,-,""-"",""-"",-,""-"",-,-,""-"",""-"",-,-,""-""> - -"


I want to split on commas but ignore commas inside quotes. I want a result something like this:

    group 1 - abc
    group 2 - -9223371901096288826
    group 3 - /home/test/20170614
    group 4 - abc.com
    group 5 - Hello,Test
    group 6 - 7462200
    group 7 - 4622012
    group 8 - 1296614
    group 9 - 1029293
    group 10 - 893529
    group 11 - a:ce:o:5:l:p:MMM dd HH:mm:ss
    group 12 - Logs
    group 13 - UTF8
    group 14 - <111>Jun 14 12:43:20 logs: Info: 1497462198.717 13073 1.22.333.44 TCP/200 168 TCP_CONNECT 1.22.33.44:443 ""GO\ABC.COM"" DIRECT/img.abc.com - test_abc_7-DefaultGroup-DefaultGroup-NONE-NONE-NONE-DefaultGroup <IW_adv,3.9,-,""-"",-,-,-,-,""-"",-,-,-,""-"",-,-,""-"",""-"",-,-,IW_adv,-,""-"",""-"",""Unknown"",""Unknown"",""-"",""-"",0.10,0,-,""-"",""-"",-,""-"",-,-,""-"",""-"",-,-,""-""> - -


I tried many regexes but could not get the proper result.

I tried the regex ,(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$), which I found from this link.

The above regex works great with Java's split() function, but I don't want to use Java.
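For illustration, here is a minimal sketch of what that split regex does, written in Python rather than NiFi or Java. The sample line is a shortened, made-up stand-in for the real string, and the helper name `split_outside_quotes` is hypothetical.

```python
import re

# A comma counts as a separator only if an even number of double quotes
# follows it up to the end of the line, i.e. it sits outside any quoted field.
SPLIT_OUTSIDE_QUOTES = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

def split_outside_quotes(line):
    """Split on commas outside double quotes; the quotes are kept in the pieces."""
    return SPLIT_OUTSIDE_QUOTES.split(line)

# Shortened, made-up sample in the same style as the string in the question.
sample = '"abc","Hello,Test","say ""hi"", ok"'
fields = split_outside_quotes(sample)
for number, field in enumerate(fields, start=1):
    print(f"group {number} - {field}")
```

Note that each piece still carries its surrounding quotes, so they would have to be stripped in a separate step.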

I also tried the regex (?<=\")([^,]*)(?=\"), which splits the string into groups by commas, but it also splits inside the double quotes.

Could anyone help me? Thanks in advance.

ankit
  • What's your OS? I can suggest a Unix-based solution – RomanPerekhrest Jun 22 '17 at 07:26
  • Not sure if NiFi provides an option of getting a list of captures, but if it does, you may, instead of splitting, match all quoted values, taking escaped quotes into account, with [`"((?:""|[^"])*)"`](https://regex101.com/r/y5futn/2) – Dmitry Egorov Jun 22 '17 at 07:26
  • Thanks Dmitry. Your suggested expression worked. – ankit Jun 22 '17 at 07:36
  • https://stackoverflow.com/questions/11456850/split-a-string-by-commas-but-ignore-commas-within-double-quotes-using-javascript – Ankit Kumar Jun 22 '17 at 07:36
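
The match-instead-of-split idea from Dmitry Egorov's comment can be sketched outside NiFi as follows. This is a Python illustration only; the sample line is a shortened, made-up stand-in for the real string, and the helper name `quoted_fields` is hypothetical.

```python
import re

# One quoted value: an opening quote, then any run of doubled quotes ("")
# or non-quote characters, then the closing quote. Group 1 is the content.
QUOTED_FIELD = re.compile(r'"((?:""|[^"])*)"')

def quoted_fields(line):
    """Return the content of every quoted value, without the outer quotes."""
    return QUOTED_FIELD.findall(line)

# Shortened, made-up sample in the same style as the string in the question.
sample = '"abc","Hello,Test","say ""hi"", ok"'
values = quoted_fields(sample)
for number, value in enumerate(values, start=1):
    print(f"group {number} - {value}")
```

Unlike the split approach, the captured values come back without their outer quotes, while doubled quotes inside a value are left as they are.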

1 Answer

You can meet your requirement without capturing groups in the following way.

Consider your string below.

1. Use UpdateAttribute to store the whole string in an attribute named "InputString":

"abc","-9223371901096288826","/home/test/20170614","abc.com","Hello,Test","7462200","4622012","1296614","1029293","893529","a:ce:o:5:l:p:MMM dd HH:mm:ss","Logs","UTF8","<111>Jun 14 12:43:20 logs: Info: 1497462198.717 13073 1.22.333.44 TCP/200 168 TCP_CONNECT 1.22.33.44:443 ""GO\ABC.COM"" DIRECT/img.abc.com - test_abc_7-DefaultGroup-DefaultGroup-NONE-NONE-NONE-DefaultGroup <IW_adv,3.9,-,""-"",-,-,-,-,""-"",-,-,-,""-"",-,-,""-"",""-"",-,-,IW_adv,-,""-"",""-"",""Unknown"",""Unknown"",""-"",""-"",0.10,0,-,""-"",""-"",-,""-"",-,-,""-"",""-"",-,-,""-""> - -"

2. After that UpdateAttribute, use another UpdateAttribute to extract the values, like below:

group1:${InputString:getDelimitedField(1)}
group2:${InputString:getDelimitedField(2)}
group3:${InputString:getDelimitedField(3)}
group4:${InputString:getDelimitedField(4)}
group5:${InputString:getDelimitedField(5)}
group6:${InputString:getDelimitedField(6)}
group7:${InputString:getDelimitedField(7)}
group8:${InputString:getDelimitedField(8)}
group9:${InputString:getDelimitedField(9)}
group10:${InputString:getDelimitedField(10)}
group11:${InputString:getDelimitedField(11)}
group12:${InputString:getDelimitedField(12)}
group13:${InputString:getDelimitedField(13)}

The getDelimitedField function is the easiest way to extract these values; see the reference below:

https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#getdelimitedfield
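
To cross-check which value belongs to which field number, the same kind of line can be parsed outside NiFi. The sketch below uses Python's `csv` module and is not NiFi's implementation: NiFi's own output may differ, for example in whether the surrounding quotes are kept or how doubled quotes are handled. The helper names `parse_quoted_line` and `nth_field` are hypothetical, and the sample line is a shortened, made-up stand-in.

```python
import csv

def parse_quoted_line(line):
    """Parse one comma-delimited, double-quoted line into a list of values."""
    # csv.reader treats a doubled quote ("") inside a quoted field
    # as a single literal quote and drops the surrounding quotes.
    return next(csv.reader([line]))

def nth_field(line, index):
    """Return the field at a 1-based index, matching the numbering used above."""
    return parse_quoted_line(line)[index - 1]

# Shortened, made-up sample in the same style as the string in the question.
sample = '"abc","Hello,Test","say ""hi"", ok"'
print(nth_field(sample, 2))
```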

Let me know if you face any issues with it.

Mister X