Regex, select Nth match

Question

I have a file that contains this:

<Row>
<Cell><Data ss:Type="String">INC000012486615</Data></Cell>
<Cell><Data ss:Type="String">abcd-efg-hij4-en:ddcs</Data></Cell>
<Cell><Data ss:Type="String">fs-hubraum-apps:kayw-de</Data></Cell>
<Cell><Data ss:Type="String">mn-def-seb01:sfyc-en</Data></Cell>
<Cell><Data ss:Type="String">00055s4dEN</Data></Cell>
<Cell><Data ss:Type="String"></Data></Cell>
<Cell><Data ss:Type="String">General Information</Data></Cell>
<Cell ss:StyleID="ce2"><Data  ss:Type="DateTime">2017-06-28T16:24:35</Data>
</Cell><Cell><Data ss:Type="String">Public</Data></Cell>
<Cell><Data ss:Type="String">Hi John,
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Thanks,
Snow</Data></Cell>  
</Row>

I wrote a regex that selects the valuable information: (?<=<Data[^>]*>)((.|\n)*?)(?=<\/Data>). It selects only the data inside <Cell><Data>. You can test on this link

I would like to be able to select the nth match using Regex: (1st match: INC000012486615, second match abcd-efg-hij4-en:ddcs, etc.)

I wasn't successful in modifying my regex. Any suggestions?

PS: I have to use regex, because this is for Splunk field extraction.

[**Do not use regex to parse XML**](https://stackoverflow.com/a/1732454/1954610). Use a parser. — Tom Lord, Jul 20 '17 at 09:50
Hello. I have to use Regex inside Splunk field extractor. Also the file is not well structured xml. So I cannot use xml parser. — belas, Jul 20 '17 at 10:01
What do you mean by "not well structured"? The example you provided seems fine. I'm not familiar with `splunk`, but a quick google tells me you can use `spath` to parse the XML? http://docs.splunk.com/Documentation/Splunk/6.0/SearchReference/Spath — Tom Lord, Jul 20 '17 at 14:23
I only put a data snippet to focus on the question problem. The data source is a log file including XML. It is not an XML file, thus cannot import or parse. — belas, Jul 20 '17 at 16:10

score 2 · Accepted Answer · answered Jul 20 '17 at 12:20

Check whether this pattern returns the 3rd Data value:

<Row>(?:\s*(?:<\/Data>\s*<\/Cell>\s*)?<Cell[^<>]*>\s*<Data\b[^<>]+>\K([^<>]*)){3}

The \K resets the start of the reported match, so everything matched before it is excluded from the result. Change the {3} quantifier to select a different position.
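To try the idea outside Splunk, here is a minimal Python sketch. Python's built-in re module does not support \K, so this version swaps in a capturing group; because the group sits inside the repeated part, it holds the value from the last repetition. The helper name nth_data_value and the shortened SAMPLE string are assumptions made for illustration, not part of the original answer.

```python
import re

# Shortened copy of the <Row> snippet from the question.
SAMPLE = """<Row>
<Cell><Data ss:Type="String">INC000012486615</Data></Cell>
<Cell><Data ss:Type="String">abcd-efg-hij4-en:ddcs</Data></Cell>
<Cell><Data ss:Type="String">fs-hubraum-apps:kayw-de</Data></Cell>
<Cell><Data ss:Type="String">mn-def-seb01:sfyc-en</Data></Cell>
</Row>
"""


def nth_data_value(text, n):
    """Return the text of the n-th <Data> element in the first <Row>, or None."""
    # Same structure as the pattern above, with \K replaced by a capturing
    # group. A group inside a repeated block keeps its last repetition.
    pattern = (
        r"<Row>(?:\s*(?:</Data>\s*</Cell>\s*)?"
        r"<Cell[^<>]*>\s*<Data\b[^<>]+>([^<>]*)){%d}" % n
    )
    match = re.search(pattern, text)
    return match.group(1) if match else None


print(nth_data_value(SAMPLE, 1))  # INC000012486615
print(nth_data_value(SAMPLE, 3))  # fs-hubraum-apps:kayw-de
```

Like the original pattern, this only captures values that contain no < or > characters, and it returns None when the row has fewer than n cells.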

LukStorms


score 1 · Answer 2 · answered Jul 31 '17 at 12:13

This is the wrong approach. Rather than writing a sloppy regular expression to capture all the values, it would be better to enable KV_MODE in your props.conf.

If you're in a clustered environment, go to your cluster master and edit props.conf to set KV_MODE = xml.

In a non-clustered environment, go to your indexer(s) and add the KV_MODE attribute.
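As a rough sketch, the props.conf entry could look like the stanza below. The sourcetype name my_xml_logs is a placeholder, not something taken from the question; replace it with the sourcetype of the log file.

```
# props.conf
# [my_xml_logs] is a hypothetical sourcetype name; use your own.
[my_xml_logs]
KV_MODE = xml
```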
