Regex to find zero or more occurrences of substrings of different patterns

Question

I want a regex that will return zero or more occurrences of matched substrings of different patterns.

Different patterns to match

<sym>Any value</sym>
<sps>Any value</sps>
<sbs>Any value</sbs>
Any string including spaces and special characters which are outside of the above 3 tags

Where "Any value" is any string including spaces and special characters.

Test Cases

abcd<sps>2</sps><sbs>yy</sbs>efgh<sym>b</sym>
<sym>nu</sym>Hello World<sps>6&</sps><sbs>10</sbs>With Special Characters$#<sym>b</sym>
<sps>2</sps>Test<sbs>yy</sbs><sym>b</sym>End String

Results

1.

abcd
<sps>2</sps>
<sbs>yy</sbs>
efgh
<sym>b</sym>

2.

<sym>nu</sym>
Hello World
<sps>6&</sps>
<sbs>10</sbs>
With Special Characters$#
<sym>b</sym>

3.

<sps>2</sps>
Test
<sbs>yy</sbs>
<sym>b</sym>
End String

I tried the following regex:

(?([a-zA-Z0-9]+))<sym>[^.]*</sym>|<sps>[^.]*</sps>|<sbs>[^.]*</sbs>(?([a-zA-Z0-9]+))

Result against test case 1: I get only the following strings; the strings outside the tags are missing.

<sps>2</sps> <sbs>yy</sbs> <sym>b</sym>

Result against test case 2: I get the full input text.

<sym>nu</sym>Hello World<sps>6&</sps><sbs>10</sbs>With Special Characters$#<sym>b</sym>

Could you please help me with this? Thank you in advance!

Is the input XML? `6&` is not legal XML, so I’m not entirely clear on the format. — VGR, Mar 22 '23 at 20:32
Consider using a proper XML parser, see also https://stackoverflow.com/a/1732454/14868997 You could for example use XPath `sym/text()` and `sps/text()` — Charlieface, Mar 22 '23 at 21:45
Java and C# are different languages. Please choose one and delete the other tag. — AdrianHHH, Mar 22 '23 at 22:01
@VGR, The input is not XML. I created those tags and the tags can contain anything including alphanumeric/special characters. — Dhruba, Mar 23 '23 at 15:31
@Charlieface, parsing the XML is not my target. I want all the tags including the start and end tags with contained data. E.g., From test case 2, I expect the following result (separated by comma) nu, 6&, 10, b Along with that the strings which are outside of any tags. Again from test case 2: (separated by comma) Hello World, With Special Characters$# Please let me know if I missed anything to explain. — Dhruba, Mar 23 '23 at 15:37
@AdrianHHH, I assumed the same regex is compatible with both Java and C#. I need to use this in C#. After searching, I learned that Java supports all of the C# regex syntax, but the reverse is not always true. Still, I think in this case the regex might be the same for both languages. Please correct me if I am wrong. — Dhruba, Mar 23 '23 at 15:52
You just need to add a `` tag around the whole string. Eg https://dotnetfiddle.net/wAtz0Q — Charlieface, Mar 23 '23 at 16:10
@Charlieface, it's a nice idea to go with XML parser. I reviewed the solution you shared. It's also taking the texts which are inside the tag. Just removing the text which are inside the tags will solve the problem. I am not sure how to remove the texts from the tags. — Dhruba, Mar 23 '23 at 16:49
https://dotnetfiddle.net/32F65r There are also ways to read XML without having an XML root, you need to use `XmlSerializer` for that I think — Charlieface, Mar 23 '23 at 16:54
If this is not XML, does your markup notation account for the possibility that the text content might actually contain the six characters ``? — VGR, Mar 23 '23 at 17:23
@VGR, there can be any number of characters but I will use one or two which can be alphanumeric/special characters. — Dhruba, Mar 23 '23 at 20:45

score 1 · Accepted Answer · answered Mar 23 '23 at 17:16

You should use a proper XML parser for this. In C# you can use XElement. This allows you to use LINQ to XML to query it.

Since your XML doesn't have a root, we need to add one.

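using System;
using System.Linq;
using System.Xml.Linq;

var myxml = @"abcd<sps>2</sps><sbs>yy</sbs>efgh<sym>b</sym>";

// Wrap the fragment in a root element so it parses as a single document.
var doc = XElement.Parse("<Root>" + myxml + "</Root>");
// Collect the tagged elements first, then append the text nodes that sit outside them.
var nodes = doc.Descendants()
            .Where(e => e.Name == "sps" || e.Name == "sbs" || e.Name == "sym")
            .Cast<XNode>()
            .Concat(doc.Nodes().OfType<XText>());
Console.WriteLine(string.Join("\r\n", nodes));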

dotnetfiddle
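
The query above lists the tagged elements first and the loose text afterwards. If the pieces are wanted in input order, as in the expected results in the question, one possible variation is to walk `Nodes()` on the root, which yields child elements and text nodes in document order. This is only a sketch under the same assumption as the answer (the fragment is well-formed once wrapped in a root element); the `InOrder` name is made up for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: returns tagged elements and loose text in input order.
string[] InOrder(string fragment)
{
    // Wrap the fragment in a synthetic root so it parses as one XML document.
    var root = XElement.Parse("<Root>" + fragment + "</Root>");
    // Nodes() returns the root's direct children (XElement and XText) in document order.
    return root.Nodes()
               .Select(n => n.ToString())
               .ToArray();
}

Console.WriteLine(string.Join(Environment.NewLine,
    InOrder("abcd<sps>2</sps><sbs>yy</sbs>efgh<sym>b</sym>")));
```

For the first test case this should print abcd, <sps>2</sps>, <sbs>yy</sbs>, efgh and <sym>b</sym>, one per line. It does not filter by tag name, so any other element placed directly under the root would be returned as well.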

Charlieface

Many thanks for the answer. Just tested with all three test cases and seems to fail for the 2nd test case because there is an ampersand (&) inside a tag. Is there anything that ignores those special characters or any other ways? – Dhruba Mar 23 '23 at 20:42
No, `&` is a special character in XML, and should be encoded as `&amp;`. You can try to fix up your data, but it depends on what exactly you are trying to achieve with the final result. If you don't do `Console.WriteLine(string.Join("\r\n", nodes.Select(n => n.Value)));` then it will still contain `&amp;` https://dotnetfiddle.net/q6nqZ9 – Charlieface Mar 23 '23 at 20:48
It will work. I can update the data format. Thanks a lot for helping me out. I am accepting the answer. However, I know it's quite difficult to achieve using Regex but I was interested if that could be done. – Dhruba Mar 23 '23 at 20:53
Ideally you should have valid XML in the first place, so you must have a single root element, and preferably an `` preamble – Charlieface Mar 23 '23 at 20:55
Right, if that's a pure XML string. But I tried to use it as a normal string. – Dhruba Mar 23 '23 at 20:59
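
One way the data fix-up discussed in these comments could look, as a rough sketch: escape every ampersand that does not already start an entity or character reference, then parse as before. The `EscapeBareAmpersands` name and the exact pattern are assumptions for illustration, not something taken from the thread. It does not handle a literal `<` inside a value, and it cannot tell whether an existing `&amp;` in the input was meant literally.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper: escape '&' unless it already begins an entity
// (&amp; &lt; &gt; &quot; &apos;) or a numeric character reference.
string EscapeBareAmpersands(string fragment) =>
    Regex.Replace(fragment, @"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)", "&amp;");

var raw = "<sym>nu</sym>Hello World<sps>6&</sps><sbs>10</sbs>";
var parsed = XElement.Parse("<Root>" + EscapeBareAmpersands(raw) + "</Root>");

var sps = parsed.Element("sps");
// ToString() re-serializes the element, so the ampersand shows in escaped form.
Console.WriteLine(sps?.ToString());
// Value returns the decoded text content.
Console.WriteLine(sps?.Value);
```

Here the first line printed should be <sps>6&amp;</sps> and the second 6&, which matches the difference between printing the nodes and printing their values described in the comments above.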

Chris Maurer · Answer 2 · 2023-03-22T21:00:47.213

0

Alternation seems to do what you want. You need a final lookahead to stop the zero occurrence case from overconsuming.

"<sym>.+?<\/sym>|<sps>.+?<\/sps>|<sbs>.+?<\/sbs>|.+?(?=<s(?:ym|ps|bs)>|$)"gm

You can also shorten it with a backreference:

"<(sym|sps|sbs)>.+?<\/\1>|.+?(?=<s(?:ym|ps|bs)>|$)"gm

In this case the alternation is shortened to just the tag value, and the closing tag can reuse it with the backreference \1.
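
A minimal C# sketch of how the backreference pattern might be applied, assuming .NET's `System.Text.RegularExpressions`; the `SplitSegments` name is made up for illustration. The `gm` flags map roughly to calling `Regex.Matches` (all matches) with `RegexOptions.Multiline`. Note that it collects the whole match (`Match.Value`) rather than capture group 1, since group 1 only holds the tag name and is empty for the free-text alternative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// First alternative: a whole <sym>/<sps>/<sbs> element, closed by the same tag name via \1.
// Second alternative: free text, consumed lazily up to the next opening tag or the end of the line.
var pattern = new Regex(
    @"<(sym|sps|sbs)>.+?</\1>|.+?(?=<s(?:ym|ps|bs)>|$)",
    RegexOptions.Multiline);

// Hypothetical helper: returns every matched segment in input order.
string[] SplitSegments(string input) =>
    pattern.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();

foreach (var part in SplitSegments("abcd<sps>2</sps><sbs>yy</sbs>efgh<sym>b</sym>"))
    Console.WriteLine(part);
```

For the first test case this should print abcd, <sps>2</sps>, <sbs>yy</sbs>, efgh and <sym>b</sym>, one per line. Because `.` does not match a newline, a tagged value that spans several lines would not be matched as one segment by this sketch.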

edited Mar 22 '23 at 21:00

answered Mar 22 '23 at 20:44

Chris Maurer

Hi @Chris, thanks for the answer. Unfortunately, with the regex you mentioned, I only can find two matches for test case 1: (results are separated by comma) 2, yy And still missing the other strings (Please have a look into the results 1). – Dhruba Mar 23 '23 at 15:59
