2

I want a regex that will return zero or more occurrences of matched substrings of different patterns.

Different patterns to match

  1. <sym>Any value</sym>

  2. <sps>Any value</sps>

  3. <sbs>Any value</sbs>

  4. Any string including spaces and special characters which are outside of the above 3 tags

Where "Any value" is any string including spaces and special characters.

Test Cases

  1. abcd<sps>2</sps><sbs>yy</sbs>efgh<sym>b</sym>

  2. <sym>nu</sym>Hello World<sps>6&</sps><sbs>10</sbs>With Special Characters$#<sym>b</sym>

  3. <sps>2</sps>Test<sbs>yy</sbs><sym>b</sym>End String

Results

1.

abcd
<sps>2</sps>
<sbs>yy</sbs>
efgh
<sym>b</sym>

2.

<sym>nu</sym>
Hello World
<sps>6&</sps>
<sbs>10</sbs>
With Special Characters$#
<sym>b</sym>

3.

<sps>2</sps>
Test
<sbs>yy</sbs>
<sym>b</sym>
End String

I tried the following regex:

(?([a-zA-Z0-9]+))<sym>[^.]*</sym>|<sps>[^.]*</sps>|<sbs>[^.]*</sbs>(?([a-zA-Z0-9]+))

Result against "Test Case 1": I get only the following strings; the strings outside the tags are missing.

<sps>2</sps> <sbs>yy</sbs> <sym>b</sym>

Result against "Test Case 2": I get the full input text back.

<sym>nu</sym>Hello World<sps>6&</sps><sbs>10</sbs>With Special Characters$#<sym>b</sym>

Could you please help me with this? Thank you in advance!

Dhruba
  • Is the input XML? `6&` is not legal XML, so I’m not entirely clear on the format. – VGR Mar 22 '23 at 20:32
  • Consider using a proper XML parser, see also https://stackoverflow.com/a/1732454/14868997 You could for example use XPath `sym/text()` and `sps/text()` – Charlieface Mar 22 '23 at 21:45
  • Java and C# are different languages. Please choose one and delete the other tag. – AdrianHHH Mar 22 '23 at 22:01
  • @VGR, The input is not XML. I created those tags and the tags can contain anything including alphanumeric/special characters. – Dhruba Mar 23 '23 at 15:31
  • @Charlieface, parsing the XML is not my target. I want all the tags including the start and end tags with contained data. E.g., From test case 2, I expect the following result (separated by comma) nu, 6&, 10, b Along with that the strings which are outside of any tags. Again from test case 2: (separated by comma) Hello World, With Special Characters$# Please let me know if I missed anything to explain. – Dhruba Mar 23 '23 at 15:37
  • @AdrianHHH, I assumed the same regex is compatible with both Java and C#. I need to use this in C#. After searching, I learned that Java supports all of the C# regex syntax, but the reverse is not always true. I think in this case the regex might be the same for both languages. Please correct me if I am wrong. – Dhruba Mar 23 '23 at 15:52
  • You just need to add a root tag (for example `<Root>`) around the whole string. Eg https://dotnetfiddle.net/wAtz0Q – Charlieface Mar 23 '23 at 16:10
  • @Charlieface, it's a nice idea to go with XML parser. I reviewed the solution you shared. It's also taking the texts which are inside the tag. Just removing the text which are inside the tags will solve the problem. I am not sure how to remove the texts from the tags. – Dhruba Mar 23 '23 at 16:49
  • https://dotnetfiddle.net/32F65r There are also ways to read XML without having an XML root, you need to use `XmlSerializer` for that I think – Charlieface Mar 23 '23 at 16:54
  • If this is not XML, does your markup notation account for the possibility that the text content might actually contain the six characters of a closing tag such as `</sym>`? – VGR Mar 23 '23 at 17:23
  • @VGR, there can be any number of characters but I will use one or two which can be alphanumeric/special characters. – Dhruba Mar 23 '23 at 20:45

2 Answers

1

You should use a proper XML parser for this. In C# you can use `XElement`. This allows you to use LINQ to XML to query it.

Since your XML doesn't have a root, we need to add one.

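using System;
using System.Linq;
using System.Xml.Linq;

var myxml = @"abcd<sps>2</sps><sbs>yy</sbs>efgh<sym>b</sym>";

// Wrap the fragment in a root element so it parses as a single document.
var doc = XElement.Parse("<Root>" + myxml + "</Root>");
// Tagged elements come first, followed by the text nodes directly under the root.
var nodes = doc.Descendants()
            .Where(e => e.Name == "sym" || e.Name == "sps" || e.Name == "sbs")
            .Cast<XNode>()
            .Concat(doc.Nodes().OfType<XText>());
Console.WriteLine(string.Join("\r\n", nodes));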

dotnetfiddle
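
If the input can contain a bare `&` inside a tag (as in test case 2), `XElement.Parse` will throw, because `&` must start an entity reference in XML. The following is a minimal sketch of one possible workaround, not part of the original answer: it escapes bare ampersands first and then walks the root's child nodes in document order. The names `FragmentReader`, `Split`, `Render`, and `BareAmpersand` are hypothetical, and the sketch assumes the tags are not nested and that the text contains no stray `<` characters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class FragmentReader
{
    // Hypothetical helper: an '&' that does not already start an entity reference.
    private static readonly Regex BareAmpersand =
        new Regex(@"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)");

    public static List<string> Split(string fragment)
    {
        // Escape bare ampersands so the fragment becomes parseable XML.
        string escaped = BareAmpersand.Replace(fragment, "&amp;");
        XElement root = XElement.Parse("<Root>" + escaped + "</Root>");

        // Nodes() yields the root's direct children (text and elements) in document order.
        return root.Nodes().Select(n => Render(n)).ToList();
    }

    private static string Render(XNode node)
    {
        // Rebuild the tag around the unescaped content (assumes no nested elements).
        if (node is XElement e) return $"<{e.Name}>{e.Value}</{e.Name}>";
        // Text outside any tag; Value returns it unescaped.
        if (node is XText t) return t.Value;
        return node.ToString();
    }

    public static void Main()
    {
        string input = "<sym>nu</sym>Hello World<sps>6&</sps><sbs>10</sbs>With Special Characters$#<sym>b</sym>";
        Console.WriteLine(string.Join("\r\n", Split(input)));
    }
}
```

Note that with the default parse options, text between two tags that consists only of whitespace is dropped, so this sketch may not suit inputs where such whitespace matters.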

Charlieface
  • Many thanks for the answer. Just tested with all three test cases and seems to fail for the 2nd test case because there is an ampersand (&) inside a tag. Is there anything that ignores those special characters or any other ways? – Dhruba Mar 23 '23 at 20:42
  • No, `&` is a special character in XML and should be encoded as `&amp;`. You can try to fix up your data, but it depends on what exactly you are trying to achieve with the final result. If you don't do `Console.WriteLine(string.Join("\r\n", nodes.Select(n => n.Value)));` then it will still contain `&amp;` https://dotnetfiddle.net/q6nqZ9 – Charlieface Mar 23 '23 at 20:48
  • It will work. I can update the data format. Thanks a lot for helping me out. I am accepting the answer. However, I know it's quite difficult to achieve using Regex but I was interested if that could be done. – Dhruba Mar 23 '23 at 20:53
  • Ideally you should have valid XML in the first place, so you must have a single root element, and preferably an XML declaration preamble – Charlieface Mar 23 '23 at 20:55
  • Right, if that's a pure XML string. But I tried to use it as a normal string. – Dhruba Mar 23 '23 at 20:59
0

Alternation seems to do what you want. The last alternative, which matches text outside the tags, needs a trailing lookahead so that it stops at the next opening tag (or the end of the line) instead of overconsuming.

"<sym>.+?<\/sym>|<sps>.+?<\/sps>|<sbs>.+?<\/sbs>|.+?(?=<s(?:ym|ps|bs)>|$)"gm


You can also shorten it with a backreference:

"<(sym|sps|sbs)>.+?<\/\1>|.+?(?=<s(?:ym|ps|bs)>|$)"gm

In this case the alternation is shortened to just the tag name, and the closing tag can reuse it with the backreference \1.
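
As a rough illustration of how the backreference pattern might be applied in C#, here is a minimal sketch. It assumes `Regex.Matches` stands in for the `g` flag and `RegexOptions.Multiline` for the `m` flag; the `\/` escape is written as a plain `/` because .NET does not need it. The names `TagSplitter`, `Split`, and `Segment` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagSplitter
{
    // Either a tagged segment whose closing tag repeats the captured name,
    // or a run of untagged text that stops right before the next opening
    // tag or the end of the line.
    private static readonly Regex Segment = new Regex(
        @"<(sym|sps|sbs)>.+?</\1>|.+?(?=<s(?:ym|ps|bs)>|$)",
        RegexOptions.Multiline);

    public static List<string> Split(string input)
    {
        // m.Value is the whole match; m.Groups[1] would hold only the tag name.
        return Segment.Matches(input).Cast<Match>().Select(m => m.Value).ToList();
    }

    public static void Main()
    {
        foreach (string part in Split("abcd<sps>2</sps><sbs>yy</sbs>efgh<sym>b</sym>"))
            Console.WriteLine(part);
    }
}
```

The sketch reads `Match.Value` for every match, so both the tagged segments and the text between them are returned in input order. Since `.` does not match a newline, each line of a multi-line input is split separately.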

Chris Maurer
  • Hi @Chris, thanks for the answer. Unfortunately, with the regex you mentioned, I only can find two matches for test case 1: (results are separated by comma) 2, yy And still missing the other strings (Please have a look into the results 1). – Dhruba Mar 23 '23 at 15:59