I want to wrap each tag in the following XML in <span></span>. I would like to use a C# regular expression like this:

Regex.Replace(xml, @"<*>", @"<span>" + @"<*>" + "</span>")

Original XML:

<div id="Content">
  <p>1</p>
  <h2>1</h2>
  <h2>2</h2>
</div>

Modified XML:

<span><div id="Content"></span>
  <span><p></span>1<span></p></span>
  <span><h2></span>1<span></h2></span>
  <span><h2></span>2<span></h2></span>
<span></div></span>
Fjodr
Alex W.
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Der Kommissar Apr 27 '15 at 19:12
  • The resulting file is not valid XML or HTML. – p.s.w.g Apr 27 '15 at 19:13
  • Please only use snippets for executable examples written JavaScript, HTML, and/or CSS. If you just want to display a snippet of code from some other language or just to display the syntax of a file (without a full example), use code blocks. – p.s.w.g Apr 27 '15 at 19:18
  • Why do you want to do that? – Casimir et Hippolyte Apr 27 '15 at 19:22
  • @CasimiretHippolyte: You yourself said once you are a computer scientist. We are people who use that science in practice, and in real life, where there are cases when we need malformed XML/HTML/.+ML. Just bear with that, it is something we have to live with. Do you want me to show at least one real life example? – Wiktor Stribiżew Apr 27 '15 at 19:30
  • @CasimiretHippolyte stribizhew may have better examples, but for my purpose, I want to add a span around anything I don't want translated into another language before sending it to the Google translation service. – Alex W. Apr 27 '15 at 19:40
  • @stribizhev: I have never said I am a computer scientist because I am not a computer scientist. I asked this naive question because I have seen something incoherent, and when you see something incoherent, a wrong approach is hidden behind. It is the kind of things I have learned with the **practice**. – Casimir et Hippolyte Apr 27 '15 at 21:45
  • @AlexW.: the good approach is to extract the text content and to send it to the service without html tags, not to enclose all between span tags. – Casimir et Hippolyte Apr 27 '15 at 21:49
  • @CasimiretHippolyte Thanks for the suggestion. I have thought about it, but can you shed some light on how to achieve it? I assume the XML tree structure knows where text contents are so I can put them back to their spots after translation. – Alex W. Apr 28 '15 at 14:41
  • @AlexW.: indeed, all you need is to work text node by text node. I'm not a .NET expert, but this article (https://support.microsoft.com/en-us/kb/308333/) may help you, since it explains how to query an XML document with XPath (the query `//text()` returns all the text nodes of the document). – Casimir et Hippolyte Apr 28 '15 at 15:05
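
The text-node approach from the last comments can be sketched in C# roughly as follows. This is a minimal sketch, not the commenter's code: it uses LINQ to XML instead of an XPath query, and `TextNodeTranslation`, `TranslateTextNodes`, and `Translate` are hypothetical names. `Translate` only upper-cases the text as a stand-in for a real call to a translation service.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class TextNodeTranslation
{
    // Hypothetical stand-in for a call to a translation service.
    static string Translate(string text) => text.ToUpperInvariant();

    public static string TranslateTextNodes(string xml, Func<string, string> translate)
    {
        // PreserveWhitespace keeps whitespace-only text nodes, so the layout survives.
        var doc = XDocument.Parse(xml, LoadOptions.PreserveWhitespace);

        // DescendantNodes().OfType<XText>() plays the role of the XPath query //text().
        // ToList() takes a snapshot before the nodes are modified.
        foreach (var node in doc.DescendantNodes().OfType<XText>().ToList())
        {
            // Skip indentation and other whitespace-only nodes.
            if (!string.IsNullOrWhiteSpace(node.Value))
                node.Value = translate(node.Value);
        }

        // Write the tree back out without re-indenting it.
        return doc.ToString(SaveOptions.DisableFormatting);
    }

    static void Main()
    {
        var xml = "<div id=\"Content\"><p>one</p><h2>two</h2></div>";
        Console.WriteLine(TranslateTextNodes(xml, Translate));
    }
}
```

Because each text node is replaced in place, the translated text ends up back in its original spot and the tags never reach the translation service.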

3 Answers

1

I suggest avoiding regex for XHTML, since there are better tools for the job: an XML parser, XQuery, XPath, etc.

However, if you still have to or want to use regex, use a capturing group with a non-greedy quantifier and refer to the group as $1 in the replacement. You can use this pattern:

(<.*?>)

Federico Piazza
  • There is no point in creating additional capturing groups, this is unnecessary overhead. You can use `$&` to replace with the full match. Also, your regex will not match cases when tag attributes span across multiple lines: https://regex101.com/r/dB1wQ7/2. – Wiktor Stribiżew Apr 27 '15 at 19:28
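
The two points in this comment can be checked with a small sketch. The class and method names (`FullMatchDemo`, `WrapWithGroup`, `WrapWithFullMatch`) and the sample strings are made up for illustration. The sketch compares `$1` with `$&`, then runs the `.`-based pattern on a tag whose attribute sits on a second line, with and without `RegexOptions.Singleline`.

```csharp
using System;
using System.Text.RegularExpressions;

static class FullMatchDemo
{
    // The answer's pattern: a capturing group referenced as $1.
    public static string WrapWithGroup(string xml) =>
        Regex.Replace(xml, @"(<.*?>)", "<span>$1</span>");

    // Same pattern without the group; $& stands for the whole match.
    public static string WrapWithFullMatch(string xml, RegexOptions options = RegexOptions.None) =>
        Regex.Replace(xml, @"<.*?>", "<span>$&</span>", options);

    static void Main()
    {
        var oneLine = "<p>1</p>";
        // Both calls wrap the tags identically.
        Console.WriteLine(WrapWithGroup(oneLine));
        Console.WriteLine(WrapWithFullMatch(oneLine));

        var multiLine = "<div\n  id=\"Content\">1</div>";
        // Without Singleline, '.' stops at the newline, so the opening tag is skipped.
        Console.WriteLine(WrapWithFullMatch(multiLine));
        // With Singleline, '.' also matches '\n' and the opening tag is wrapped.
        Console.WriteLine(WrapWithFullMatch(multiLine, RegexOptions.Singleline));
    }
}
```
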
0

Here is a working example of how to achieve this more or less safely:

var xml = "<div id=\"Content\">\r\n  <p>1</p>\r\n  <h2>1</h2>\r\n  <h2>2</h2>\r\n</div>";
var result = Regex.Replace(xml, @"<[^>]+?>", @"<span>$&</span>");

The regex used is <[^>]+?>, which matches <, then one or more characters that are not >, then the closing >. In the replacement string, $& inserts the whole match.

Wiktor Stribiżew
  • I just checked, and removed the unnecessary (in this case) `Singleline` option. I am not using `.` regex metacharacter like Fede, so I do not need it as `[^>]+` already matches newline symbols. – Wiktor Stribiżew Apr 27 '15 at 20:13
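
As a quick check of that claim, here is a minimal sketch (the names `NegatedClassDemo` and `WrapTags` and the sample input are hypothetical) that runs the same pattern, with no regex options, on a tag whose attribute is on its own line.

```csharp
using System;
using System.Text.RegularExpressions;

static class NegatedClassDemo
{
    // [^>] matches any character except '>', newlines included,
    // so no RegexOptions are needed for tags that span lines.
    public static string WrapTags(string xml) =>
        Regex.Replace(xml, @"<[^>]+?>", "<span>$&</span>");

    static void Main()
    {
        var xml = "<div\r\n  id=\"Content\">\r\n  <p>1</p>\r\n</div>";
        Console.WriteLine(WrapTags(xml));
    }
}
```

One limitation to keep in mind: a literal > inside an attribute value or a comment would end the match early, which is part of why this is only "more or less" safe.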
-1

How about this:

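using System.Text.RegularExpressions;

string input = "<div id=\"Content\">" +
               "<p>1</p>" +
               "<h2>1</h2>" +
               "<h2>2</h2>" +
               "</div>";
// Optional slash, tag name, then any attributes up to the closing '>'.
string pattern = @"(</?\w+[^>]*>)";

// $1 is the captured tag, so every tag (including <div id="Content">) gets wrapped.
string output = Regex.Replace(input, pattern, "<span>$1</span>");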
jdweng