
I have some really big XML files that fail validation because users typed < and > instead of &lt; and &gt; in attributes.

Is there a way in C# (.NET Core) to quickly replace all the < and > with &lt; and &gt;?

I have XML like so:

<?xml version="1.0" encoding="utf-8"?>
<rootXML test="b < a">
<inside anotherTest="i could have < and > in here">Hello < all</inside>
</rootXML>

The hard part is that I don't know where the < and > are, and the XML files can be quite different.

Thanks.

Michael Kay
cdub
  • You don't have a really big XML file, you have a heap of junk. – Michael Kay Sep 20 '21 at 23:12
  • Assuming you're just worried about attributes, preprocess the string to replace each `>` or `<` that appears after an odd number of `"` characters. – David Browne - Microsoft Sep 21 '21 at 00:10
  • @DavidBrowne-Microsoft: That won't work. Quotes can appear elsewhere besides as attribute value delimiters. The inescapable problem is that it's impossible to parse an undefined language. For a collection of practical methods to (try to) ameliorate such markup messes, see [How to parse invalid (bad / not well-formed) XML?](https://stackoverflow.com/q/44765194/290085). – kjhughes Sep 21 '21 at 00:54
  • In general, yes. I.e., if there is XML or HTML pasted into attributes or text nodes, nothing will work. But in specific cases you can preprocess the text to make it valid. `>` and `<` appearing in text nodes is more complex, but not terribly much so. Text nodes always appear after `>` and are always followed by ``. Of course if the text contains `` you're screwed. But if it's just `>` or `<` you can figure it out. – David Browne - Microsoft Sep 21 '21 at 01:10
  • @DavidBrowne-Microsoft: Assuming rule-breaking to be bound by rules is unwise. – kjhughes Sep 21 '21 at 01:17
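
For anyone who wants to try the preprocessing idea from the comments, here is a minimal sketch of the "odd number of `"` characters" heuristic in C#. `AttributeEscaper` and its `Escape` methods are hypothetical names made up for this illustration, not part of .NET. The sketch assumes every attribute value is delimited by double quotes and that no `"` appears anywhere else (text nodes, comments, CDATA sections, single-quoted attributes). As kjhughes points out, that assumption may not hold for real input, and this does nothing about a stray `<` in text content such as `Hello < all`.

```csharp
using System.IO;

// Hypothetical helper for illustration only; not a .NET API.
public static class AttributeEscaper
{
    // Copies input to output one character at a time, escaping '<' and '>'
    // whenever an odd number of '"' characters has been seen so far.
    // ASSUMPTION: '"' is used only as an attribute value delimiter.
    public static void Escape(TextReader input, TextWriter output)
    {
        bool insideQuotes = false;
        int next;
        while ((next = input.Read()) != -1)
        {
            char c = (char)next;
            if (c == '"')
            {
                // Each double quote flips between "inside" and "outside" a value.
                insideQuotes = !insideQuotes;
                output.Write(c);
            }
            else if (insideQuotes && c == '<')
            {
                output.Write("&lt;");
            }
            else if (insideQuotes && c == '>')
            {
                output.Write("&gt;");
            }
            else
            {
                output.Write(c);
            }
        }
    }

    // Convenience overload for strings that fit in memory.
    public static string Escape(string xml)
    {
        using (var reader = new StringReader(xml))
        using (var writer = new StringWriter())
        {
            Escape(reader, writer);
            return writer.ToString();
        }
    }
}
```

Because the `TextReader`/`TextWriter` overload streams character by character, a large file could be processed without loading it fully, for example by passing `File.OpenText(inputPath)` and `File.CreateText(outputPath)` (both paths are placeholders). Validate the result with a real XML parser afterwards, since input that breaks the quoting assumption will come out wrong.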

0 Answers