
I have some VB.Net code which is parsing an XML string.

The XML string comes from a third-party TCP stream, so we have to take the data as it arrives and deal with it. The issue is that the data in one of the elements can sometimes contain special characters, e.g. &, $, <, and when `XMLDoc.LoadXml(XML)` is executed it fails. Note that XMLDoc is declared as `Dim XMLDoc As XmlDocument = New XmlDocument()`.

I have tried to Google answers for this but I am really struggling to find a solution. I have looked at a regex but realised this has some limitations; or I just don't understand it enough lol.

If it helps, here is an example of the XML streamed to us (for info, the Message tag comes from an SMS message). The only part that will ever have an error, and all I have to check, is the <Message>O&N</Message> section; in this case the message has come in with an &:

<IncomingMessage><DeviceSendTime>19/02/2013 14:00:50</DeviceSendTime>
 <Sender>0000111111</Sender>
 <Status>New</Status>
 <Transport>Sms</Transport>
 <Id>-1</Id>
 <Message>O&N</Message>
 <Timestamp>19/02/2013 14:00:50</Timestamp>
 <ReadTimestamp>19/02/2013 14:00:50</ReadTimestamp>
</IncomingMessage>
Brian Webster
  • That's just bad data. There's not much you can do except fix it at the source. If you know the XML standard pretty well there are a few regular expressions you can write to deal with some of it, but that's just a bandaid for a bigger problem. – Dustin Kingen Feb 19 '13 at 17:53
  • I agree with @Romoku: cleaning the XML via a regex is just a bandaid. – malkassem Feb 19 '13 at 17:56
  • `&` and `$` can probably be "cleaned" trivially, but `<` will be difficult. Could you provide examples where angle brackets show up in your stream? Also, which other "error" characters are you seeing? – Tim Pietzcker Feb 19 '13 at 18:00
  • Is there any particular reason you're using `XmlDocument` and not `XDocument` from Linq to XML? – Zev Spitz Feb 19 '13 at 18:03
  • If the only time you have a problem is within the `<Message>` tag, then you could use a regular expression for searching specifically within that tag, which would make it possible to search for `<` or `>` as well. Unless there are sometimes nested tags within the `Message`? – Zev Spitz Feb 19 '13 at 18:07
  • Just wondering, but would the HtmlAgilityPack ( http://htmlagilitypack.codeplex.com/ ) be able to parse it? – Andrew Morton Feb 19 '13 at 21:24

2 Answers

If we're looking specifically within Message elements, and assuming there are no nested elements within the Message element:

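' Requires: Imports System.Net, System.Text.RegularExpressions, System.Xml.Linq
Dim url = "put url here"
Dim s As String

' XML special characters and the entities that replace them
Dim characterMappings = New Dictionary(Of String, String) From {
    {"&", "&amp;"},
    {"<", "&lt;"},
    {">", "&gt;"},
    {"""", "&quot;"}
}

Using client As New WebClient
    s = client.DownloadString(url)
End Using

' The outer pattern selects only the text between <Message> and </Message>;
' the inner replace escapes each special character found in that text.
' (Matching the tags themselves would replace the whole element with the entity.)
Dim specialChars = String.Join("|", characterMappings.Keys)
s = Regex.Replace(s,
    "(?<=<Message>).*?(?=</Message>)",
    Function(m) Regex.Replace(m.Value, specialChars, Function(c) characterMappings(c.Value)),
    RegexOptions.Singleline
)
Dim x = XDocument.Parse(s)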

$ should not be an issue with XML, but if it is you can add it to the dictionary.

Use of WebClient comes from here.
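Since the question uses `XmlDocument` with a string that has already been received, here is a minimal sketch of the same idea without the download step. The module name and the `EscapeMessageContent` helper are hypothetical names chosen for illustration, and the sketch assumes a single, non-nested Message element whose content is not already entity-escaped.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions
Imports System.Xml

Module MessageEscapeSketch
    ' Hypothetical helper (not from the original answer): escapes XML special
    ' characters, but only between <Message> and </Message>.
    Function EscapeMessageContent(xml As String) As String
        Dim characterMappings = New Dictionary(Of String, String) From {
            {"&", "&amp;"},
            {"<", "&lt;"},
            {">", "&gt;"},
            {"""", "&quot;"}
        }
        Return Regex.Replace(xml,
            "(?<=<Message>).*?(?=</Message>)",
            Function(m) Regex.Replace(m.Value, "[&<>""]",
                                      Function(c) characterMappings(c.Value)),
            RegexOptions.Singleline)
    End Function

    Sub Main()
        Dim raw = "<IncomingMessage><Sender>0000111111</Sender>" &
                  "<Message>O&N</Message></IncomingMessage>"
        Dim xmlDoc As New XmlDocument()
        ' Loads without an XmlException because the & is now &amp;
        xmlDoc.LoadXml(EscapeMessageContent(raw))
        ' InnerText returns the unescaped text, so this prints O&N
        Console.WriteLine(xmlDoc.SelectSingleNode("/IncomingMessage/Message").InnerText)
    End Sub
End Module
```

If the sender may already escape some messages, the helper would double-escape them (`&amp;` becomes `&amp;amp;`), so that case would need separate handling.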

Updated

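Since $ has special meaning in regular expressions, it cannot simply be joined into the pattern from the dictionary keys; it needs to be escaped with \ in the regular expression pattern. It also needs its own dictionary entry (for example {"$", "&#36;"}), otherwise the lookup in the replacement function throws a KeyNotFoundException. The simplest way to handle the escaping is to write the character pattern manually, instead of joining the keys of the dictionary:

s = Regex.Replace(s,
    "(?<=<Message>).*?(?=</Message>)",
    Function(m) Regex.Replace(m.Value, "&|<|>|\$", Function(c) characterMappings(c.Value)),
    RegexOptions.Singleline
)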
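An alternative to writing the pattern by hand is to pass each key through `Regex.Escape` before joining, which turns `$` into `\$` automatically. The sketch below is only an illustration: the module and function names are hypothetical, and the `$` to `&#36;` mapping is an assumed choice, since `$` is legal in XML text and may not need replacing at all.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions

Module EscapedPatternSketch
    ' Hypothetical helper: replaces every dictionary key found in the text.
    Function EscapeSpecials(text As String) As String
        ' "$" -> "&#36;" is an illustrative mapping only
        Dim characterMappings = New Dictionary(Of String, String) From {
            {"&", "&amp;"},
            {"<", "&lt;"},
            {">", "&gt;"},
            {"$", "&#36;"}
        }
        ' Regex.Escape makes each key safe to use as a literal in the pattern
        Dim pattern = String.Join("|",
            characterMappings.Keys.Select(Function(k) Regex.Escape(k)))
        Return Regex.Replace(text, pattern,
            Function(m) characterMappings(m.Value))
    End Function

    Sub Main()
        ' Prints: O&amp;N costs &#36;5
        Console.WriteLine(EscapeSpecials("O&N costs $5"))
    End Sub
End Module
```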

Also, I highly recommend Expresso for working with regular expressions.

Community
Zev Spitz
  • Hi Zev, thanks very much for your response (and everyone else, it is appreciated). The one good thing I have is that I only have to check the '<Message>' tag; every other tag will be 100% fine. I have ticked your answer as I did look at this as a possible answer but was just not sure of the exact syntax of the regex, so thanks for that. I am going to give this a go later today/tomorrow and will put my feedback here on how I get on. Once again, thanks to you all for a quick response. Cheers, Steve (Just for info, there are only a few chars we have problems with, so just the $ to add really) – user2088072 Feb 20 '13 at 10:10
  • @user2088072 Are you sure the `$` is causing problems? It's not a special XML character, and therefore shouldn't prevent parsing as XML even if it's inside the data. – Zev Spitz Feb 21 '13 at 07:15

Your XML is invalid, and hence it is not XML. Either fix the code that generates the XML (the correct approach), or pretend this is a text file and enjoy all the problems of parsing unstructured text.

As you've stated in the question, <Message>O&N</Message> is not valid XML. The most likely reason for such "XML" is that it was built with string concatenation instead of proper XML manipulation methods. Unless you use some arcane language, every language in practical use has built-in or library support for XML creation, so it should not be too hard to create the XML correctly.
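For whoever controls the sending side, building the document through an XML API rather than string concatenation could look like the following sketch. It uses `XElement` from LINQ to XML; the module name, function name, and parameters are hypothetical, and only two of the elements from the question are included for brevity.

```vbnet
Imports System
Imports System.Xml.Linq

Module BuildXmlSketch
    ' Hypothetical helper: XElement escapes special characters in text content,
    ' so the caller can pass the raw SMS text.
    Function BuildIncomingMessage(senderNumber As String, message As String) As String
        Dim root = New XElement("IncomingMessage",
            New XElement("Sender", senderNumber),
            New XElement("Message", message))
        Return root.ToString(SaveOptions.DisableFormatting)
    End Function

    Sub Main()
        ' Prints: <IncomingMessage><Sender>0000111111</Sender><Message>O&amp;N</Message></IncomingMessage>
        Console.WriteLine(BuildIncomingMessage("0000111111", "O&N"))
    End Sub
End Module
```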

Alexei Levenkov
  • It's not *his* XML. That is the problem. – Tim Pietzcker Feb 19 '13 at 18:27
  • @TimPietzcker, it *is not XML*, so trying to parse it with XML parser is asking for trouble. It would be much easier to just do custom matching for fixed strings to get ranges instead of trying to shoehorn into XML. – Alexei Levenkov Feb 19 '13 at 18:31
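
The fixed-string matching suggested in the last comment could be sketched roughly as below. This is an assumption-laden illustration rather than anything posted in the thread: `ExtractMessage` is a hypothetical helper, and it assumes each payload contains exactly one Message element.

```vbnet
Imports System

Module FixedStringSketch
    ' Hypothetical helper: returns the raw text between <Message> and </Message>,
    ' or Nothing when either tag is missing.
    Function ExtractMessage(raw As String) As String
        Const openTag As String = "<Message>"
        Const closeTag As String = "</Message>"
        Dim startIndex = raw.IndexOf(openTag, StringComparison.Ordinal)
        If startIndex < 0 Then Return Nothing
        startIndex += openTag.Length
        ' LastIndexOf tolerates a literal "</Message>" typed inside the SMS text
        Dim endIndex = raw.LastIndexOf(closeTag, StringComparison.Ordinal)
        If endIndex < startIndex Then Return Nothing
        Return raw.Substring(startIndex, endIndex - startIndex)
    End Function

    Sub Main()
        ' Prints: O&N
        Console.WriteLine(ExtractMessage("<IncomingMessage><Message>O&N</Message></IncomingMessage>"))
    End Sub
End Module
```

The remaining elements, which the asker says are always well formed, could be read the same way or parsed as XML once the Message content has been cut out.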