I've got a JSON payload from a third-party RESTful service. Right off the bat, please don't reply with "tell the vendor to fix it", because I already tried.
The issue is that one of the fields contains an XML document, and sometimes that XML has unescaped double quotes.
I'm working on a custom JSON deserializer for this field, so I have the XML as a string to work with. I need to match all unescaped double quotes so I can replace them with escaped quotes. I'm using C# and regex.
Here's a rather ugly sample that shows the structure:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<SPC GENERATOR="Converter" TIME="Tue May 07 05:05:43 2019" SRC="\\word\document.xml">
<stuff>
<toad>A1,"Description"</toad>
<frog>the \"Description of the "object" A1\"</frog>
<tadpole>what about \"this\" one?</tadpole>
</stuff>
The tricky part is I don't want the double quotes inside the <?xml> or <SPC> elements escaped.
I've tried all kinds of permutations, along the lines of:
(?:<([A-Z][A-Z0-9]*)\b[^>]*>)((.*?)"?(.*?))(?:<\/\1>)
but I can't seem to get the quotes inside the elements matched.
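For reference, here is a rough sketch of the behavior I'm after when working on the XML string alone. It is not a confirmed solution. The idea is to match a double quote that is not preceded by a backslash and is not followed by a `>` before the next `<`, which should mean it sits in element text rather than inside a tag. The class and method names are made up for illustration, and it assumes element text never contains a literal `<` or `>` and that a backslash before a quote always means the quote is already escaped.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XmlTextQuoteFixer
{
    // (?<!\\)       : the quote is not already preceded by a backslash
    // "             : the quote itself
    // (?![^<>]*>)   : no '>' is reached before the next '<', i.e. we are not inside a tag
    private static readonly Regex BareQuoteInText =
        new Regex(@"(?<!\\)""(?![^<>]*>)");

    // Returns the XML with bare quotes in element text replaced by \"
    // Quotes inside tags such as <?xml ...?> or <SPC ...> are left alone.
    public static string EscapeTextQuotes(string xml)
    {
        return BareQuoteInText.Replace(xml, "\\\"");
    }
}
```

With the sample above, the intent is that `<toad>A1,"Description"</toad>` becomes `<toad>A1,\"Description\"</toad>` while the attributes on `<SPC>` stay untouched.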
Any help (that doesn't tell me it shouldn't be done, or get the payload fixed at the source, etc.) will be appreciated.
First, I'll reiterate: please DO NOT TELL ME NOT TO DO THIS. That advice is irrelevant to the problem, and if I could use a bloody parser, I would. The data is malformed because of this one issue, and I want to fix that issue before parsing.
That said, there's a bigger issue that I didn't see at first because I was focusing on just the XML. The double quote is not just an XML issue: it causes parsing of the outer JSON to stop. When I debugged the ReadJson method of my custom JsonConverter, I found that the reader value contained the start of the XML field but stopped at the first problematic double quote.
So it's really an issue with this:
{
"Limit":1,
"Offset":1,
"TotalRecords":1,
"TotalPages":1,
"Message":null,
"Resource":
{
"Content":"<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<SPC GENERATOR="Converter" TIME="Tue May 07 05:05:43 2019" SRC="\\word\document.xml">
<stuff>
<toad>A1,"Description"</toad>
<frog>the \"Description of the "object" A1\"</frog>
<tadpole>what about \"this\" one?</tadpole>
</stuff></SPC>",
"EmcUrl":"http://stuff",
"Id":21188,
"Version":3
}
}
The reader value is returning:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<SPC GENERATOR="Converter" TIME="Tue May 07 05:05:43 2019" SRC="\\word\document.xml">
<stuff>
<toad>A1,
So I need to fix the double quotes in the raw string first, then let JsonConvert.DeserializeObject() do its thing. My bad. The issue might still be solvable with a regex replacement, but the replacement would have to avoid escaping the quotes in the surrounding JSON fields.
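To make the requirement concrete, here is a minimal sketch of the kind of pre-processing I have in mind. It is an assumption-laden sketch and not a tested fix. It isolates the value of "Content" by assuming it always starts after `"Content":"` and ends at the `",` that precedes `"EmcUrl"` (as in the sample), then escapes every quote in that span that is not already preceded by a backslash. The class and method names are hypothetical. Note that at the JSON level the quotes inside the `<?xml>` and `<SPC>` tags also need escaping, so this version does not skip them. It would also mishandle an escaped backslash directly followed by a quote (`\\"`).

```csharp
using System;
using System.Text.RegularExpressions;

public static class ContentQuoteFixer
{
    // Matches only the text between "Content":" and the closing quote
    // that is followed by ,"EmcUrl" (field order assumed from the sample).
    private static readonly Regex ContentValue = new Regex(
        @"(?<=""Content""\s*:\s*"")(.*?)(?=""\s*,\s*""EmcUrl"")",
        RegexOptions.Singleline);

    // A double quote that is not already preceded by a backslash.
    private static readonly Regex BareQuote = new Regex(@"(?<!\\)""");

    // Escapes bare quotes inside the Content value and leaves the
    // surrounding JSON fields untouched.
    public static string EscapeContentQuotes(string rawJson)
    {
        return ContentValue.Replace(
            rawJson,
            m => BareQuote.Replace(m.Value, "\\\""));
    }
}
```

The result of `ContentQuoteFixer.EscapeContentQuotes(rawJson)` would then be what gets passed to JsonConvert.DeserializeObject().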
So yeah, I can't see where this is a duplicate. I would prefer a reply that told me a way to make this work (useful) rather than telling me that this is a bad idea (not useful), please.