Regex to match all non-escaped double quotes in XML string

Question

I've got a json payload from a 3rd party RESTful service. So right off the bat, please don't reply with "tell the vendor to fix it", because I already tried.

The issue is that one of the fields is an XML structure, and sometimes the structure has unescaped double quotes.

I'm working up a custom json deserializer to use on this field, so I've got the XML as a string to work on. I need to match all unescaped double quotes so I can replace them with escaped quotes. I'm using C# and regex.

Here's a rather ugly sample that shows the structure:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<SPC GENERATOR="Converter" TIME="Tue May 07 05:05:43 2019" SRC="\\word\document.xml">
<stuff>
    <toad>A1,"Description"</toad>
    <frog>the \"Description of the "object" A1\"</frog>
    <tadpole>what about \"this\" one?</tadpole>
</stuff>

The tricky part is I don't want the double quotes inside the <?xml> or <SPC> elements escaped.

I've tried all kinds of permutations, along the lines of:

(?:<([A-Z][A-Z0-9]*)\b[^>]*>)((.*?)"?(.*?))(?:<\/\1>)

but I can't seem to get the quotes inside the elements matched.

Any help (that doesn't tell me it shouldn't be done, or get the payload fixed at the source, etc.) will be appreciated.

First, I'll reiterate - please DO NOT TELL ME NOT TO DO THIS. That advice/response is irrelevant to the problem, and if I could use a bloody parser, I would. The problem is that the data is badly formatted because of one issue, and I want to fix that issue before parsing.

That said, there's a bigger issue that I didn't see at first because I was focusing on just the XML. The double quote is not just an XML issue. It's causing the outer json to stop processing. When I debugged the custom JsonConverter ReadJson method, I found that the reader value had the start of the XML field, but stopped at the first problematic double quote.

So it's really an issue with this:

{
  "Limit":1,
  "Offset":1,
  "TotalRecords":1,
  "TotalPages":1,
  "Message":null,
  "Resource":
    {
      "Content":"<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<SPC GENERATOR="Converter" TIME="Tue May 07 05:05:43 2019" SRC="\\word\document.xml">
<stuff>
    <toad>A1,"Description"</toad>
    <frog>the \"Description of the "object" A1\"</frog>
    <tadpole>what about \"this\" one?</tadpole>
</stuff></SPC>",
 "EmcUrl":"http://stuff",
  "Id":21188,
  "Version":3
  }
}

The reader value is returning:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<SPC GENERATOR="Converter" TIME="Tue May 07 05:05:43 2019" SRC="\\word\document.xml">
<stuff>
    <toad>A1,

So I need to fix the double quote issue in the string, then let the JsonConvert.DeserializeObject() method do its thing. My bad. The issue might still be solvable with regex replacement, but the replacement would have to avoid escaping the quotes in the surrounding json fields.

So yeah, I can't see where this is a duplicate. I would prefer a reply that told me a way to make this work (useful) rather than telling me that this is a bad idea (not useful), please.

Unescaped double quotes in element text are OK in XML. What output do you expect? – choroba, Sep 18 '19 at 15:35
(1) What @choroba said. (2) Don't parse XML with regex; use a real XML parser. Your current self, your future self, and everyone who has to use or maintain your code will thank you. – kjhughes, Sep 18 '19 at 17:25
@kjhughes, while I agree that we should NOT parse (?:HT|X)ML content with regexp, this substitution seems to be performed here on "[a limited and known set of HTML](https://stackoverflow.com/a/1733489/4375327)". – Amessihel, Sep 18 '19 at 17:29
@kjhughes, I agree with what Rob said. Except, if you say "counting < and > occurrences is better than a regexp", mmh, no. To me they are both quirk fixes dealing with a bad upstream product. With this in mind, if my answer encourages bad practices, I'm ready to remove it. – Amessihel, Sep 18 '19 at 17:44
And if the condescending responses were because of my not supplying the bigger contextual picture, mea culpa and apologies for that. Asking for more would have been a welcome response, BTW. – Stan Spotts, Sep 18 '19 at 20:18
See [What is the XY problem?](https://meta.stackexchange.com/q/66377/234215) and [How should I escape strings in JSON?](https://stackoverflow.com/q/3020094/290085) – kjhughes, Sep 19 '19 at 00:03
The kicker is that I can't just escape all double quotes because some in the XML string are required not to be escaped. As for X-Y questions, well, had the responses simply been what I actually asked for, the context would have been irrelevant. Having to "prove" the need and expand every bit does not affect the original request of wanting a regex pattern that could match every instance of an unescaped double quote that's between two arbitrary strings. I specifically asked NOT to receive opinions like "you shouldn't do that", which should have been the cue that I already knew it. – Stan Spotts, Sep 19 '19 at 03:11

score 2 · Answer 1 · answered Sep 18 '19 at 15:48

I feel your pain! Especially not wanting the comments that tell you the problem shouldn't exist ... I'll try not to do that.

If I understand your problem correctly you have an issue in that your response is json, one of the json fields contains XML, and you need to be able to extract the XML correctly when it may include quotes that for a normal json parsing process would terminate the content.

(Incidentally this is one of the drawbacks of json structures and it never ceases to amaze me why people use json instead of XML - but then much of what happens in the world makes no sense to me.)

I would be inclined to try a different approach. I'm a C++ programmer not C# so I won't give you any code but I expect you can sort that out yourself.

If you know which field contains the XML, then you can get to the start of that field in the json stream. You then want to find the end of that field in the json stream ... at that point I would take the rest of the content (all the XML and the rest of the json (including this field's terminator)) and parse it as XML. Providing the XML is well formed you should be able to find the last tag, and then you know that's where the XML finishes.

I'm not sure you even need to use a RegEx: simply scan the string, incrementing a counter for every < character and decrementing it for every > character. If you encounter the json field terminator with a counter of 0, you have reached the end of the XML. As < and > need to be encoded as &lt; and &gt; in the XML, this should always work.

Does that help at all?

Rob Lambden

Thanks for the response! So, yeah, getting the XML string is simple since I'm doing the deserialization of the json field. "Normal" deserialization is what barfs because of the unescaped quotes. I had it in my todo's to load the string into an XDocument instance and process it. But I figured there had to be a regex pattern that would let me use Regex.Replace() to fix them in one shot. It's just annoying that it's so tricky. – Stan Spotts Sep 18 '19 at 16:43
Once you have the Xml using a RegEx to process it should be simple, as pointed out in the comments to the question quotes don't need to be escaped when they're part of a text node, so I'm not sure why you think you need to escape them... – Rob Lambden Sep 18 '19 at 18:44
It appears that I need to replace unescaped double quotes that are found between the strings "". If it helps not to think about this as XML or json, and just as a string, maybe it'll help? :) – Stan Spotts Sep 18 '19 at 20:39
In my experience you shouldn't change what the contents of the XML string are _unless_ you know that it has itself been changed by a dodgy implementation (so you _know_ what it has sent is wrong). Use the counter technique to find the XML content - once you are in XML content you don't care about the quotes any more - you just care about < and > - and when you hit a quote with the counter at 0 you know the XML has ended and you can go back to deserialising the json. – Rob Lambden Sep 18 '19 at 21:21
Dodgy isn't the word. I've been trying to make the data from the 3rd party services usable for a week, at least. Exactly why shouldn't I treat the json (prior to deserialization) as a simple text string and use regex to match/replace any unescaped double quote with an escaped one, while between the first and last XML element? There's no issue with attributes, as it's not HTML elements I'm dealing with. You'll love this: I've found that sometimes the vendor has neglected to encode ">" and "<" signs within a node. It doesn't affect the json deserialization, so I do use regex to fix those. – Stan Spotts Sep 19 '19 at 03:07
Note that I wouldn't be doing this at all if the response from the vendor wasn't along the lines of, "our data has been created over a period of 15 years using various technologies, which weren't as robust as today's tools. But we have no plans to go back and fix the old data." – Stan Spotts Sep 19 '19 at 03:18
I've had responses like that from vendors before ... is there a community of developers using their data who can advise how they deal with it? Looking at the data, can you rely on being able to catch the XML content with **="(.*?)",/ms** which would work in your example but may not be robust enough in real life. Replace the matched string (XML) and save it - do your json parsing, and separately parse (or embed if you don't care about it) the XML data. – Rob Lambden Sep 19 '19 at 07:08
I haven't seen any community support, and the vendor is in the UK. Kind of a niche data set as well. I'm pretty much on my own with these guys, and the business says use this service. The xml always starts with the and a DOCTYPE, which I can ignore for the fixing. And it always starts and ends with an SPC tag, which is one of its few consistencies. – Stan Spotts Sep 19 '19 at 11:36
That regex will get all double quotes, but I need only the unescaped ones. (?<!\\)"(.*?)(?<!\\)" will do that, but I need it to happen only between the and tags. Unless I use string methods to extract that part then use regex on it and then put it back. – Stan Spotts Sep 19 '19 at 11:47
@Stan Spotts - Use the Regex (in Perl it would be **=~s/"Content":="(.*?)",/"Content":"XML",/ms**) to extract all of the XML data in one go. The trailing **",** only appears at the end of the json wrap of the XML. The grouped data will be the XML content. You can parse all of the json correctly, Resource.Content will be *XML*. Then you can set Resource.Content to be either the extracted XML itself (if you don't care about structure) or a separate parse of the XML if you do. If you then serialise Resource (if you need to) your own serialiser should give you valid json output if you need it. – Rob Lambden Sep 19 '19 at 12:46
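
For reference, a minimal C# sketch of the "extract that part, use regex on it, then put it back" idea from the comments above. It is only an illustration under stated assumptions: the XML body is wrapped in a single <SPC ...> ... </SPC> pair (as described in the thread), the opening SPC tag contains no ">" inside its attributes, and any double quote directly preceded by a backslash counts as already escaped. The names QuoteFixer and EscapeBareQuotes are hypothetical, chosen for this example; only System.Text.RegularExpressions is used.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteFixer
{
    // Group 1: opening <SPC ...> tag, group 2: body (lazy), group 3: closing </SPC> tag.
    // Singleline lets '.' match newlines, since the XML spans several lines.
    private static readonly Regex SpcBody = new Regex(
        @"(<SPC\b[^>]*>)(.*?)(</SPC>)",
        RegexOptions.Singleline);

    // A double quote that is NOT immediately preceded by a backslash.
    // (Assumption: a preceding backslash always means "already escaped".)
    private static readonly Regex BareQuote = new Regex(@"(?<!\\)""");

    // Escapes bare quotes only inside the SPC body; the opening and closing
    // tags and everything outside the SPC element are returned unchanged.
    public static string EscapeBareQuotes(string rawJson)
    {
        return SpcBody.Replace(rawJson, m =>
            m.Groups[1].Value
            + BareQuote.Replace(m.Groups[2].Value, "\\\"") // replacement text is \"
            + m.Groups[3].Value);
    }
}
```

The result of QuoteFixer.EscapeBareQuotes(rawJson) would then be handed to JsonConvert.DeserializeObject() as usual. Two limitations are worth noting: a quote preceded by an escaped backslash (\\") is treated as escaped and left alone, and quotes in the <?xml ...?> declaration and the opening SPC tag are not touched, so they must already be escaped in the raw payload for the json to parse.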
