Regex or Xpath for extracting nodes?

Question

I have an XML file with the following structure:

<JobList>
 <Job><subnodes/></Job>
 <Job><subnodes/></Job>
</JobList>

This XML is sometimes broken: the closing </JobList> tag and a closing </Job> tag can be missing.

I would like to extract, with their full content, the <Job> nodes that are properly closed with </Job>. What is the best way to do this?

To make a long story short, I am using .NET and its built-in serializers to deserialize XML content. But since new properties are added over time, you cannot just go back and forth between different versions, as the deserializer is too strict. Mostly it works, but I would like to have a backup recovery method for this, hence the question.

The current situation is that the deserializer "crashes" the whole deserialization when a new property has been added, instead of ignoring it. I am looking to parse the file manually when that error occurs.

If the XML is broken, then fix it. I mean, fix the error in the upstream system which is generating it. — , Aug 10 '17 at 16:12
You don't have XML that is sometimes broken. What you have is *always* broken but accidentally happens to work once in a while. Whatever generates this file is improperly programmed and needs to be fixed. Trying to fix the file instead is a waste of time. — Tomalak, Aug 10 '17 at 16:26
Possible duplicate of [How to parse invalid (bad / not well-formed) XML?](https://stackoverflow.com/questions/44765194/how-to-parse-invalid-bad-not-well-formed-xml) — kjhughes, Aug 10 '17 at 16:42
Thanks for your answers, but I was hoping not to discuss why I need this. To make a long story short, I am using .NET and its built-in serializers to deserialize XML content. But since new properties are added, you cannot just go back and forth between different versions, as it is too strict. Mostly it works, but I would like to have a backup recovery method for this, hence the question. — serializer, Aug 10 '17 at 16:44
It is not a duplicate of that post, as it is not about characters but general content. If the deserializer finds a new property, it will crash the deserialization of the whole file instead of ignoring the property. — serializer, Aug 10 '17 at 16:46
Wrong. Well-formedness is not just about characters. You're completely missing the point that now three people are trying to tell you. If your data is missing an end-tag, ***it is not XML***, and you cannot use any XML tools to help you. Fix the data. The duplicate link addresses both the concept of well-formed (which you clearly do not yet understand) and remedies. — kjhughes, Aug 10 '17 at 16:47
This was one example of what could happen, for example if the serialization is aborted by disk issues etc. I have tried to handle it, but I am looking for last-resort answers. The "solutions" are not related to what I want to achieve. But I do appreciate the answers. — serializer, Aug 10 '17 at 16:51
To be clear. The data is well-formed as it is created by .NET serializer. I do not touch that part. But, if you ever worked with this in .NET you will find that there is no way to have a more relaxed approach to deserialization. You cannot tell it to ignore properties that are unknown to an object. And this can happen between two versions of an object as properties can be added. — serializer, Aug 10 '17 at 16:57
You state: *This xml can be broken sometimes leaving a missing ending of `<JobList>` and missing end of `</Job>`.* This means it's (sometimes) not well-formed. — kjhughes, Aug 10 '17 at 17:21
Yes, this is one of the potential problems. But as it happens in .NET, I have no control over it. My question was about how to extract the nodes manually so I could fix the problems by hand when they happen. My mistake was either adding the reason why here or adding too little information, as the original problem is related to .NET. But my solution to this was to handle it manually when .NET fails. — serializer, Aug 10 '17 at 17:33
I do not believe that the .NET serializer would ever produce not well-formed XML. If you have proof otherwise, then post a **[mcve]**. You've already received the best advice possible given what you've asked. This comment trail is way too long and a sign that you've not properly formulated your question. See also [ask]. Thanks. — kjhughes, Aug 10 '17 at 17:57
Well, you do not have any experience with the high-level methods in .NET. I asked for help with regular expressions, not for your comments on what possible problems with high-level methods may or may not exist. I agree that I could have been clearer about the problem. I was not asking you to analyze how .NET works or any potential problems with it, but for help with parsing at a lower level. Even though I agreed that I had not formulated it perfectly and reformulated my original question, you kept going on about irrelevant .NET questions. I guess you won some sort of prize. — serializer, Aug 10 '17 at 22:04

score 1 · Accepted Answer · answered Aug 11 '17 at 11:27

As mentioned in the comments, the ideal would be to make the XML valid. If for whatever reason that is not possible, the workaround is to parse the file as text with a regex. A general regex for this case could be something like:

<Job>((?!<Job>).)*</Job>$

This will match anything between a complete <Job> ... </Job> pair. Please note that it will also return nodes with 'broken' inner nodes, but according to your question you are only concerned about missing </JobList> and </Job> tags.
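
As a rough illustration of the approach above, here is a minimal Python sketch. The question is about .NET, so the language choice, the function name extract_complete_jobs and the sample data are all assumptions made for this example. It also adapts the pattern slightly: the trailing $ is dropped and the quantifier is made lazy so that several jobs can be collected in one pass, and the dot is allowed to match line breaks so that a job spanning several lines is still found.

```python
import re

# Adapted pattern (an assumption, not the exact regex from the answer):
# "<Job>", then any characters that do not start another "<Job>", then "</Job>".
# re.DOTALL lets "." cross line breaks inside a job.
JOB_PATTERN = re.compile(r"<Job>(?:(?!<Job>).)*?</Job>", re.DOTALL)


def extract_complete_jobs(text):
    """Return every <Job>...</Job> block that has both its opening and closing tag."""
    return [match.group(0) for match in JOB_PATTERN.finditer(text)]


# Hypothetical broken input: the second job is never closed and the file is truncated.
broken = """<JobList>
 <Job><Name>first</Name></Job>
 <Job><Name>second</Name>
 <Job><Name>third</Name></Job>
 <Job><Name>fourth</Na"""

jobs = extract_complete_jobs(broken)
for job in jobs:
    print(job)
```

With this input, only the first and third jobs are returned; the unclosed second job and the truncated fourth one are skipped. The same pattern should carry over to the .NET Regex class with RegexOptions.Singleline, though that is not tested here.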
