I'm working in Python with some badly formed XML files, and I need to figure out what's wrong with them (i.e. what the errors are) without actually looking at the data (the files contain sensitive client data).

I figure there should be a way to sanitize the XML (i.e. remove all content in all nodes) but keep the tags, so that I can see any structural issues.

However, ElementTree doesn't return any detailed information about mismatched tags - just a line number and a character position, which are useless if I can't reference the original XML.
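For reference, here is a minimal sketch of how that limited error information can be pulled out without printing any of the file's content. `ET.ParseError` carries a numeric expat error `code` and a `(line, column)` `position`, and `xml.parsers.expat.ErrorString` turns the code into a generic message. The function name `describe_parse_error` is made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


def describe_parse_error(xml_text):
    """Return (message, line, column) for the first well-formedness
    error, or None if the text parses cleanly.

    Hypothetical helper: the message is expat's generic description
    of the error code, so it contains no data from the document.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return expat.ErrorString(err.code), line, column
    return None


if __name__ == "__main__":
    # The closing tag does not match the innermost open element.
    print(describe_parse_error("<a><b></a>"))
```

Note that the parser stops at the first error, so this only ever reports one problem per file.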

Does anyone know how I can either sanitize the XML so I can view it, or get more detailed error messages for badly formed XML (that won't return tag contents)? I could write a custom parser to strip content, but I wanted to exhaust other options first.
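To make the "custom parser" idea concrete, here is a rough sketch of the kind of thing I have in mind. It is not a real XML parser. It walks the text one character at a time and replaces letters and digits with a mask character everywhere except in tag and attribute names. Because it masks rather than deletes, line and column numbers from a parser error should still point at the same place in the masked copy. The name `mask_content` and the choice of `x` as the mask are my own assumptions.

```python
def mask_content(xml_text, mask="x"):
    """Mask letters and digits in text, attribute values, comments,
    CDATA and processing instructions, keeping tag and attribute
    names and all punctuation. Output has the same length as input.

    Hypothetical sketch: a character-level state machine, not a
    parser, so it makes no attempt to validate anything.
    """
    out = []
    in_tag = False   # between '<' and the next '>'
    special = False  # inside <!...> or <?...>
    quote = None     # quote character of the attribute value being read
    for i, ch in enumerate(xml_text):
        hide = ch.isalnum()
        if not in_tag:
            if ch == "<":
                in_tag = True
                special = xml_text[i + 1:i + 2] in ("!", "?")
        elif special:
            if ch == ">":
                in_tag = special = False
        elif quote:
            if ch == quote:
                quote = None
        else:
            hide = False  # tag and attribute names are structure, keep them
            if ch in "\"'":
                quote = ch
            elif ch == ">":
                in_tag = False
        out.append(mask if hide else ch)
    return "".join(out)


if __name__ == "__main__":
    print(mask_content('<a id="42">Bob & Co</b>'))
    # <a id="xx">xxx & xx</b>
```

Caveats: punctuation and whitespace are kept on purpose (a stray `&` or `<` is often the actual error), so anything sensitive that is encoded purely in punctuation or in tag/attribute names would still be visible. The XML declaration also gets masked, since it starts with `<?`.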

khalid13

1 Answer

It's a hard enough problem to try to automatically fix markup problems when you can look at the file. If you're not permitted to see the document contents, forget about having any reasonable hope of fixing such doubly undefined problems.

Your best bet is to fix the bad "XML" at its source.

If you can't do that, I suggest that you use a tool listed in "How to parse invalid (bad / not well-formed) XML?" to attempt to automatically repair the well-formedness problem. Then, after you actually have XML, you can use XML tools to strip or sanitize content (if that's even still necessary at that point).
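As an illustration of that second step only, here is a minimal sketch using the standard library's ElementTree. It assumes the input is already well-formed (that is, the repair has succeeded), and the helper name `strip_content` is hypothetical. It removes all text and blanks every attribute value while leaving element and attribute names in place.

```python
import xml.etree.ElementTree as ET


def strip_content(xml_text):
    """Return the document with all character data removed and all
    attribute values blanked. Requires well-formed input; raises
    ET.ParseError otherwise.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.text = None   # text before the first child
        elem.tail = None   # text after this element's end tag
        for name in elem.attrib:
            elem.attrib[name] = ""
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(strip_content('<a id="7">hi<b>there</b>tail</a>'))
```

Comments and processing instructions are dropped by ElementTree's default parser, so they will not appear in the output either.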

kjhughes
  • I don't need to fix the problems automatically - I just need to get the document to a state where it retains structure but not content. If I can eyeball it and determine the deformity in the XML, I can fix them at source. – khalid13 Sep 13 '18 at 15:32
  • You can't know the intended structure when the markup that conveys that structure is not well-formed, because not being well-formed necessarily implies that the structural information is missing (or damaged, which in the general case is no better than missing). – kjhughes Sep 13 '18 at 15:36
  • And you can't fix the problems manually without seeing the content, so your only options are to fix the source that's generating the bad XML or to attempt to automatically repair the bad XML so you don't have to see the content. See the linked Q/A on how to do the latter. – kjhughes Sep 13 '18 at 15:39
  • Do *you* control the source of the not-well-formed "XML"? – kjhughes Sep 13 '18 at 17:09
  • sort of - the generator is a long and obnoxious method with lots of code paths, but yes, I can change it. It mostly works - but I'm trying to catch the edge cases that are generating bad XML and fix them. To do that, I want to figure out examples of what might be going wrong so I can look at the generator and fix it (again, without looking at the content). – khalid13 Sep 13 '18 at 17:33
  • Then given that you cannot share the input data to the broken program, nor the output of the broken program, nor the broken program itself, I'm afraid that all the advice that remains to give is this: ***Fix the broken program.*** Sorry, but you're in no position to ask for help given that you're unable to share anything that would enable anyone to help you. – kjhughes Sep 13 '18 at 18:03
  • I think retaining an XML's structure while deleting its contents is not an unreasonable ask. I will work on it and post a response when I have something passable. I'm not disagreeing that the broken program needs to be fixed - I am trying to discover what set of inputs leads it to break, in order to fix it by examining the output structure in reverse. – khalid13 Sep 13 '18 at 19:01
  • Yes, *retaining an XML's structure while deleting its contents is not an unreasonable ask*, but you don't have XML. You have "XML," and that makes all the difference in the world: "XML" cannot be parsed because "XML" adheres to no standards or rules. This is my last comment. Good luck. – kjhughes Sep 13 '18 at 23:11