I have a huge file that consists of malformed XML (mostly unescaped XML characters and missing CDATA sections). I am looking for a decent parser that can fix up the malformed XML. I have used IntelliJ IDEA to work around this in some smaller XML files, but the IDE freezes when I give it a huge file.

Are there any decent tools that can fix up malformed XML?

newdev14
  • possible duplicate of [Dealing with malformed XML](http://stackoverflow.com/questions/28909882/dealing-with-malformed-xml) – Sobrique Jun 02 '15 at 08:57

1 Answer

I'm sure someone will tell you to go back and fix the generator of the file. If that's possible, it certainly would be the best thing to do.

It sounds like you're planning to do this more or less by hand - looking for patterns of defects and fixing them up. For that, I'd use Notepad++ - just because I know it, it will handle really big files, and has good search/replace features, including regular expressions. There's a lot of room for improvement, though - in particular, the regular expression language is a bit weak if you're a regexpert.
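For a file too large for an editor, the same pattern-based search/replace can be run as a streaming script. The following is a minimal sketch in Python, not a general XML repair tool: it assumes the dominant defect is a bare `&`, that the file is UTF-8, and that no custom entities are declared in a DTD. The names `escape_bare_ampersands` and `fix_file` are made up for illustration. Note that it would also rewrite ampersands inside existing CDATA sections, where a bare `&` is legal, so those would need separate handling.

```python
import re

# Matches "&" that does NOT start one of the five predefined entities
# or a decimal/hex character reference.
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos);|#[0-9]+;|#x[0-9A-Fa-f]+;)")


def escape_bare_ampersands(line: str) -> str:
    """Replace unescaped ampersands in one line of text with &amp;."""
    return BARE_AMP.sub("&amp;", line)


def fix_file(src_path: str, dst_path: str, encoding: str = "utf-8") -> None:
    """Stream src_path to dst_path line by line, so memory use stays small.

    Assumption: the encoding is UTF-8 unless stated otherwise.
    """
    with open(src_path, encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:
            dst.write(escape_bare_ampersands(line))
```

Writing to a second file rather than editing in place keeps the original available for comparison if the pattern turns out to be too aggressive.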

Anything that tries to understand the XML to do more than chromacoding is likely to be slow when dealing with a file like this.

The XML support in IntelliJ is shockingly bad, performance-wise, given its overall excellence.

Ed Staub
  • I use and like OxygenXML, but didn't find it good for this kind of work. I think they improved their big-file performance since then. What techniques are you using - just normal search-replace with Oxygen pointing out defects, or something else? – Ed Staub Jul 22 '11 at 01:44
  • For escaping XML characters, I just did a search/replace in vi. I was having a hard time identifying sections that need to be put under CDATA. So for that part, I had to validate the XML file several times and fix up the sections that needed to be put under CDATA. Performance-wise, I found OxygenXML to be much better than IntelliJ for this XML work. The only drawback with OxygenXML is that it doesn't validate the entire file in one go (I guess that's probably for performance reasons), whereas in IntelliJ you see all the problems at once, so you only need to validate the file once and fix the issues. – newdev14 Jul 22 '11 at 03:36
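
The validate-fix-revalidate loop described in the comment above can also be driven from a script when an IDE cannot open the file. This is a rough sketch using the expat parser from Python's standard library; the function name `first_xml_error` is hypothetical. Like the behaviour described for OxygenXML, it reports only the first well-formedness error, so it has to be rerun after each fix.

```python
import xml.parsers.expat


def first_xml_error(path):
    """Return (line, column, message) for the first well-formedness error,
    or None if the whole file parses cleanly.

    The parser reads the file incrementally, so the whole document is
    never held in memory at once.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        with open(path, "rb") as f:
            parser.ParseFile(f)
    except xml.parsers.expat.ExpatError as err:
        message = xml.parsers.expat.ErrorString(err.code)
        return err.lineno, err.offset, message
    return None
```

The reported line number gives a place to jump to in an editor such as vi or Notepad++, which is usually enough to spot a section that needs escaping or a CDATA wrapper.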