
I'm trying to convert a 30 GB XML file into an XSD file, but I can't even parse the XML file. Normally I'd do this with an online converter, but the file is far too large for any of them. I've tried both Python and R, and both fail with what appears to be the same problem at the same line. Here is the R code:

library(XML)
file <- 'xmlfile.xml'
data <- xmlParse(file)

But I get the following errors:

Error: 1: input conversion failed due to input error, bytes 0x81 0xC5 0x70 0x6E
2: input conversion failed due to input error, bytes 0x81 0xC5 0x70 0x6E
3: encoder error4: Premature end of data in tag source line 12088028
5: Premature end of data in tag html line 12088027
6: Premature end of data in tag content line 12088025
7: Premature end of data in tag delivery line 12087957
8: Premature end of data in tag collection-delivery line 2

Is there a way to ignore lines like these? Or maybe there are other ways to convert from XML to XSD?

Answers in either Python or R are perfectly fine.

Parseval
  • Perhaps edit the malformed tags in an editor (see [premature](https://stackoverflow.com/questions/18410029/error-in-xml-file-premature-end-of-data-in-tag)), then process. – Chris Mar 17 '22 at 12:19
  • Try printing those lines to see what the problem is: `sed -n '12088025,12088028 p' file.xml`, and add them to the question if possible. – LMC Mar 17 '22 at 12:50
  • We can't help without seeing the XML (reduced to the smallest sample that illustrates the problem). See duplicate link for advice on how to parse bad "XML." – kjhughes Mar 17 '22 at 13:57
  • Repairing a broken 30 GB XML file is a pretty hopeless task. Find out how it was generated in the first place, and fix the process that generated it to get it right. – Michael Kay Mar 17 '22 at 14:17
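
Following the suggestion in the comments to look at the failing region first, here is a minimal Python sketch that streams the file in chunks and reports where the first undecodable byte sequence sits. It assumes the file is meant to be UTF-8 (the question does not say which encoding is declared), and `find_invalid_bytes` is a hypothetical helper name, not part of any library. It only locates the problem; it does not repair the file.

```python
import codecs


def find_invalid_bytes(path, encoding="utf-8", chunk_size=1 << 20, context=16):
    """Return (byte_offset, line_number, nearby_bytes) for the first byte
    sequence that does not decode under `encoding`, or None if all is fine.

    `encoding` defaults to UTF-8, which is an assumption about the file.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    consumed = 0   # bytes read from the file before the current chunk
    newlines = 0   # newline bytes seen before the current chunk
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            # Bytes of an incomplete multi-byte character carried over
            # from the previous chunk; the decoder prepends them.
            pending = decoder.getstate()[0]
            try:
                decoder.decode(chunk, final=not chunk)
            except UnicodeDecodeError as exc:
                # exc.start indexes into pending + chunk (exc.object).
                offset = consumed - len(pending) + exc.start
                line = newlines + exc.object[:exc.start].count(b"\n") + 1
                lo = max(exc.start - context, 0)
                return offset, line, exc.object[lo:exc.end + context]
            if not chunk:
                return None  # reached end of file without errors
            consumed += len(chunk)
            newlines += chunk.count(b"\n")


if __name__ == "__main__":
    # 'xmlfile.xml' is the file name used in the question.
    print(find_invalid_bytes("xmlfile.xml"))
```

Memory use stays at roughly one chunk regardless of file size, so this should be workable on a 30 GB input. If the reported line matches the one in the parser error, the returned bytes show what the generating process wrote there.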
