I'm trying to use Java and SAXParser to get information from the WikiData dump file (120 GB, bzipped).
This is the code:
XMLReader parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
parser.setContentHandler(this);
parser.setErrorHandler(this);
parser.setProperty("http://xml.org/sax/properties/lexical-handler", this);
FileInputStream in = new FileInputStream(xin);
BZip2CompressorInputStream bin = new BZip2CompressorInputStream(in);
parser.parse(new InputSource(bin));
At some point, after more than 770,000 WikiData pages have been parsed correctly, I get this error:
[main] ERROR (AbstractWikipediaXmlDumpParser.java:119) - SAXParseException at Q843131 org.xml.sax.SAXParseException; lineNumber: 14861047; columnNumber: 1959; Invalid byte 2 of 4-byte UTF-8 sequence.
This is probably an error in the XML file, but I do not know how to solve it, since it's almost impossible to open a bzipped 120 GB file and fix a single character.
Is there a way to tell SAXParser to ignore errors? Since I know the title of the page that triggers the error (Q843131), I would expect the program to be able to skip that page. Is that possible?
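One workaround I am considering, as a minimal sketch only: instead of handing the parser raw bytes, decode the stream myself with a lenient UTF-8 decoder, so that a malformed sequence becomes U+FFFD rather than a fatal error. This assumes the dump really is UTF-8. The names LenientUtf8Sax, lenientUtf8Reader and parseText are made up for illustration, and the sample uses the JDK's built-in SAX parser and an in-memory byte array in place of Xerces and the bzip2 stream. I have not confirmed that this addresses the actual cause of the error.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of my real parser.
class LenientUtf8Sax {

    /** Wraps a byte stream in a Reader that replaces malformed UTF-8 with U+FFFD. */
    static Reader lenientUtf8Reader(InputStream in) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return new InputStreamReader(in, decoder);
    }

    /** Parses the stream and returns all character data concatenated. */
    static String parseText(InputStream in) throws Exception {
        StringBuilder text = new StringBuilder();
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        // The parser receives characters, so it never sees the raw bad bytes.
        parser.parse(new InputSource(lenientUtf8Reader(in)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // "<a>x", then 0xF0 (start of a 4-byte sequence) followed by a plain 'y', then "</a>"
        byte[] bad = {'<', 'a', '>', 'x', (byte) 0xF0, 'y', '<', '/', 'a', '>'};
        System.out.println(parseText(new ByteArrayInputStream(bad)));
    }
}
```

In my code this would mean calling parser.parse(new InputSource(lenientUtf8Reader(bin))). The affected page would then carry a replacement character instead of aborting the whole run.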
I also searched for a solution on the web, but most of the answers suggest editing the file (impossible, since it is 120 GB compressed) or running a checker (xmlstarlet, for example, considers the XML valid).
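Since xmlstarlet accepts the file, it might be worth checking whether the decompressed bytes are really malformed or whether the problem lies somewhere in my reading pipeline. A rough sketch using only the JDK: scan a stream with a strict UTF-8 decoder and report the byte offset of the first malformed sequence. Utf8Check and firstMalformedOffset are names I made up. In practice the input would be the BZip2CompressorInputStream from my code rather than a byte array.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper, not part of my real parser.
class Utf8Check {

    /** Returns the byte offset of the first malformed UTF-8 sequence, or -1 if there is none. */
    static long firstMalformedOffset(InputStream in) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer bytes = ByteBuffer.allocate(8192);
        CharBuffer chars = CharBuffer.allocate(8192);
        long consumed = 0;   // number of bytes decoded successfully so far
        boolean eof = false;
        while (true) {
            if (!eof) {
                // Fill the free part of the buffer; leftover bytes from the last round stay in front.
                int n = in.read(bytes.array(), bytes.position(), bytes.remaining());
                if (n < 0) {
                    eof = true;
                } else {
                    bytes.position(bytes.position() + n);
                }
            }
            bytes.flip();
            CoderResult result = decoder.decode(bytes, chars, eof);
            consumed += bytes.position();   // position is 0 right after flip()
            if (result.isError()) {
                return consumed;            // the decoder stops at the start of the bad sequence
            }
            chars.clear();                  // decoded text is discarded, only validity matters
            bytes.compact();                // keep an incomplete trailing sequence for the next round
            if (eof && result.isUnderflow()) {
                return -1;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bad = {'a', 'b', (byte) 0xF0, 'y'};
        System.out.println(firstMalformedOffset(new ByteArrayInputStream(bad)));
    }
}
```

If this returns -1 for the whole decompressed dump, the bytes themselves are fine and the error would have to come from how they are read or buffered, not from the XML file.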