I have a 20 GB bz2-compressed XML file. The format is like this:

<doc id="1" url="https://www.somepage.com" title="some page">
text text text ....
</doc>

I need to convert it to a TSV file in this format:

id<tab>url<tab>title<tab>processed_texts

What is the most efficient way to do this in Python and in Java, and how do the two differ in memory use and speed? Basically, I want to do this:

read bz2 file
read the xml file element by element
for each element
    retrieve id, url, title and text
    print_to_file(id<tab>url<tab>title<tab>process(text))

Thanks for your answers in advance.

UPDATE1 (based on @Andreas's suggestions):

XMLInputFactory factory = XMLInputFactory.newFactory();
XMLStreamReader xmlReader = factory.createXMLStreamReader(in);
xmlReader.nextTag();
if (! xmlReader.getLocalName().equals("doc")) {
    xmlReader.nextTag();
}

String id      = xmlReader.getAttributeValue(null, "id");
String url     = xmlReader.getAttributeValue(null, "url");
String title   = xmlReader.getAttributeValue(null, "title");
String content = xmlReader.getElementText();
out.println(id + '\t' + content);

The problem is that I only get the first element.

UPDATE2 (I ended up doing it using regex):

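if (str.startsWith("<doc")) {
    // Opening tag: pull the attribute values out of the line
    id = str.split("id")[1].substring(2).split("\"")[0];
    url = str.split("url")[1].substring(2).split("\"")[0];
    title = str.split("title")[1].substring(2).split("\"")[0];
} else if (str.startsWith("</doc")) {
    // Closing tag: write the accumulated text and reset the buffer
    out.println(id + '\t' + content);
    content = "";
} else {
    // Body line: append to the current document's text
    content = content + " " + str;
}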
Nick

1 Answer

Note: The answer below works well for parsing very large BZ2-compressed XML documents. However, the OP's XML file is not well-formed because it has no root element, i.e. it is an XML fragment.

The built-in StAX parser does not support XML fragments. The Woodstox XML processor reportedly does, according to this answer: Parsing multiple XML fragments with STaX.
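
If switching parsers is not an option, one possible workaround is to give the fragment a root element before the built-in parser sees it. The sketch below is only an illustration under assumptions: the input has no XML declaration or DOCTYPE, it is UTF-8 (or ASCII-compatible), and each <doc> body is otherwise well-formed XML. `FragmentWrapper`, `withSyntheticRoot`, and `readDocs` are hypothetical names, not part of any library.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class FragmentWrapper {

    /** Surrounds a root-less stream of <doc> elements with a synthetic <root> element. */
    static InputStream withSyntheticRoot(InputStream fragment) {
        InputStream open  = new ByteArrayInputStream("<root>".getBytes(StandardCharsets.UTF_8));
        InputStream close = new ByteArrayInputStream("</root>".getBytes(StandardCharsets.UTF_8));
        // SequenceInputStream reads the three streams back to back, without buffering the middle one.
        return new SequenceInputStream(Collections.enumeration(Arrays.asList(open, fragment, close)));
    }

    /** Reads every <doc> element and returns one tab-separated line per element (list is for demo only). */
    static List<String> readDocs(InputStream in) throws XMLStreamException {
        List<String> lines = new ArrayList<>();
        XMLStreamReader xmlReader = XMLInputFactory.newFactory().createXMLStreamReader(in);
        try {
            xmlReader.nextTag(); // the synthetic <root>, ignored
            while (xmlReader.nextTag() == XMLStreamConstants.START_ELEMENT) {
                String id      = xmlReader.getAttributeValue(null, "id");
                String url     = xmlReader.getAttributeValue(null, "url");
                String title   = xmlReader.getAttributeValue(null, "title");
                String content = xmlReader.getElementText().trim();
                lines.add(id + '\t' + url + '\t' + title + '\t' + content);
            }
        } finally {
            xmlReader.close();
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Two root-less <doc> elements, shaped like the sample in the question.
        String sample =
              "<doc id=\"1\" url=\"https://www.somepage.com\" title=\"some page\">text text text</doc>\n"
            + "<doc id=\"2\" url=\"https://www.somepage.com/2\" title=\"other page\">more text</doc>\n";
        InputStream in = withSyntheticRoot(
                new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)));
        for (String line : readDocs(in)) {
            System.out.println(line);
        }
    }
}
```

For the real 20 GB file, the decompressed stream would be passed to `withSyntheticRoot` in place of the in-memory sample, and each line should be written straight to the output file rather than collected in a list.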


Java Answer

As answered in this question (Uncompress BZIP2 archive), you need Apache Commons Compress™ to read BZ2 files.

You would then use the built-in StAX parser:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStream;
import java.io.PrintWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

File xmlFile = new File("input.xml");   // the BZ2-compressed XML file
File textFile = new File("output.txt");
try (InputStream in = new BZip2CompressorInputStream(new FileInputStream(xmlFile));
     PrintWriter out = new PrintWriter(new FileWriter(textFile))) {

    XMLInputFactory factory = XMLInputFactory.newFactory();
    XMLStreamReader xmlReader = factory.createXMLStreamReader(in);
    try {
        xmlReader.nextTag(); // Read root element, ignore it
        if (xmlReader.getLocalName().equals("doc"))
            throw new IllegalArgumentException("Expected root element, found <doc>");
        // Each iteration consumes one <doc> element; the loop ends at the root's end tag
        while (xmlReader.nextTag() == XMLStreamConstants.START_ELEMENT) {
            if (! xmlReader.getLocalName().equals("doc"))
                throw new IllegalArgumentException("Expected <doc>, found <" + xmlReader.getLocalName() + ">");
            String id      = xmlReader.getAttributeValue(null, "id");
            String url     = xmlReader.getAttributeValue(null, "url");
            String title   = xmlReader.getAttributeValue(null, "title");
            String content = xmlReader.getElementText();
            // process content value
            out.println(id + '\t' + url + '\t' + title + '\t' + content);
        }
    } finally {
        xmlReader.close();
    }
}

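Both the decompressor and the StAX reader stream the data, so this is fast and only one <doc> element's text is held in memory at a time, regardless of file size.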
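
One detail the loop above leaves open: the text of a <doc> spans several lines, and any tab or line break inside a value would split a TSV row. A minimal sketch of a field cleaner follows, assuming that collapsing those characters to a single space is acceptable for whatever consumes the file. `TsvField` and `clean` are hypothetical names introduced only for this illustration.

```java
public class TsvField {

    /** Replaces runs of tabs and line breaks with one space so the value fits in a single TSV cell. */
    static String clean(String value) {
        if (value == null) {
            return ""; // a missing attribute becomes an empty cell
        }
        return value.replaceAll("[\\t\\r\\n]+", " ").trim();
    }

    public static void main(String[] args) {
        String content = "\ntext text\ttext\nmore text\n";
        // All four columns stay on one output line
        System.out.println("1" + '\t' + "https://www.somepage.com" + '\t' + "some page" + '\t' + clean(content));
    }
}
```

In the answer's loop this would be applied to each value just before `out.println(...)`.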

Andreas
  • The solution gives me this error: `Message: found: CHARACTERS, expected START_ELEMENT or END_ELEMENT`. I tried various solutions based on yours, but I can only parse the first doc element. – Nick Oct 14 '15 at 23:57
  • It might even be easier to just process it as text with regex. – Nick Oct 15 '15 at 00:38
  • Oops, was missing `!` on the `if` statement. Also, if your XML is not well-formed, and is missing the root element, you'll get that error. – Andreas Oct 15 '15 at 01:33
  • Also, regex is generally a very bad choice for parsing XML. XML formatting is way too complex for regex. – Andreas Oct 15 '15 at 01:35
  • I have no root element. It throws this error: `Expected root element, found <doc>`. Removing that `if` and its error message gives me this error: `Message: found: CHARACTERS, expected START_ELEMENT or END_ELEMENT`, which is related to not having the root. – Nick Oct 15 '15 at 14:38
  • Remove the line `xmlReader.nextTag(); // Read root element, ignore it` too, and you should be good, though `nextTag()` might fail at the end because the XML is not well-formed. That can be fixed by replacing the convenience method (`nextTag()`) with the underlying loops. – Andreas Oct 15 '15 at 14:54
  • I did that the first try and I got `Message: found: CHARACTERS, expected START_ELEMENT or END_ELEMENT`. I'll show my update code in the question as update. – Nick Oct 15 '15 at 15:01