I am trying to parse the Stack Overflow data dump. One of the tables is called posts.xml and has around 10 million entries in it. Sample XML:

<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="26" CreationDate="2010-07-07T19:06:25.043" Score="10" ViewCount="1192" Body="&lt;p&gt;Now that the Engineer update has come, there will be lots of Engineers building up everywhere.  How should this best be handled?&lt;/p&gt;&#xA;" OwnerUserId="11" LastEditorUserId="56" LastEditorDisplayName="" LastEditDate="2010-08-27T22:38:43.840" LastActivityDate="2010-08-27T22:38:43.840" Title="In Team Fortress 2, what is a good strategy to deal with lots of engineers turtling on the other team?" Tags="&lt;strategy&gt;&lt;team-fortress-2&gt;&lt;tactics&gt;" AnswerCount="5" CommentCount="7" />
  <row Id="2" PostTypeId="1" AcceptedAnswerId="184" CreationDate="2010-07-07T19:07:58.427" Score="5" ViewCount="469" Body="&lt;p&gt;I know I can create a Warp Gate and teleport to Pylons, but I have no idea how to make Warp Prisms or know if there's any other unit capable of transporting.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;I would in particular like this to built remote bases in 1v1&lt;/p&gt;&#xA;" OwnerUserId="10" LastEditorUserId="68" LastEditorDisplayName="" LastEditDate="2010-07-08T00:16:46.013" LastActivityDate="2010-07-08T00:21:13.163" Title="What protoss unit can transport others?" Tags="&lt;starcraft-2&gt;&lt;how-to&gt;&lt;protoss&gt;" AnswerCount="3" CommentCount="2" />
  <row Id="3" PostTypeId="1" AcceptedAnswerId="56" CreationDate="2010-07-07T19:09:46.317" Score="7" ViewCount="356" Body="&lt;p&gt;Steam won't let me have two instances running with the same user logged in.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Does that mean I cannot run a dedicated server on a PC (for example, for Left 4 Dead 2) &lt;em&gt;and&lt;/em&gt; play from another machine?&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Is there a way to run the dedicated server without running steam? Is there a configuration option I'm missing?&lt;/p&gt;&#xA;" OwnerUserId="14" LastActivityDate="2010-07-07T19:27:04.777" Title="How can I run a dedicated server from steam?" Tags="&lt;steam&gt;&lt;left-4-dead-2&gt;&lt;dedicated-server&gt;&lt;account&gt;" AnswerCount="1" />
  <row Id="4" PostTypeId="1" AcceptedAnswerId="14" CreationDate="2010-07-07T19:11:05.640" Score="10" ViewCount="201" Body="&lt;p&gt;When I get to the insult sword-fighting stage of The Secret of Monkey Island, do I have to learn every single insult and comeback in order to beat the Sword Master?&lt;/p&gt;&#xA;" OwnerUserId="17" LastEditorUserId="17" LastEditorDisplayName="" LastEditDate="2010-07-08T21:25:04.787" LastActivityDate="2010-07-08T21:25:04.787" Title="Do I have to learn all of the insults and comebacks to be able to advance in The Secret of Monkey Island?" Tags="&lt;monkey-island&gt;&lt;adventure&gt;" AnswerCount="3" CommentCount="2" />

I would like to parse this XML but load only certain attributes: Id, PostTypeId, AcceptedAnswerId and two others. Is there a way in SAX to load only these attributes? If so, how? I am pretty new to SAX, so some guidance would help.

Otherwise, loading the whole thing would be slow, and some of the attributes won't be used anyway, so loading them is wasted work.

One other question: would it be possible to jump to a particular row that has a given Id X? If so, how do I do this?

aherlambang
  • This is data from data.stackexchange.com? – Buhake Sindi Apr 15 '11 at 21:02
  • SAX still has to parse the input whether or not you do anything with it. And since the big cost is going to be extracting strings from that input (actually, garbage-collecting those strings), there's not much point in trying to filter the attributes that it actually gives to you. – Anon Apr 15 '11 at 22:21
  • Are you sure you WANT to use SAX? If you just need something more lightweight than DOM, perhaps have a look at StAX (javax.xml.stream), which is as fast as SAX but often a bit simpler to use, since you iterate over content instead of writing event handlers. As to jumping to a particular row: no, neither allows this by default. Typically one uses XPath to locate things this way, but that requires a full in-memory tree (DOM/XOM/JDOM) – StaxMan Apr 16 '11 at 20:01
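
To illustrate the StAX suggestion in the comment above, here is a minimal sketch that iterates over the events, reads two attributes of each row, and stops at the first row whose Id matches. The class name `StaxRowFinder` and the method `findPostTypeId` are hypothetical, and the input is a small in-memory string rather than the real posts.xml. The scan is still sequential; nothing here jumps directly to a row.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
class StaxRowFinder {

    /** Returns the PostTypeId of the first row whose Id equals wantedId, or null if there is none. */
    static String findPostTypeId(String xml, String wantedId) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                // Only start tags named "row" are of interest; every other event is skipped.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "row".equals(reader.getLocalName())
                        && wantedId.equals(reader.getAttributeValue(null, "Id"))) {
                    return reader.getAttributeValue(null, "PostTypeId");
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<posts>"
                + "<row Id=\"1\" PostTypeId=\"1\" AcceptedAnswerId=\"26\" />"
                + "<row Id=\"2\" PostTypeId=\"2\" />"
                + "</posts>";
        System.out.println(findPostTypeId(xml, "2"));
    }
}
```

For the real dump, the reader would presumably be created from a file input stream instead of a `StringReader`.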

4 Answers


The SAX "startElement" event lets you process a single XML element.

In Java code, you must implement this method:

@Override
public void startElement(String uri, String localName,
    String qName, Attributes attributes)
    throws SAXException {

    // qName is compared because localName may be empty
    // when the parser is not namespace-aware (the default)
    if ("row".equals(qName)) {
        // this code is executed for every XML element "row"
        // attribute names are case-sensitive: "Id", not "id"
        String id = attributes.getValue("Id");
        String postTypeId = attributes.getValue("PostTypeId");
        // getValue returns null when the attribute is absent
        String acceptedAnswerId = attributes.getValue("AcceptedAnswerId");
        // read the other two attributes the same way
    }

}

For every element, you can access:

  1. The namespace URI
  2. The XML qualified name (qName)
  3. The XML local name
  4. The attributes, from which you can extract the values you need
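
As a rough illustration of the four items listed above, the following sketch records what `startElement` receives for each start tag. The class `ElementInfoDemo` and the method `describe` are made up for this example. The parser is switched to namespace-aware mode here so that the local name is populated; the sample XML declares no namespace, so the URI is an empty string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of any library.
class ElementInfoDemo {

    /** Describes each start tag as "uri|localName|qName|attributeCount". */
    static List<String> describe(String xml) throws Exception {
        List<String> seen = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Without this, uri and localName may be reported as empty strings.
        factory.setNamespaceAware(true);
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                    String qName, Attributes attributes) {
                seen.add(uri + "|" + localName + "|" + qName + "|" + attributes.getLength());
            }
        };
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return seen;
    }

    public static void main(String[] args) throws Exception {
        for (String line : describe("<posts><row Id=\"1\" PostTypeId=\"1\" /></posts>")) {
            System.out.println(line);
        }
    }
}
```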

See ContentHandler Implementation for specific details.

bye

UPDATED: improved previous snippet.

m.genova

Yes, you can override the handler methods so that they process only the elements you want.
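
A minimal sketch of such a handler, assuming the `row` layout from the question: it extends `DefaultHandler`, overrides only `startElement`, ignores every element that is not a `row`, and keeps just three attributes. The class name `RowHandler`, its `rows` field, and the `parse` helper are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of any library.
class RowHandler extends DefaultHandler {

    /** One entry per row: { Id, PostTypeId, AcceptedAnswerId }; absent attributes are null. */
    final List<String[]> rows = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes attributes) {
        if (!"row".equals(qName)) {
            return; // every other element is ignored
        }
        rows.add(new String[] {
            attributes.getValue("Id"),
            attributes.getValue("PostTypeId"),
            attributes.getValue("AcceptedAnswerId")
        });
    }

    /** Parses the given XML text and returns the collected rows. */
    static List<String[]> parse(String xml) throws Exception {
        RowHandler handler = new RowHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.rows;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<posts>"
                + "<row Id=\"1\" PostTypeId=\"1\" AcceptedAnswerId=\"26\" Score=\"10\" />"
                + "<row Id=\"2\" PostTypeId=\"2\" Score=\"5\" />"
                + "</posts>";
        for (String[] row : parse(xml)) {
            System.out.println(String.join(",", row[0], row[1], String.valueOf(row[2])));
        }
    }
}
```

With around 10 million rows, collecting everything in a list may not be practical; processing each row inside `startElement` (for example, writing it to a database) would avoid holding them all in memory.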

Buhake Sindi
duffymo

It is pretty much the same approach as I've answered here already.

Scroll down to the org.xml.sax Implementation part. You'll only need a custom handler.

Community
Octavian Helm

SAX doesn't "load" elements. It informs your application of the start and end of each element, and it's entirely up to your application to decide which elements it takes any notice of.
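
Because the application decides which events matter, looking for the row with Id X comes down to ignoring everything until that row is reported. The sketch below does this and then aborts the parse by throwing an exception from the handler, a common way to stop a SAX parser early. The names `RowSearch`, `StopParsing`, and `postTypeIdOf` are hypothetical. The input is still read sequentially up to the match, so this is not random access.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
class RowSearch {

    /** Thrown by the handler to abort parsing once the wanted row has been seen. */
    static class StopParsing extends SAXException {
        StopParsing() {
            super("wanted row found");
        }
    }

    /** Returns the PostTypeId of the row with the given Id, or null if there is no such row. */
    static String postTypeIdOf(String xml, String wantedId) throws Exception {
        final String[] result = new String[1];
        final boolean[] found = { false };
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                    String qName, Attributes attributes) throws SAXException {
                if ("row".equals(qName) && wantedId.equals(attributes.getValue("Id"))) {
                    result[0] = attributes.getValue("PostTypeId");
                    found[0] = true;
                    throw new StopParsing(); // skip the rest of the document
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            if (!found[0]) {
                throw e; // a genuine parse error, not our early exit
            }
        }
        return result[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<posts>"
                + "<row Id=\"1\" PostTypeId=\"1\" />"
                + "<row Id=\"2\" PostTypeId=\"2\" />"
                + "<row Id=\"3\" PostTypeId=\"1\" />"
                + "</posts>";
        System.out.println(postTypeIdOf(xml, "2"));
    }
}
```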

Michael Kay