0

I have a 5 MB XML flat structure that I want to access its data later. I use XOM Parser in Java to parse the XML and I don't want to loop on the whole Tree every time I want to retrieve data as it takes a while because of the file size.

The XML looks like this

<TypeDesc Type="Person" Id="1" PKey="X0" xml:lang="EN" ShDes="t1" LongDes="test 1"/>
<TypeDesc Type="Person" Id="2" PKey="X1" xml:lang="EN" ShDes="t2" LongDes="test 2"/>
<TypeDesc Type="Person" Id="3" PKey="X3" xml:lang="EN" ShDes="t3" LongDes="test 2"/>
...
<TypeDesc Type="Person" Id="n" PKey="PAYMN" xml:lang="EN" ShDes="PAYMN" LongDes="payment"/>
<TypeDesc Type="Student" Id="1" PKey="X0" xml:lang="EN" ShDes="t1" LongDes="good"/>
<TypeDesc Type="Student" Id="2" PKey="X1" xml:lang="EN" ShDes="t2" LongDes="bad"/>
<TypeDesc Type="Student" Id="3" PKey="X3" xml:lang="EN" ShDes="t3" LongDes="fair"/>
...
<TypeDesc Type="Student" Id="n" PKey="PAYMN" xml:lang="EN" ShDes="PAYMN" LongDes="fair"/>

In my LOGIC I want to retrieve the longDes of the Node if PKEY = SOMESTUFF AND Type = OtherStuff

Looping on the whole thing and retrieving the longDes if other attributes are satisfied is very expensive.

How can I store my Data so that I can access them in O(1) instead of O(n) so that I do loop on the whole XML for one time and access the data structure for later iterations.

mowienay
  • 1,264
  • 4
  • 19
  • 32

2 Answers2

1

You are unlikely to find a constant-time lookup procedure for satisfying this in its current form. Moreover, is constant-time lookup a specific requirement or are you making that up as part of a blinkered viewpoint of your project's status/setup? A.K.A. "the XY problem". The best you're likely to find is an O(n log n) or O(log n) algorithm; see the Big O Cheatsheet

I recommend you review existing frameworks that will enable parsing of this structure:

  1. Xstream
  2. JAXB
  3. XML Beans

If you're happy with XOM, don't bother moving across, but I believe you need to consider the structure of data when you're searching, such as using an index, or store it in an efficient form -- e.g. a Prefix Tree/Trie -- and then serialize that to disk/storage so that re-parsing is quicker though an obvious space/time tradeoff?

On top of this, does your data have to be in XML? Can you convert it to another format? Such as Protocol Buffers, or placing the data in a database (either SQL or NoSQL), though this may be overkill depending on what you're doing?

I'd also ask myself the following questions:

  1. How do I get given this data? Am I losing information that may aid lookup?
  2. Will an efficient search algorithm aid here?
  3. Is this data sorted? Can I sort it efficiently so that subsequent lookups are more efficient?
Alex
  • 8,093
  • 6
  • 49
  • 79
  • It is my view point. I just don't want to loop again and again for every instance of Student \ Person I have to check the value of its attributes. Unfortunately, this is the standard format I have to use. – mowienay Sep 18 '13 at 11:05
  • That's a good viewpoint, don't get me wrong, but the constant-time requirement might be a bit extreme and premature. – Alex Sep 18 '13 at 11:09
  • It is ok, no offence at all. I don't really want the O(1) complexity. I just don't want O(n) on that big file. I already use XOM parser and it is very efficient for my requirements. Do you think it is ok to use one of these frameworks. And how are they going to store my data. As if they store each node in an object that won't solve the problem – mowienay Sep 18 '13 at 11:12
0

I used a hash table to store data. Constructed a hash table for each type. The key of each hash table is concatenation of all attributes I want to check with and the stored value is what I want to retrieve. It is very efficient and close to O(1)

mowienay
  • 1,264
  • 4
  • 19
  • 32
  • Whilst this solution works (but is very ugly, but a solution nonetheless), I think your reasoning is misguided. I suspect you're assuming the lookup cost: see http://stackoverflow.com/a/2771398/27385 – Alex Sep 19 '13 at 08:02