0

I'm using XPath to read XML files. The size of each file varies (between 700 KB and 2 MB), and I have to read around 100 files per second, so I want a fast way to load the files and run XPath queries over them.

I tried using Java NIO file channels and memory-mapped files, but they were hard to use with XPath. Can someone suggest a way to do this?
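
A simplified sketch of the kind of thing I was attempting (not my exact code; the file path is a placeholder):

    import java.io.ByteArrayInputStream;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    public class MappedRead {
        public static void main(String[] args) throws Exception {
            try (FileChannel ch = FileChannel.open(Paths.get("/data/xml/sample.xml"),
                                                   StandardOpenOption.READ)) {
                // Map the whole file, then copy it into a heap array, because the
                // DOM parser wants a stream or reader, not a ByteBuffer
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                byte[] bytes = new byte[buf.remaining()];
                buf.get(bytes);

                DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
                Document doc = builder.parse(new InputSource(new ByteArrayInputStream(bytes)));
                System.out.println(doc.getDocumentElement().getNodeName());
            }
        }
    }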

Prasad Weera

3 Answers

1

A lot depends on what the XPath expressions are doing. There are four costs here: basic I/O to read the files, XML parsing, tree building, and XPath evaluation. (Plus a possible fifth, generating the output, but you haven't mentioned what the output might be.) From your description we have no way of knowing which factor is dominant. The first step in performance improvement is always measurement, and my first step would be to measure the contribution of these four factors.
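
As a starting point, here is a minimal sketch of that kind of measurement using the standard javax.xml APIs (the file path and the XPath expression are placeholders for your own):

    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathExpression;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class XPathTiming {
        public static void main(String[] args) throws Exception {
            File file = new File("/data/xml/sample.xml");       // placeholder path
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            XPathExpression expr = XPathFactory.newInstance().newXPath()
                    .compile("//record[@id='42']");             // placeholder expression

            long t0 = System.nanoTime();
            Document doc = builder.parse(file);                 // I/O + parsing + tree building
            long t1 = System.nanoTime();
            NodeList hits = (NodeList) expr.evaluate(doc, XPathConstants.NODESET); // XPath evaluation
            long t2 = System.nanoTime();

            System.out.printf("parse+build: %d ms, evaluate: %d ms, matches: %d%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, hits.getLength());
        }
    }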

If you're in an environment with multiple processors (and who isn't?) then parallel execution would make sense. You may get this "for free" if you can organize the processing using the collection() function in Saxon-EE.
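
A rough sketch of what that looks like with Saxon's s9api (the directory URI and the expression are placeholders; new Processor(true) requests the licensed EE features, while Saxon-HE would use false):

    import net.sf.saxon.s9api.Processor;
    import net.sf.saxon.s9api.XPathCompiler;
    import net.sf.saxon.s9api.XPathSelector;
    import net.sf.saxon.s9api.XdmItem;

    public class CollectionQuery {
        public static void main(String[] args) throws Exception {
            Processor proc = new Processor(true);   // true = licensed (EE) features
            XPathCompiler compiler = proc.newXPathCompiler();
            // collection() with a ?select filter reads every matching file in the directory
            XPathSelector selector = compiler
                    .compile("collection('file:///data/xml?select=*.xml')//order[@status='open']")
                    .load();
            for (XdmItem item : selector) {
                System.out.println(item.getStringValue());
            }
        }
    }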

Michael Kay
  • Could you please tell me the difference between Saxon XPath and the standard javax.xml XPath? – Prasad Weera Aug 22 '12 at 07:43
  • The most significant difference is that Saxon implements XPath 2.0. That has many implications; a simple one is that it provides the collection() function, which reads multiple XML files (e.g. all the files in a directory). But Saxon (especially the commercial version, Saxon-EE) has many other features that could be relevant to your problem, for example streamed processing and parallel execution. – Michael Kay Aug 23 '12 at 20:39
0

If I were you, I would probably drop Java here altogether, not because you can't do it in Java, but because a bash script (if you are on Unix) is likely to be faster; at least that is what my experience dealing with lots of files tells me.

On *nix you have a utility called xpath for exactly this.

Since you are doing lots of I/O operations, a decent SSD would help far more than splitting the work across threads. You still need multiple threads, but no more than one per CPU.
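
A minimal sketch of the one-thread-per-CPU idea (the directory is a placeholder, and processOneFile stands in for whatever parse-and-query routine you end up with):

    import java.io.File;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class BoundedPool {
        public static void main(String[] args) throws InterruptedException {
            int cpus = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cpus); // one worker per CPU

            for (File f : new File("/data/xml").listFiles()) {  // placeholder directory, assumed to exist
                pool.submit(() -> processOneFile(f));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        static void processOneFile(File f) {
            // placeholder: parse f and run your queries here
        }
    }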

Eugene
-1

If you want performance, I would simply drop XPath altogether and use a SAX parser to read the files. You can search Stack Overflow for SAX vs. XPath vs. DOM questions to get more details. Here is one: Is XPath much more efficient as compared to DOM and SAX?
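
For example, here is a minimal SAX handler that collects the text of one element; the element name and file path are placeholders, and a real replacement would depend on what your XPath actually selects:

    import java.io.File;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class SaxExtract {
        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new File("/data/xml/sample.xml"), new DefaultHandler() {
                private boolean inTarget;
                private final StringBuilder text = new StringBuilder();

                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    if ("record".equals(qName)) {   // placeholder element name
                        inTarget = true;
                        text.setLength(0);
                    }
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    if (inTarget) text.append(ch, start, length);
                }

                @Override
                public void endElement(String uri, String local, String qName) {
                    if ("record".equals(qName)) {
                        inTarget = false;
                        System.out.println(text);   // one line per matching element
                    }
                }
            });
        }
    }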

maneesh
  • Does XPath load the whole file into memory before it starts querying? – Prasad Weera Aug 22 '12 at 06:43
  • 1
    Poor answer. We don't know what the XPath is doing, so we don't know (a) how easily the XPath could be rewritten in Java/SAX, or (b) how expensive the XPath evaluation is in relation to the XML parsing. Downvoting. – Michael Kay Aug 22 '12 at 07:07
  • @Michael K: Well, yes, I don't know what he's doing with XPath, as it's not mentioned in the question, so my answer was based on my general experience with XPath. XPath is great when you just need to query a few specific items, but when you need to run a lot of queries, or just read the whole file, it's much slower than, say, SAX. Also, he mentions he needs to read around 100 files per second, so performance is important to him. I know you recommended Saxon-EE (a product from your company) as a possible solution to speed things up, but you failed to mention that it's not free. – maneesh Aug 23 '12 at 06:36
  • @Prasad: Yes, it will load the whole file into memory before running any queries. If you can post a sample XML file and the code you use to read the XML files, I might be able to give you a solid answer with modified code for SAX. – maneesh Aug 23 '12 at 06:57