
I have a large XML file that I need to import and parse into a tabular structure ("flatten") in SQL Server. By "large" I mean a file of around 450 MB, with 6-7 levels of nesting and a lot of elements, roughly 300.

I tried parsing the file with both OPENXML and the XML .nodes() method. Both are slow. A partial query that reads a parent element and its nested grandchildren takes several minutes, if not dozens of minutes, to run.
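For context, the two variants I tried look roughly like this (table, column, and element names are illustrative placeholders, not my real schema):

```sql
DECLARE @xml XML = N'<Root><Parent><Child><Name>example</Name></Child></Parent></Root>';

-- OPENXML variant: parses the whole document into an in-memory DOM first
DECLARE @hdoc INT;
EXEC sp_xml_preparedocument @hdoc OUTPUT, @xml;

SELECT ChildName
FROM OPENXML(@hdoc, '/Root/Parent/Child', 2)
WITH (ChildName NVARCHAR(100) 'Name');

EXEC sp_xml_removedocument @hdoc;  -- release the DOM memory

-- .nodes() variant: shreds an XML column stored in a table
SELECT c.n.value('(Name/text())[1]', 'nvarchar(100)') AS ChildName
FROM dbo.XmlStaging AS s
CROSS APPLY s.Data.nodes('/Root/Parent/Child') AS c(n);
```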

I also tried the SQLXML Bulk Load method. Unfortunately I couldn't use it, because the file isn't structured properly: there is an element that is logically a parent but isn't physically nested as one.

Do you think the only possible solution left is to use .NET or Java? Is there something I'm missing?

I would strongly prefer a solution that stays dynamic, to some degree. I don't want the SQL Server developers to rely on procedural, compiled code that they have no control over or knowledge of, in case the XML structure changes at some point.

Thank you very much.

Ido Gal
  • are you flattening your XML to a tabular structure? – Santiago Cepas May 27 '14 at 11:48
  • Indeed. Flattening to some degree - depending on the database structure, which has yet to be fully determined. – Ido Gal May 27 '14 at 12:25
    I would really like to know what execution times you had for OPENXML and XML.nodes respectively. – Mikael Eriksson May 27 '14 at 13:09
  • OK. Seems that I had missed something and used .nodes() the wrong way. After fixing the problem, the results are quite interesting: 4:42 minutes for the OPENXML query, 0:47 for a .nodes() query. Wow, quite a difference! Anyway, I built another query which takes one "grandfather" element from the 5th level, a few from the 8th level and a few from the 9th level. It took 1:08 minutes for the 9155 result rows I got. Not so bad, after all. Still - maybe a bit slow for extracting the whole XML? Comments are welcome. – Ido Gal May 27 '14 at 19:04
  • Can you show the query? There might be other things you could try out. For instance, using the parent axis can be a bad thing, also specifying the `text()` node in the values clause is a good thing for performance. – Mikael Eriksson May 27 '14 at 20:42
  • What other method can I use, other than the parent axis, to get the child elements of a parent? I did use the text() function. – Ido Gal May 28 '14 at 20:00
    You can stack `cross apply nodes()`. In your case it looks like the first call to nodes() should shred on the fifth level, the second on the eighth level, and the final cross apply should be on level nine. http://stackoverflow.com/a/22216941/569436 – Mikael Eriksson May 28 '14 at 21:25
  • A question regarding using the parent axis. [cross apply xml query performs exponentially worse as xml document grows](http://stackoverflow.com/a/24199428/569436) – Mikael Eriksson Jun 13 '14 at 07:31
  • I already got to the same conclusion. Thanks. – Ido Gal Jun 15 '14 at 04:42
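The stacked `cross apply nodes()` pattern suggested in the comments can be sketched like this (table, column, and element names are illustrative placeholders; the idea is that each `nodes()` call descends relative to the previous one, avoiding the parent axis entirely):

```sql
SELECT
    gf.n.value('(Name/text())[1]',  'nvarchar(100)') AS GrandfatherName,
    c.n.value('(Name/text())[1]',   'nvarchar(100)') AS ChildName,
    gc.n.value('(Value/text())[1]', 'nvarchar(100)') AS GrandchildValue
FROM dbo.XmlStaging AS s
-- first shred: down to the fifth level
CROSS APPLY s.Data.nodes('/Root/L2/L3/L4/Grandfather') AS gf(n)
-- second shred: relative path from level five down to level eight
CROSS APPLY gf.n.nodes('L6/L7/Child') AS c(n)
-- final shred: level nine, relative to level eight
CROSS APPLY c.n.nodes('Grandchild') AS gc(n);
```

Each row of the result combines values from all three levels without ever navigating upward, which is the pattern the linked answers recommend for performance.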

3 Answers


OK. I created an XML Index on the XML data column. (Just a primary one for now).

A query that took ~4:30 minutes before now takes ~9 seconds! It seems that storing the XML in a table with a proper XML index and parsing the data with the .nodes() function is a feasible solution.
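For reference, the setup was along these lines (table, column, and index names are illustrative):

```sql
-- The table must have a clustered primary key before a primary XML index can be created
CREATE TABLE dbo.XmlStaging
(
    Id   INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    Data XML NOT NULL
);

-- The primary XML index persists a shredded node table alongside the column,
-- so .nodes()/.value() queries no longer re-parse the 450 MB blob on every run
CREATE PRIMARY XML INDEX PXML_XmlStaging_Data
ON dbo.XmlStaging (Data);
```

Secondary XML indexes (PATH, VALUE, PROPERTY) can be added on top of the primary one later if specific query patterns need them.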

Thank you all.

Ido Gal

Since you want a tabular structure, you could convert the XML to a CSV file (using a Java or .NET tool, or even an XSLT transformation) and then perform a bulk insert.

Of course, all of that depends on your XML being well formed.
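Once the data is in CSV form, the load itself is straightforward; a minimal sketch, assuming a target table and file path of your choosing:

```sql
-- Assumes the XML has already been flattened to a delimited file
BULK INSERT dbo.FlattenedTarget
FROM 'C:\data\flattened.csv'
WITH
(
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,   -- skip the header row
    TABLOCK                -- allows minimal logging for large loads
);
```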

Santiago Cepas
  • Thank you Santiago. I would prefer using a solution with better reliability, even building a .NET app. Using csv is asking for trouble. – Ido Gal May 27 '14 at 18:57

Well, first of all, I don't really understand why you would use OpenXml to load the file. I am pretty sure that doing so internally triggers a whole bunch of validity checks against the Open XML ISO standard.

But Xml.Nodes() (I assume that means the DOM way of loading data) is by far the slowest way to load and parse XML data. Consider instead a SAX-style approach using XmlReader or similar. I do realize that the article is from 2004, but it still explains the material pretty well.

Jesper Lund Stocholm
  • OK. So you're saying that in order to get good performance I have to abandon SQL Server's XML engines (and SQL code) and write .NET code? (integrating it afterwards via the CLR, etc...) – Ido Gal May 28 '14 at 19:56
  • well, you did say that you were considering writing a .NET app instead, but maybe I misunderstood :-). But I did learn something new - I had never heard of the OPENXML keyword in SQL Server before your post, so I clearly misunderstood /that/ part. – Jesper Lund Stocholm May 30 '14 at 11:34