28

What is the best way to shred XML data into various database columns? So far I have mainly been using the nodes and value functions like so:

INSERT INTO some_table (column1, column2, column3)
SELECT
Rows.n.value('(@column1)[1]', 'varchar(20)'),
Rows.n.value('(@column2)[1]', 'nvarchar(100)'),
Rows.n.value('(@column3)[1]', 'int')
FROM @xml.nodes('//Rows') Rows(n)

However, I find that this gets very slow for even moderately sized XML data.

Eddie Groves

8 Answers

57

I stumbled across this question while having a very similar problem. I'd been running a query processing a 7.5MB XML file (approximately 10,000 nodes) for around 3.5 to 4 hours before finally giving up.

However, after a little more research I found that, having typed the XML using a schema and created an XML index (I'd bulk inserted the documents into a table), the same query completed in ~ 0.04ms.

How's that for a performance improvement!

Code to create a schema:

IF EXISTS ( SELECT * FROM sys.xml_schema_collections where [name] = 'MyXmlSchema')
DROP XML SCHEMA COLLECTION [MyXmlSchema]
GO

DECLARE @MySchema XML
SET @MySchema = 
(
    SELECT * FROM OPENROWSET
    (
        BULK 'C:\Path\To\Schema\MySchema.xsd', SINGLE_CLOB 
    ) AS xmlData
)

CREATE XML SCHEMA COLLECTION [MyXmlSchema] AS @MySchema 
GO

Code to create the table with a typed XML column:

CREATE TABLE [dbo].[XmlFiles] (
    [Id] [uniqueidentifier] NOT NULL,

    -- Data from CV element 
    [Data] xml(CONTENT dbo.[MyXmlSchema]) NOT NULL,

-- Note: a primary XML index requires a clustered primary key on the base table.
CONSTRAINT [PK_XmlFiles] PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

Code to create the index:

CREATE PRIMARY XML INDEX PXML_Data
ON [dbo].[XmlFiles] (Data)
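
Code to load an XML document into the table (a sketch, not taken from the original answer; the file path is a placeholder, and OPENROWSET with SINGLE_BLOB is just one way of doing the bulk insert mentioned above):

-- Load one XML file per row; the typed column validates it against MyXmlSchema on insert.
INSERT INTO [dbo].[XmlFiles] ([Id], [Data])
SELECT NEWID(), CONVERT(xml, BulkColumn)
FROM OPENROWSET
(
    BULK 'C:\Path\To\Data\MyDocument.xml', SINGLE_BLOB
) AS xmlData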

There are a few things to bear in mind though. SQL Server's implementation of XML schemas doesn't support xsd:include. This means that if you have a schema which references other schemas, you'll have to copy all of them into a single schema and add that.

Also I would get an error:

XQuery [dbo.XmlFiles.Data.value()]: Cannot implicitly atomize or apply 'fn:data()' to complex content elements, found type 'xs:anyType' within inferred type 'element({http://www.mynamespace.fake/schemas}:SequenceNumber,xs:anyType) ?'.

if I tried to navigate above the node I had selected with the nodes function. E.g.

SELECT
    C.value('CVElementId[1]', 'INT') AS [CVElementId]
    ,C.value('../SequenceNumber[1]', 'INT') AS [Level]
FROM 
    [dbo].[XmlFiles]
CROSS APPLY
    [Data].nodes('/CVSet/Level/CVElement') AS T(C)

I found that the best way to handle this was to use OUTER APPLY to, in effect, perform an "outer join" on the XML.

SELECT
    C.value('CVElementId[1]', 'INT') AS [CVElementId]
    ,B.value('SequenceNumber[1]', 'INT') AS [Level]
FROM 
    [dbo].[XmlFiles]
CROSS APPLY
    [Data].nodes('/CVSet/Level') AS T(B)
OUTER APPLY
    B.nodes ('CVElement') AS S(C)

Hope that helps someone, as that's pretty much been my day.

Luke Girvin
Dan
  • Great answer Dan, could you elaborate on how you map the xml nodes to columns in a table? – longhairedsi Nov 08 '11 at 16:59
  • Hey longhairedsi, I'm not sure that I've properly understood your question. Do you mean inserting specific parts of an XML document into table columns or as I have above inserting the entire document into a single column and then using a select statement and the .nodes and .value functions (and the xpath syntax) to get the data out in a tabular format? – Dan Nov 09 '11 at 14:02
  • I think I may have understood your answer :) I thought you were adding the xml to a column as a temporary location before shredding to existing table columns (e.g. INSERT INTO xxx FROM XmlFiles.Data) or such like. I was looking for the most performant way to shred xml to existing columns; I think with a little work your solution would do that job nicely. – longhairedsi Nov 09 '11 at 17:37
  • Sadly if you're using an Azure database, they do not seem to support XML indexes. (http://msdn.microsoft.com/en-us/library/windowsazure/ff394115.aspx 'Typed XML and XML indexing is not supported.') – Kyle Feb 15 '13 at 19:26
  • Azure does now support XML indexes. I believe it was added in late 2015 or early 2016 but 'when' doesn't matter much. I really just didn't want the 'last word' here to be that it doesn't! – bielawski Nov 17 '17 at 13:09
  • I'm a little late to the party here, but this interests me Dan. I'm a little naïve and a little confused. I don't see where you're loading the XML data. I see you loading the XSD file. I'm assuming it's either implicit in what you've written above or you're taking the actual data file as an XSD and loading it as a schema (if that's a thing). What am I missing here? I apologize for my naïveté... – Dan Oct 19 '18 at 15:13
  • Dan, it’s been a while since I messed with SQL Server, but from memory you just insert the XML into the column you have created as text - one document per table row. The schema definition defines what XML is allowed in that field (the insert will fail if the XML doesn’t validate according to the schema). The schema also helps SQL Server build the index and allows you to rapidly break down the contents of the XML vs an “untyped” column. Does that make more sense? – Dan Oct 21 '18 at 20:20
  • I think so. So you haven't listed an insert statement for the XML up there, but it's what you are doing somewhere else, and instead of just inserting the document as a plain xml object, the schema makes it faster and more efficient to load it, and also to query it and start moving values from the record into new tables, etc.? – Dan Oct 22 '18 at 12:18
5

In my case I'm running SQL Server 2005 SP2 (9.0).

The only thing that helped was adding OPTION ( OPTIMIZE FOR ( @your_xml_var = NULL ) ). The explanation is at the link below.

Example:

INSERT INTO @tbl (Tbl_ID, Name, Value, ParamData)
SELECT     1,
    tbl.cols.value('name[1]', 'nvarchar(255)'),
    tbl.cols.value('value[1]', 'nvarchar(255)'),
    tbl.cols.query('./paramdata[1]')
FROM @xml.nodes('//root') as tbl(cols) OPTION ( OPTIMIZE FOR ( @xml = NULL ) )

https://connect.microsoft.com/SQLServer/feedback/details/562092/an-insert-statement-using-xml-nodes-is-very-very-very-slow-in-sql2008-sp1

jccprj
3

I'm not sure what the best method is. I used the OPENXML construction:

DECLARE @DocHandle int
EXEC sp_xml_preparedocument @DocHandle OUTPUT, @XmlDocument

INSERT INTO Test
SELECT Id, Data 
FROM OPENXML (@DocHandle, '/Root/blah', 2)
WITH (Id   int         '@ID',
      Data varchar(10) '@DATA')

EXEC sp_xml_removedocument @DocHandle

To speed it up, you can create XML indexes. A secondary index can be targeted specifically at value() performance. You can also use typed xml columns, which perform better.
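
For example, on an xml column you could add a primary XML index and then a secondary index aimed at value() lookups (a sketch only; the table and column names below are placeholders, not from this answer):

-- Placeholder names: dbo.SomeTable has an xml column called XmlData.
CREATE PRIMARY XML INDEX PXML_SomeTable_XmlData
ON dbo.SomeTable (XmlData);

-- Secondary XML index optimized for value()-style lookups.
CREATE XML INDEX SXML_SomeTable_XmlData_Value
ON dbo.SomeTable (XmlData)
USING XML INDEX PXML_SomeTable_XmlData
FOR VALUE;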

Espo
aku
3

We had a similar issue here. Our DBA (SP, you the man) took a look at my code, made a little tweak to the syntax, and we got the speed we had been expecting. It was unusual because my select from XML was plenty fast, but the insert was way slow. So try this syntax instead:

INSERT INTO some_table (column1, column2, column3)
    SELECT 
        Rows.n.value(N'(@column1/text())[1]', 'varchar(20)'), 
        Rows.n.value(N'(@column2/text())[1]', 'nvarchar(100)'), 
        Rows.n.value(N'(@column3/text())[1]', 'int')
    FROM @xml.nodes('//Rows') Rows(n) 

So specifying text() really seems to make a difference in performance. It took our insert of 2K rows from 'I must have written that wrong - let me stop it' to about 3 seconds, which was 2x faster than the raw insert statements we had been running through the connection.

edhubbell
2

I wouldn't claim this is the "best" solution, but I've written a generic SQL CLR procedure for this exact purpose - it takes a "tabular" Xml structure (such as that returned by FOR XML RAW) and outputs a resultset.

It does not require any customization / knowledge of the structure of the "table" in the Xml, and turns out to be extremely fast / efficient (although this wasn't a design goal). I just shredded a 25MB (untyped) xml variable in under 20 seconds, returning 25,000 rows of a pretty wide table.

Hope this helps someone: http://architectshack.com/ClrXmlShredder.ashx

Tao
1

This isn't an answer, more an addition to this question - I have just come across the same problem and I can give figures as edg asks for in the comment.

My test has xml which results in 244 records being inserted - so 244 nodes.

The code that I am rewriting takes on average 0.4 seconds to run (10 test runs, spread from .344 secs to .56 secs). Performance is not the main reason the code is being rewritten, but the new code needs to perform as well or better. The old code loops over the xml nodes, calling a stored procedure to insert once per loop.

The new code is pretty much just a single sp; pass the xml in; shred it.

Tests with the new code switched in show the new sp takes on average 3.7 seconds - almost 10 times slower.

My query is in the form posted in this question:

INSERT INTO some_table (column1, column2, column3)
SELECT
Rows.n.value('(@column1)[1]', 'varchar(20)'),
Rows.n.value('(@column2)[1]', 'nvarchar(100)'),
Rows.n.value('(@column3)[1]', 'int')
FROM @xml.nodes('//Rows') Rows(n)

The execution plan appears to show that for each column, SQL Server is doing a separate "Table Valued Function [XMLReader]" returning all 244 rows, then joining everything back up with Nested Loops (Inner Join). So in my case, where I am shredding from / inserting into about 30 columns, this appears to happen 30 separate times.

I am going to have to dump this code; I don't think any optimisation is going to get over this method being inherently slow. I am going to try the sp_xml_preparedocument/OPENXML method and see if the performance is better. If anyone comes across this question from a web search (as I did), I would highly advise you to do some performance testing before using this type of shredding in SQL Server.
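
For reference, a minimal, untested sketch of the sp_xml_preparedocument/OPENXML route, reusing the hypothetical Rows/@column shape from the query above:

DECLARE @docHandle int
EXEC sp_xml_preparedocument @docHandle OUTPUT, @xml

-- Flag 1 = attribute-centric mapping
INSERT INTO some_table (column1, column2, column3)
SELECT column1, column2, column3
FROM OPENXML (@docHandle, '//Rows', 1)
WITH (column1 varchar(20)   '@column1',
      column2 nvarchar(100) '@column2',
      column3 int           '@column3')

EXEC sp_xml_removedocument @docHandle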

DannykPowell
  • Interesting information here, but it's buried. If you're still about, post this as a new question (and answer it yourself if you found a good solution or identified the issue :-) – Jan 12 '11 at 05:41
  • This itself isn't an answer but instead an affirmation of the original question. Please post your own question and instead link to your question through a comment to the original poster's question. – jpierson Feb 18 '11 at 06:00
  • @pst Hi, yes still about. Thanks, this is as far as I needed to go with this so I've no need to repost it. – DannykPowell Feb 21 '11 at 09:41
  • @jpierson I posted this as an answer to help anyone passing this way with some more detailed information - this is the answer: "I don't think any optimisation is going to get over this method being inherently slow". I couldn't have posted this much info in a comment. This was work I was doing 2 years ago; I'm not going to be reposting it, sorry – DannykPowell Feb 21 '11 at 09:44
0

There is an XML Bulk load COM object (.NET Example)

From MSDN:

You can insert XML data into a SQL Server database by using an INSERT statement and the OPENXML function; however, the Bulk Load utility provides better performance when you need to insert large amounts of XML data.

si618
  • I've been down this road, and I would not suggest it. Our biggest complaint was that XML Bulk Load did not play nice within transactional contexts. We spent too much time trying to get this to work, and in the end, it's the COM part of this that just isn't worth it. – Didaxis Feb 17 '12 at 14:39
0

My current solution for large XML sets (> 500 nodes) is to use SQL Bulk Copy (System.Data.SqlClient.SqlBulkCopy), using a DataSet to load the XML into memory and then passing the table to SqlBulkCopy (defining an XML schema helps).

Obviously there are pitfalls, such as needlessly using a DataSet and loading the whole document into memory first. I would like to go further in the future and implement my own IDataReader to bypass the DataSet method, but currently the DataSet is "good enough" for the job.

Basically I never found a solution to my original question regarding the slow performance for that type of XML shredding. It could be slow due to the typed xml queries being inherently slow or something to do with transactions and the SQL Server log. I guess the typed xml functions were never designed for operating on non-trivial node sizes.

XML Bulk Load: I tried this and it was fast, but I had trouble getting the COM DLL to work in 64-bit environments, and I generally try to avoid COM DLLs that no longer appear to be supported.

sp_xml_preparedocument/OPENXML: I never went down this road so would be interested to see how it performs.

Eddie Groves
  • I think the most likely cause of the original problem was the lack of indexes. If, instead of querying that xml fragment directly, you put it in an xml typed column in a temp table, defined an index on that column and THEN queried it, it may make a big difference. – Amit Naidu Jun 11 '13 at 03:19