
I am using the following script to insert a large XML file (3.5 GB) into my local SQL Server table 'files_index' using PowerShell, as the file is over the 2 GB limit for loading it directly in SQL Server.

The table structure is below, followed by the PowerShell script. The file contains around 5,000,000 rows. The script works and does what I need, but the insert takes a long time (currently around 15 minutes), so I am looking for any way to speed it up.

Any suggestions for speeding up the process would be appreciated. I have tried playing with the batch size, but it doesn't seem to make much difference. I got this PowerShell script here on Stack Overflow a while ago and am just trying to streamline the process. Thanks for any assistance or suggestions.

CREATE TABLE [dbo].[files_index]
(
    [Product_ID] [int] NOT NULL,
    [path] [varchar](100) NULL,
    [Updated] [varchar](50) NULL,
    [Quality] [varchar](50) NULL,
    [Supplier_id] [int] NULL,
    [Prod_ID] [varchar](100) NULL,
    [Catid] [int] NULL,
    [On_Market] [int] NULL,
    [Model_Name] [varchar](250) NULL,
    [Product_View] [varchar](250) NULL,
    [HighPic] [varchar](250) NULL,
    [HighPicSize] [int] NULL,
    [HighPicWidth] [int] NULL,
    [HighPicHeight] [int] NULL,
    [Date_Added] [varchar](150) NULL,

    CONSTRAINT [PK_files_index] 
        PRIMARY KEY CLUSTERED ([Product_ID] ASC)
                WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, 
                      IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, 
                      ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

PowerShell script:

Set-ExecutionPolicy Unrestricted -scope LocalMachine

[String]$global:connectionString = "Data Source=Apps2\Apps2;Initial Catalog=DTEDATA;Integrated Security=SSPI";
[System.Data.DataTable]$global:dt = New-Object System.Data.DataTable;
[System.Xml.XmlTextReader]$global:xmlReader = New-Object System.Xml.XmlTextReader("C:\Scripts\icecat\files.index.xml");
[Int32]$global:batchSize = 100000;

# Copies the attributes of the current <file> element from the global
# XmlTextReader into a new row of the global DataTable.
Function Add-FileRow() {
    $newRow = $dt.NewRow();
    $null = $dt.Rows.Add($newRow);

    $newRow["Product_ID"] = $global:xmlReader.GetAttribute("Product_ID");
    $newRow["path"] = $global:xmlReader.GetAttribute("path");
    $newRow["Updated"] = $global:xmlReader.GetAttribute("Updated");
    $newRow["Quality"] = $global:xmlReader.GetAttribute("Quality");
    $newRow["Supplier_id"] = $global:xmlReader.GetAttribute("Supplier_id");
    $newRow["Prod_ID"] = $global:xmlReader.GetAttribute("Prod_ID");
    $newRow["Catid"] = $global:xmlReader.GetAttribute("Catid");
    $newRow["On_Market"] = $global:xmlReader.GetAttribute("On_Market");
    $newRow["Model_Name"] = $global:xmlReader.GetAttribute("Model_Name");
    $newRow["Product_View"] = $global:xmlReader.GetAttribute("Product_View");
    $newRow["HighPic"] = $global:xmlReader.GetAttribute("HighPic");
    $newRow["HighPicSize"] = $global:xmlReader.GetAttribute("HighPicSize");
    $newRow["HighPicWidth"] = $global:xmlReader.GetAttribute("HighPicWidth");
    $newRow["HighPicHeight"] = $global:xmlReader.GetAttribute("HighPicHeight");
    $newRow["Date_Added"] = $global:xmlReader.GetAttribute("Date_Added");
}

# init data table schema
$da = New-Object System.Data.SqlClient.SqlDataAdapter("SELECT * FROM files_index WHERE 0 = 1", $global:connectionString);
$null = $da.Fill($global:dt);

$bcp = New-Object System.Data.SqlClient.SqlBulkCopy($global:connectionString);
$bcp.DestinationTableName = "dbo.files_index";

$recordCount = 0;

while($xmlReader.Read() -eq $true)
{

    if(($xmlReader.NodeType -eq [System.Xml.XmlNodeType]::Element) -and ($xmlReader.Name -eq "file"))
    {
        Add-FileRow; # reads attributes from the global $xmlReader
        $recordCount += 1;
        if(($recordCount % $global:batchSize) -eq 0) 
        {
            $bcp.WriteToServer($dt);
            $dt.Rows.Clear();
            Write-Host "$recordCount file elements processed so far";
        }
    }

}

if($dt.Rows.Count -gt 0)
{
    $bcp.WriteToServer($dt);
}

$bcp.Close();
$xmlReader.Close();

Write-Host "$recordCount file elements imported ";

  • You could define a table variable type that matches the table definition and bulk insert through it; however, I do not know how to do that in PowerShell :( (a rough sketch of that idea follows these comments) – Ross Bush Mar 02 '19 at 05:04
  • Any time you are doing anything large, of course it will take time. If you have long-running processes, that is when you look at using background jobs or parallel processing. – postanote Mar 02 '19 at 05:16
  • Have you seen this article http://www.sqlservercentral.com/articles/powershell/65196/? IMHO the `Method 2: XML/XMLBulkLoad` section is promising. How to create an XML schema for an existing table: https://stackoverflow.com/a/17521467/3350428. – Andrei Odegov Mar 02 '19 at 06:39
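
Regarding Ross Bush's table-valued-parameter comment above, here is a rough, untested sketch of how that could look from PowerShell. The type and procedure names (dbo.files_index_tvp, dbo.usp_insert_files_index) are made up for the example, and the table type would need to mirror every column of files_index; the DataTable the script already fills can be passed straight in as a structured parameter.

# One-time T-SQL setup (column list abbreviated; it must match files_index):
#   CREATE TYPE dbo.files_index_tvp AS TABLE
#       ([Product_ID] int NOT NULL, [path] varchar(100) NULL /* ... remaining columns ... */);
#   CREATE PROCEDURE dbo.usp_insert_files_index @rows dbo.files_index_tvp READONLY
#   AS INSERT INTO dbo.files_index SELECT * FROM @rows;

# PowerShell side: pass the populated DataTable as a structured parameter.
$conn = New-Object System.Data.SqlClient.SqlConnection($global:connectionString);
$conn.Open();
$cmd = $conn.CreateCommand();
$cmd.CommandType = [System.Data.CommandType]::StoredProcedure;
$cmd.CommandText = "dbo.usp_insert_files_index";
$param = $cmd.Parameters.Add("@rows", [System.Data.SqlDbType]::Structured);
$param.TypeName = "dbo.files_index_tvp";
$param.Value = $dt;                      # same DataTable used in the batches above
$null = $cmd.ExecuteNonQuery();
$conn.Close();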

1 Answer


This is close enough to be tagged as a duplicate of the following...

Importing and parsing a large XML file in SQL Server (when “normal” methods are rather slow)

The accepted answer:

OK. I created an XML index on the XML data column (just a primary one for now). A query that took ~4:30 minutes before takes ~9 seconds now! It seems that a table that stores the XML with a proper XML index, combined with parsing the data via the xml.nodes() function, is a feasible solution.
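
For reference, a minimal T-SQL sketch of that approach, assuming the document has first been loaded into a staging table with an xml column. The staging table name and the element path '/ICECAT-interface/files.index/file' are assumptions about the icecat layout, and only a few of the columns are shown.

-- Staging table holding the whole document in an xml column (assumed name).
CREATE TABLE dbo.files_index_staging (id int PRIMARY KEY, doc xml NOT NULL);

-- Primary XML index: pre-shreds the document so nodes()/value() queries
-- do not reparse the blob on every call.
CREATE PRIMARY XML INDEX PXML_files_index_staging
    ON dbo.files_index_staging (doc);

-- Shred the <file> elements into the relational table (abbreviated column list).
INSERT INTO dbo.files_index (Product_ID, [path], Updated, Quality, Supplier_id)
SELECT f.x.value('@Product_ID',  'int'),
       f.x.value('@path',        'varchar(100)'),
       f.x.value('@Updated',     'varchar(50)'),
       f.x.value('@Quality',     'varchar(50)'),
       f.x.value('@Supplier_id', 'int')
FROM dbo.files_index_staging AS s
CROSS APPLY s.doc.nodes('/ICECAT-interface/files.index/file') AS f(x);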

Improve performance of converting a large xml file (~300 MB) to relational table in SQL Server

Accepted Answer:

I had another look at this and can reproduce your issue. Try adding OPTION ( MAXDOP 1 ) to your query. In my test rig with a 300MB file this ran in 1 min 42 seconds. The unhinted version ran for 30 minutes at 100% CPU before I killed it.

You could also have a look at OPENXML. People often say it's faster with large XML files and it appears to be in this case. However, you should be aware of the known issues with OPENXML (eg it can take 1/8th of the buffer pool, it is an old-fashioned COM .dll, you must call sp_xml_removedocument, etc). Once you've researched the pros and cons of OPENXML, you could try something like this:

DECLARE @FileData XML

SELECT @FileData = BulkColumn
FROM OPENROWSET(BULK 'd:\temp\temp.xml', SINGLE_BLOB) AS x

DECLARE @hDoc int

EXEC sp_xml_preparedocument @hDoc OUTPUT, @FileData 

SELECT *
INTO #tmp
FROM OPENXML( @hDoc, '/Data/Entities/Entity/Attributes/Attribute/Values/Value', 1 ) 
WITH
    (
    Id VARCHAR(50) '../../../../@Id',
    Name VARCHAR(100) '../../../../@Name',
    AttributeName VARCHAR(100)  '../../@AttributeName',
    AttributeValue VARCHAR(MAX) '.'
    )

EXEC sp_xml_removedocument @hDoc

Honestly I try and avoid it these days due to those issues; what's the point of one query going faster when you've just lopped 1/8th off your buffer pool? Finally, the quickest and most scalable method (IMHO) is SSIS. This version ran in about 10 seconds for the same file as the above methods in my rig.

SSIS XML Import Method: create a package, add a Data Flow task, then add an XML Source and a destination for each table. I created a 300MB file with the same structure as yours and it loaded in about 10 seconds.

postanote
  • Thanks for the reply. I haven't tried this, but I believe that as the file is over 2 GB it won't load into SQL Server as a single blob; I think I tried that in the beginning (it's the only reason I went to PowerShell, I would rather have kept it all in SQL). – John Spencer Mar 07 '19 at 21:51