1

I have a XML file of size 35 GB. I have tried using SQL Server xquery and python lxml to parse this file. SQL have 2 GB limit on xml data type and python throws Memory Error.

So, I decide to use Vb.Net to parse the file using the code.

I used DataSet to read xml file to avoid complex xpath query. But, this is also throwing out of memory error.

Try

        Dim xmlFile As XmlReader
        xmlFile = XmlReader.Create("D:\wcproduction.xml", New XmlReaderSettings())
        Dim ds As New DataSet
        ds.ReadXml(xmlFile)
        Dim i As Integer
        For i = 0 To ds.Tables(0).Rows.Count - 1
            MsgBox(ds.Tables(0).Rows(i).Item(0).ToString)
        Next
    Catch ex As Exception
        MsgBox(ex.Message)
    End Try

This is sample xml data from the actual file.

<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<xsd:schema targetNamespace="urn:schemas-microsoft-com:sql:SqlRowSet1" xmlns:schema="urn:schemas-microsoft-com:sql:SqlRowSet1" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:sqltypes="http://schemas.microsoft.com/sqlserver/2004/sqltypes" elementFormDefault="qualified">
    <xsd:import namespace="http://schemas.microsoft.com/sqlserver/2004/sqltypes" schemaLocation="http://schemas.microsoft.com/sqlserver/2004/sqltypes/sqltypes.xsd"/>
    <xsd:element name="wcproduction">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element name="api_st_cde" type="sqltypes:smallint" nillable="1"/>
                <xsd:element name="api_cnty_cde" type="sqltypes:smallint" nillable="1"/>
                <xsd:element name="api_well_idn" type="sqltypes:int" nillable="1"/>
                <xsd:element name="pool_idn" type="sqltypes:int" nillable="1"/>
                <xsd:element name="prodn_mth" type="sqltypes:smallint" nillable="1"/>
                <xsd:element name="prodn_yr" type="sqltypes:int" nillable="1"/>
                <xsd:element name="ogrid_cde" type="sqltypes:int" nillable="1"/>
                <xsd:element name="prd_knd_cde" nillable="1">
                    <xsd:simpleType>
                        <xsd:restriction base="sqltypes:char" sqltypes:localeId="1033" sqltypes:sqlCompareOptions="IgnoreCase IgnoreKanaType IgnoreWidth" sqltypes:sqlSortId="52">
                            <xsd:maxLength value="2"/>
                        </xsd:restriction>
                    </xsd:simpleType>
                </xsd:element>
                <xsd:element name="eff_dte" type="sqltypes:datetime" nillable="1"/>
                <xsd:element name="amend_ind" nillable="1">
                    <xsd:simpleType>
                        <xsd:restriction base="sqltypes:char" sqltypes:localeId="1033" sqltypes:sqlCompareOptions="IgnoreCase IgnoreKanaType IgnoreWidth" sqltypes:sqlSortId="52">
                            <xsd:maxLength value="1"/>
                        </xsd:restriction>
                    </xsd:simpleType>
                </xsd:element>
                <xsd:element name="c115_wc_stat_cde" nillable="1">
                    <xsd:simpleType>
                        <xsd:restriction base="sqltypes:char" sqltypes:localeId="1033" sqltypes:sqlCompareOptions="IgnoreCase IgnoreKanaType IgnoreWidth" sqltypes:sqlSortId="52">
                            <xsd:maxLength value="1"/>
                        </xsd:restriction>
                    </xsd:simpleType>
                </xsd:element>
                <xsd:element name="prod_amt" type="sqltypes:int" nillable="1"/>
                <xsd:element name="prodn_day_num" type="sqltypes:smallint" nillable="1"/>
                <xsd:element name="mod_dte" type="sqltypes:datetime" nillable="1"/>
            </xsd:sequence>
        </xsd:complexType>
    </xsd:element>
</xsd:schema>
<wcproduction xmlns="urn:schemas-microsoft-com:sql:SqlRowSet1">
    <api_st_cde>30</api_st_cde>
    <api_cnty_cde>5</api_cnty_cde>
    <api_well_idn>20178</api_well_idn>
    <pool_idn>10540</pool_idn>
    <prodn_mth>7</prodn_mth>
    <prodn_yr>1973</prodn_yr>
    <ogrid_cde>12437</ogrid_cde>
    <prd_knd_cde>G </prd_knd_cde>
    <eff_dte>1973-07-31T00:00:00</eff_dte>
    <amend_ind>N</amend_ind>
    <c115_wc_stat_cde>F</c115_wc_stat_cde>
    <prod_amt>53612</prod_amt>
    <prodn_day_num>99</prodn_day_num>
    <mod_dte>2015-04-07T07:31:00.173</mod_dte>
</wcproduction>
</root>

I need a solution that can read data from XML file of size 35 GB or more and transfer data to SQL Server database.

Answer: Since dataset object uses memory, it will bottleneck. So, try this solution Reading large XML file using XMLReader in VB.net

user10987050
  • 75
  • 10
  • Could you write an intermediate piece that breaks the xml file into more manageable (but still valid) xml chunks of smaller sizes? – Harvtronix Feb 01 '19 at 02:47
  • That will make the process cumbersome. Because, I have 5 files like this to process. I just want to load the data to sql server. – user10987050 Feb 01 '19 at 02:59
  • I get that, but you're exceeding the physical limitations of your computer. Likely the xml parsers are trying to parse and store the entire file in memory, or at least large parts of it, and are exceeding the allocated memory for the process, if not the system as a whole. The only other option might be to find an xml parser that is designed to work with files of that magnitude. I don't know of any off-hand though. – Harvtronix Feb 01 '19 at 03:06
  • Appreciate your intention to help. I am looking for a solution that use little memory and do more file operation. I just start reading this XMLReader async option. [link](https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmlreader?redirectedfrom=MSDN&view=netframework-4.7.2#xmlreader_create) – user10987050 Feb 01 '19 at 03:36
  • Have you looked at using SSIS? SSIS is designed to handle/load LARGE data files and a quick Google shows it can handle Xml files as well https://www.google.com/search?q=ssis+import+xml&gws_rd=ssl – JayV Feb 01 '19 at 08:29
  • XML source cannot load the inline schema. it will be helpful, if you could help me fix this SSIS issue. – user10987050 Feb 01 '19 at 15:39
  • Answer is added at the end of this question. – user10987050 Feb 07 '19 at 14:41

0 Answers0