I'm downloading a file from a 3rd party server, like so:
Try
req = DirectCast(HttpWebRequest.Create("https://www.example.com/my.xml"), HttpWebRequest)
req.Timeout = 100000 '100 seconds
Resp = DirectCast(req.GetResponse(), HttpWebResponse)
reader = New StreamReader(Resp.GetResponseStream)
responseString = reader.ReadToEnd()
Catch ex As Exception
End Try
The file my.xml is 1.2GB and I'm getting the error "Exception of type 'System.OutOfMemoryException' was thrown." When I open Windows Task Manager, I see memory usage is at just 70% of total available memory, and the IIS Worker Process is not growing to use the full system memory. I then found this: https://learn.microsoft.com/en-us/archive/blogs/tom/chat-question-memory-limits-for-32-bit-and-64-bit-processes, so a failure at 70% sounds about right.
So now I'm considering splitting the file into smaller, more manageable chunks. However, how can I do this without creating separate files? Is there a way to load, for example, 100MB into memory at a time (respecting XML node endings), or perhaps to read X number of XML nodes at a time?
When I Google "Read large XML file from webserver without splitting in smaller chunks", I get nothing but file-splitting tools.
UPDATE 1
Based on Lex Li's suggestion I searched and found this tutorial: https://learn.microsoft.com/en-us/dotnet/standard/linq/perform-streaming-transform-large-xml-documents
So I translated the code, which works as per the tutorial:
Private Shared Iterator Function StreamCustomerItem(ByVal uri As String) As IEnumerable(Of XElement)
Using reader As XmlReader = XmlReader.Create(uri)
Dim name As XElement = Nothing
Dim item As XElement = Nothing
reader.MoveToContent()
While reader.Read()
If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "Customer" Then
While reader.Read()
If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "Name" Then
name = TryCast(XElement.ReadFrom(reader), XElement)
Exit While
End If
End While
While reader.Read()
If reader.NodeType = XmlNodeType.EndElement Then Exit While
If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "Item" Then
item = TryCast(XElement.ReadFrom(reader), XElement)
If item IsNot Nothing Then
Dim tempRoot As XElement = New XElement("Root", New XElement(name))
tempRoot.Add(item)
Yield item
End If
End If
End While
End If
End While
End Using
End Function
Private Shared Sub Main()
Dim srcTree As IEnumerable(Of XElement) = From el In StreamCustomerItem("https://www.example.com/source.xml") Select New XElement("Item", New XElement("Customer", CStr(el.Parent.Element("Name"))), New XElement(el.Element("Key")))
Dim xws As XmlWriterSettings = New XmlWriterSettings()
xws.OmitXmlDeclaration = True
xws.Indent = True
Using xw As XmlWriter = XmlWriter.Create(HttpContext.Current.Server.MapPath("files\") + "Output.xml", xws)
xw.WriteStartElement("Root")
For Each el As XElement In srcTree
el.WriteTo(xw)
Next
xw.WriteEndElement()
End Using
End Sub
The example above transforms source.xml into output.xml, but all I want is to read product nodes exactly as-is (no transformation needed), one node at a time, so I can process large XML files.
I tried to rewrite it so that it extracts values matching the structure of my XML. As a first step, I tried just reading something from my XML file, like so:
Private Shared Iterator Function StreamCustomerItem(ByVal uri As String) As IEnumerable(Of XElement)
Using reader As XmlReader = XmlReader.Create(uri)
Dim name As XElement = Nothing
Dim item As XElement = Nothing
reader.MoveToContent()
While reader.Read()
If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "Id" Then
name = TryCast(XElement.ReadFrom(reader), XElement)
item = TryCast(XElement.ReadFrom(reader), XElement)
If item IsNot Nothing Then
Dim tempRoot As XElement = New XElement("Root", New XElement(name))
tempRoot.Add(item)
Yield item
End If
Exit While
End If
End While
End Using
End Function
Private Shared Sub Main()
Dim srcTree As IEnumerable(Of XElement)
srcTree = From el In StreamCustomerItem("https://www.example.com/mysource.xml")
Select New XElement("product", New XElement("product", CStr(el.Parent.Element("Id"))))
Dim xws As XmlWriterSettings = New XmlWriterSettings()
xws.OmitXmlDeclaration = True
xws.Indent = True
Using xw As XmlWriter = XmlWriter.Create(HttpContext.Current.Server.MapPath("files\") + "Output.xml", xws)
xw.WriteStartElement("Root")
For Each el As XElement In srcTree
el.WriteTo(xw)
Next
xw.WriteEndElement()
End Using
End Sub
That just writes <Root /> to my output.xml, though.
mysource.xml
<?xml version="1.0" encoding="UTF-8" ?>
<products>
<product>
<Id>
<![CDATA[122854]]>
</Id>
<Type>
<![CDATA[restaurant]]>
</Type>
<features>
<wifi>
<![CDATA[included]]>
</wifi>
</features>
</product>
</products>
So to summarize my question: how can I read individual product nodes as-is from "mysource.xml" without loading the full file into memory?
UPDATE 2
Private Shared Iterator Function StreamCustomerItem(ByVal uri As String) As IEnumerable(Of XElement)
Using reader As XmlReader = XmlReader.Create(uri)
Dim name As XElement = Nothing
Dim item As XElement = Nothing
reader.MoveToContent()
While Not reader.EOF
If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "product" Then
Dim el As XElement = TryCast(XElement.ReadFrom(reader), XElement)
If el IsNot Nothing Then Yield el
Else
reader.Read()
End If
End While
End Using
End Function
Private Shared Sub Main()
'StreamCustomerItem yields XElement (LINQ to XML), not XmlElement
Dim products As IEnumerable(Of XElement) = StreamCustomerItem("source.xml")
For Each product As XElement In products
'here loop through `product` element
Console.WriteLine(product)
Next
End Sub
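For the loop body above, a minimal sketch of pulling values out of each product element could look like the following. It assumes the structure of mysource.xml shown earlier and reads a small in-memory sample instead of a URL so that it runs on its own. The names ProductDemo, SampleXml and StreamProducts are hypothetical and not part of the code above.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Xml
Imports System.Xml.Linq

Module ProductDemo
    ' Hypothetical sample mirroring mysource.xml; a real run would
    ' create the XmlReader from a URL or a response stream instead.
    Private Const SampleXml As String =
        "<products><product><Id><![CDATA[122854]]></Id>" &
        "<Type><![CDATA[restaurant]]></Type>" &
        "<features><wifi><![CDATA[included]]></wifi></features>" &
        "</product></products>"

    ' Yields one <product> element at a time, so only a single
    ' product subtree is materialized as an XElement at any moment.
    Private Iterator Function StreamProducts(reader As XmlReader) As IEnumerable(Of XElement)
        reader.MoveToContent()
        While Not reader.EOF
            If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "product" Then
                ' ReadFrom consumes the whole element and leaves the reader on
                ' the node after it, so this branch must not call Read() again.
                Dim el = TryCast(XNode.ReadFrom(reader), XElement)
                If el IsNot Nothing Then Yield el
            Else
                reader.Read()
            End If
        End While
    End Function

    Sub Main()
        Using reader As XmlReader = XmlReader.Create(New StringReader(SampleXml))
            For Each product As XElement In StreamProducts(reader)
                ' CStr on an XElement returns its text content (CDATA included).
                ' Trim removes the line breaks around the CDATA sections in the real file.
                Dim id As String = CStr(product.Element("Id")).Trim()
                Dim productType As String = CStr(product.Element("Type")).Trim()
                ' ?. guards against products without a <features> element.
                Dim wifi As String = CStr(product.Element("features")?.Element("wifi"))
                Console.WriteLine($"{id} {productType} {wifi}")
            Next
        End Using
    End Sub
End Module
```

Passing the XmlReader in, rather than a URI, is only a choice made for this sketch: it lets the same iterator be fed from a string, a file, or the stream returned by GetResponseStream.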
My full test file via Onion Share (use TOR browser to download):
http://jkntfybog2s5cc754sn7mujvyaawdqxd4q5imss66x3hsos34rrbjrid.onion Key: YLTDQSDHTBWGDGQ6FIADTN2K7GFOFT5R7SFKWKTDER3WETD7EMKA