
I need to process an XML file with the following structure:

<FolderSizes>
    <Version></Version>
    <DateTime Un=""></DateTime>
    <Summary>
        <TotalSize Bytes=""></TotalSize>
        <TotalAllocated Bytes=""></TotalAllocated>
        <TotalAvgFileSize Bytes=""></TotalAvgFileSize>
        <TotalFolders Un=""></TotalFolders>
        <TotalFiles Un=""></TotalFiles>
    </Summary>
    <DiskSpaceInfo>
        <Drive Type="" Total="" TotalBytes="" Free="" FreeBytes="" Used=""
               UsedBytes=""><![CDATA[ ]]></Drive>
    </DiskSpaceInfo>
    <Folder ScanState="">
        <FullPath Name=""><![CDATA[ ]]></FullPath>
        <Attribs Int=""></Attribs>
        <Size Bytes=""></Size>
        <Allocated Bytes=""></Allocated>
        <AvgFileSz Bytes=""></AvgFileSz>
        <Folders Un=""></Folders>
        <Files Un=""></Files>
        <Depth Un=""></Depth>
        <Created Un=""></Created>
        <Accessed Un=""></Accessed>
        <LastMod Un=""></LastMod>
        <CreatedCalc Un=""></CreatedCalc>
        <AccessedCalc Un=""></AccessedCalc>
        <LastModCalc Un=""></LastModCalc>
        <Perc><![CDATA[ ]]></Perc>
        <Owner><![CDATA[ ]]></Owner>

        <!-- Special element; see paragraph below -->
        <Folder></Folder>
    </Folder>
</FolderSizes>

The <Folder> element is special in that it repeats within the <FolderSizes> element, but it can also nest within itself; I reckon up to about 5 levels.
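
Since <Folder> nests within itself, the relational shape I have in mind is one row per folder with a self-referencing parent key, something like this (the class and property names are just placeholders for illustration):

    // Hypothetical flat representation of one <Folder> element.
    // ParentId points at the Id of the enclosing <Folder> (null for a
    // top-level folder), which turns the nesting into a self-referencing key.
    public class FolderRow
    {
        public long Id { get; set; }          // surrogate key assigned during import
        public long? ParentId { get; set; }   // Id of the enclosing <Folder>, if any
        public string FullPath { get; set; }
        public long SizeBytes { get; set; }
        public long AllocatedBytes { get; set; }
        public int Files { get; set; }
        public int Folders { get; set; }
        public int Depth { get; set; }
    }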

The problem is that the file is really big, a whopping 11GB, so I'm having difficulty processing it. I have experience with XML documents, but nothing on this scale.

What I would like to do is import the information into a SQL database, because then I can query and process it in any way necessary without having to deal with this immense, impractical file.
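
If I can produce rows one at a time, my understanding is that something like SqlBulkCopy would be the way to get millions of them into SQL Server without holding everything in memory. A rough sketch of what I mean (the staging table name dbo.FolderStage and the column list are made up, and FolderRow is the placeholder type above):

    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;

    // Bulk-load folder rows into a staging table in batches so memory use
    // stays flat no matter how big the source file is.
    static void BulkInsert(IEnumerable<FolderRow> rows, string connectionString)
    {
        var table = new DataTable();
        table.Columns.Add("Id", typeof(long));
        table.Columns.Add("ParentId", typeof(long));
        table.Columns.Add("FullPath", typeof(string));
        table.Columns.Add("SizeBytes", typeof(long));

        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.FolderStage"; // hypothetical staging table

            foreach (var row in rows)
            {
                table.Rows.Add(row.Id, (object)row.ParentId ?? DBNull.Value,
                               row.FullPath, row.SizeBytes);

                if (table.Rows.Count == 10000) // flush in batches
                {
                    bulk.WriteToServer(table);
                    table.Clear();
                }
            }

            if (table.Rows.Count > 0)
                bulk.WriteToServer(table);
        }
    }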

Here are the things I have tried:

  • Simply load the file and attempt to process it with a simple C# program using an XmlDocument or XDocument object
    • Before I even started I knew this would not work, as I'm sure everyone would agree, but I tried it anyway and ran the application on a VM (since my notebook only has 4GB RAM) with 30GB of memory. The application ended up using 24GB of memory and taking a very, very long time, so I just cancelled it.
  • Attempt to process the file using an XmlReader object
    • This approach worked better in that it didn't use as much memory, but I still had a few problems:
      • It was taking a really long time because I was reading the file one line at a time.
      • Processing the file one line at a time makes it difficult to really work with the data contained in the XML, because now you have to detect the start of a tag, then (hopefully) the end of that tag, create a document from that information, read the info, and attempt to determine which parent tag it belongs to because we have multiple levels... That sounds prone to problems and errors (see the sketch after this list for reading node by node instead).
      • Did I mention that it takes a really long time just reading the file one line at a time? That's still without actually processing each line, literally just reading it.
  • Import the information using SQL Server
    • I created a stored procedure using XQuery that calls itself recursively to process the <Folder> elements. This went quite well, better than the other two approaches I think, until one of the <Folder> elements ended up being rather big, producing an "An XML operation resulted an XML data type exceeding 2GB in size. Operation aborted." error. I read up about it and I don't think it's an adjustable limit.
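
For reference, here is the node-by-node idea I mentioned in the second bullet above: let XmlReader advance element by element and keep a small stack of folder IDs so each folder knows its parent. This is only a rough, untested sketch; the element and attribute names come from the structure at the top, FolderRow is the placeholder type from earlier, and I'm assuming the path lives in the Name attribute of <FullPath>:

    using System;
    using System.Collections.Generic;
    using System.Xml;

    // Stream the file node by node, yielding one FolderRow per <Folder> element.
    // The stack tracks which enclosing <Folder> the current element belongs to,
    // so only the current chain of open folders is ever held in memory.
    static IEnumerable<FolderRow> ReadFolders(string path)
    {
        long nextId = 0;
        var openFolders = new Stack<FolderRow>();

        using (var reader = XmlReader.Create(path))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Folder")
                {
                    var row = new FolderRow
                    {
                        Id = ++nextId,
                        ParentId = openFolders.Count > 0 ? openFolders.Peek().Id : (long?)null
                    };

                    if (reader.IsEmptyElement)
                        yield return row;       // <Folder/> with no children
                    else
                        openFolders.Push(row);
                }
                else if (reader.NodeType == XmlNodeType.Element && openFolders.Count > 0)
                {
                    var current = openFolders.Peek();
                    switch (reader.Name)
                    {
                        case "FullPath":
                            // assuming the path is in the Name attribute
                            current.FullPath = reader.GetAttribute("Name");
                            break;
                        case "Size":
                            long size;
                            if (long.TryParse(reader.GetAttribute("Bytes"), out size))
                                current.SizeBytes = size;
                            break;
                        // ...other child elements as needed
                    }
                }
                else if (reader.NodeType == XmlNodeType.EndElement && reader.Name == "Folder")
                {
                    yield return openFolders.Pop(); // folder is complete at its end tag
                }
            }
        }
    }

Note that parent folders come out after their children here, so the target table shouldn't enforce a ParentId foreign key during the load.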

Here are more things I think I should try:

  • Re-write my C# application to use unmanaged code
    • I don't have much experience with unmanaged code, so I'm not sure how well it will work and how to make it as unmanaged as possible.
    • I once wrote a little application that works with my webcam, receiving the image, inverting the colours, and painting it to a panel. Using normal managed code didn't work; the result was about 2 frames per second. Re-writing the colour inversion method to use unmanaged code solved the problem. That's why I thought that unmanaged code might be a solution.
  • Rather go for C++ instead of C#
    • Not sure if this is really a solution. Would it necessarily be better than C#? Better than unmanaged C#?
    • The problem here is that I haven't actually worked with C++ before, so I'll need to get to know a few things about it before I can really start working with it, and even then probably not very efficiently.

I thought I'd ask for some advice before I go any further, possibly wasting my time.

Thanks in advance for your time and assistance.

EDIT

So before I start processing the file, I run through it and count the lines in an attempt to give the user feedback on how long the processing might take; I made a screenshot of the calculation:

18 minutes in; 1.67mil lines

That's about 1,500 lines per second; if the average line length is about 50 characters, that's 50 bytes per line, which is about 75 kilobytes per second, so an 11GB file should take about 40 hours, if my maths is correct. But this is only stepping through each line; it's not actually processing the line or doing anything with it, so when that starts, the processing rate will drop significantly.
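
Just to show my working:

    // 1,670,000 lines / (18 * 60) s   ≈ 1,546 lines per second
    // 1,546 lines/s * ~50 bytes/line  ≈ 75 KB per second
    // 11 GB / 75 KB per second        ≈ 143,000 seconds ≈ 40 hours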

This is the method that runs during the size calculation:

    private int _totalLines = 0;
    private bool _cancel = false; // set to true when the cancel button is clicked

    private void CalculateFileSize()
    {
        // Walk the file once, node by node, just to find the total line count
        // so that progress can be reported during the real processing run.
        using (var xmlStream = new StreamReader(_filePath))
        using (var xmlReader = new XmlTextReader(xmlStream))
        {
            while (xmlReader.Read())
            {
                if (_cancel)
                    return;

                if (xmlReader.LineNumber > _totalLines)
                    _totalLines = xmlReader.LineNumber;

                // Push the current line count and elapsed time to the UI
                // on every node read.
                InterThreadHelper.ChangeText(
                    lblLinesRemaining,
                    string.Format("{0} lines", _totalLines));

                string elapsed = string.Format(
                    "{0}:{1}:{2}:{3}",
                    timer.Elapsed.Days.ToString().PadLeft(2, '0'),
                    timer.Elapsed.Hours.ToString().PadLeft(2, '0'),
                    timer.Elapsed.Minutes.ToString().PadLeft(2, '0'),
                    timer.Elapsed.Seconds.ToString().PadLeft(2, '0'));

                InterThreadHelper.ChangeText(lblElapsed, elapsed);

                if (_cancel)
                    return;
            }
        } // the using blocks dispose the reader and stream even when cancelled
    }

Still running, 27 minutes in :(

that0th3rGuy
  • I think a reasonable expectation for streamed processing of XML is about 1Gb/minute. When you say it was taking "really long", you need to quantify this so we can see whether your expectations are reasonable. Of course you need to add to that the cost of doing something useful with the data, like loading it into a database. – Michael Kay Dec 10 '13 at 09:36
  • Thank you for your feedback Michael, please see the updated post for the requested information. Does this answer your question? – that0th3rGuy Dec 10 '13 at 12:49

1 Answer


You can read the XML as a logical stream of elements instead of trying to read it line by line and piece it back together yourself. See the code sample at the end of this article.
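
I don't have the article's exact sample handy, but the usual shape of it is a custom iterator around XNode.ReadFrom, roughly like this (a sketch only, not the article's code):

    using System.Collections.Generic;
    using System.Xml;
    using System.Xml.Linq;

    // Yield each <Folder> element as an XElement without ever loading the
    // whole document; only the current subtree lives in memory at a time.
    static IEnumerable<XElement> StreamFolders(string path)
    {
        using (var reader = XmlReader.Create(path))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Folder")
                    yield return (XElement)XNode.ReadFrom(reader); // advances past the element
                else
                    reader.Read();
            }
        }
    }

One caveat: each yielded <Folder> subtree is still materialized as an XElement, so a single enormous folder (like the one that broke the 2GB limit in SQL Server) will still be expensive, and nested <Folder> elements come back inside their parent rather than separately.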

Also, your question has already been asked here.

radai
  • Thank you for your response. I'm implementing and testing it now. – that0th3rGuy Dec 10 '13 at 09:03
  • Other people have indeed asked the question before, but as far as I know, none of them were dealing with 11GB files. Loading it into memory doesn't really work, processing it line by line doesn't really work. That's my issue. – that0th3rGuy Dec 10 '13 at 09:58
  • The code I linked to should process the XML element by element without loading all of it into RAM. – radai Dec 10 '13 at 11:47
  • Thanks radai. I implemented it and I'm running it now. Please see the updated post for some additional information. – that0th3rGuy Dec 10 '13 at 12:55
  • You're not actually reading the file line by line: you're parsing the document node by node. But I'm sorry, I can't see why it's taking so long (I don't know the Microsoft .NET parser that well; most of my experience with large documents is with Java.) – Michael Kay Dec 10 '13 at 23:10
  • I have no idea why it's taking that much time; like Michael above, most of my experience is with Java (which would have performed much better). Have you tried running this through a profiler? – radai Dec 11 '13 at 05:16
  • Thanks for all your input, guys. I have not tried a profiler yet. I am currently attempting a semi-multi-threaded approach which is going much faster (I reckon about 3.5MB/s, up from 75KB/s), but with the sheer amount of information it's still taking long. – that0th3rGuy Dec 11 '13 at 07:00