
Currently I'm using the following code snippet to convert a .txt file with XML data to .CSV format. It works perfectly with files that are around 100-200 MB, and the conversion time is very low (1-2 minutes max). However, I now need it to work for much bigger files (1-2 GB each). With those, the program freezes the computer and the conversion takes about 30-40 minutes. I'm not sure how to change this function to handle them. Any help will be appreciated!

    // Read the entire file into memory and wrap it in a root element so it parses as one document
    string all_lines = File.ReadAllText(p);

    all_lines = "<Root>" + all_lines + "</Root>";
    XmlDocument doc_all = new XmlDocument();
    doc_all.LoadXml(all_lines);
    StreamWriter write_all = new StreamWriter(FILENAME1);
    XmlNodeList rows_all = doc_all.GetElementsByTagName("XML");

    // Write one CSV line per <XML> element, collapsing whitespace in each child node's text
    foreach (XmlNode rowtemp in rows_all)
    {
        List<string> children_all = new List<string>();
        foreach (XmlNode childtemp in rowtemp.ChildNodes)
        {
            children_all.Add(Regex.Replace(childtemp.InnerText, "\\s+", " "));
        }
        write_all.WriteLine(string.Join(",", children_all.ToArray()));
    }
    write_all.Flush();
    write_all.Close();

Sample Input::

 <XML><DSTATUS>1,4,7,,5</DSTATUS><EVENT> hello,there,my,name,is,jack,</EVENT>
     last,name,missing,above <ANOTHERTAG>3,6,7,,8,4</ANOTHERTAG> </XML>

 <XML><DSTATUS>1,5,7,,3</DSTATUS><EVENT>hello,there,my,name,is,mary,jane</EVENT>
     last,name,not,missing,above<ANOTHERTAG>3,6,7,,8,4</ANOTHERTAG></XML>

Sample Output::

1,4,7,,5,hello,there,my,name,is,jack,,last,name,missing,above,3,6,7,,8,4
1,5,7,,3,hello,there,my,name,is,mary,jane,last,name,not,missing,above,3,6,7,,8,4
sparta93
  • You want to parse a 2Gb file. It's going to take time. – Liam Jun 18 '15 at 13:04
  • Possible duplicate of [The most efficient way to parse Xml](http://stackoverflow.com/questions/606304/the-most-efficient-way-to-parse-xml) – Liam Jun 18 '15 at 13:07
  • Looking at sample input and output I'd say read it line by line **as text** and write everything except tags (`<` + text inside + `>`). That should be the fastest way. Based on your last comment, concatenate lines until the result contains `</XML>`. – Sinatr Jun 18 '15 at 13:14

4 Answers


You need to take a streaming approach, as you're currently reading the entire 2 GB file into memory and then processing it. You should read a bit of XML, write a bit of CSV and keep doing that until you've processed it all.

A possible solution is below:

// Requires System.Linq, System.Text.RegularExpressions and System.Xml.Linq.
using (var writer = new StreamWriter(FILENAME1))
{
    // p is the path to the input file, as in the question
    foreach (var element in StreamElements(p, "XML"))
    {
        // Take every text node under <XML>, collapse whitespace, and join with commas
        var values = element.DescendantNodes()
            .OfType<XText>()
            .Select(e => Regex.Replace(e.Value, "\\s+", " "));

        var line = string.Join(",", values);

        writer.WriteLine(line);
    }
}

Where StreamElements is inspired by Jon Skeet's streaming of XElements from an XmlReader in an answer to this question. I've made some changes to support your 'invalid' XML (as you have no root element):

private static IEnumerable<XElement> StreamElements(string fileName, string elementName)
{
    // Fragment conformance allows multiple top-level elements, i.e. no single root element
    var settings = new XmlReaderSettings
    {
        ConformanceLevel = ConformanceLevel.Fragment
    };

    using (XmlReader reader = XmlReader.Create(fileName, settings))
    {
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element)
            {
                if (reader.Name == elementName)
                {
                    // Materialise just this one element, then let it go out of scope
                    var el = XNode.ReadFrom(reader) as XElement;
                    if (el != null)
                    {
                        yield return el;
                    }
                }
            }
        }
    }
}
Charles Mager
  • Just tested this, it is significantly faster in reading and writing, however the output is just 1 single line. Basically all the data in csv format together. – sparta93 Jun 18 '15 at 13:42
  • @sparta93 I've made a few edits since I first wrote it - check that what you have matches what's there now. Specifically, it originally said `writer.Write` instead of `writer.WriteLine`. I've also removed another line from `StreamElements` that would have caused you to miss the first line in your XML. – Charles Mager Jun 18 '15 at 13:44
  • Thank you so much for this, helped me a lot! – sparta93 Jun 18 '15 at 13:49
  • No problem. I'd be interested to know the timing difference with this vs what you had before (I'm too lazy to try and create a 200Mb and 2Gb XML file!). – Charles Mager Jun 18 '15 at 13:50
  • For a 1 GB .txt file with xml data, my old method took around ~25-30 minutes, this method takes ~45 seconds. What a change in performance! – sparta93 Jun 18 '15 at 13:52
  • Actually I just noticed something in the outputs: for lines in the input where a value like `1,3,4,5,,` sits inside a child tag and further text like `2,5,6,7` follows outside it, the output is only `1,3,4,5,,` for that line. It skips everything after the child element's closing tag. Any idea what's causing this? – sparta93 Jun 18 '15 at 14:04
  • OK, that's a little different. That's what's known as 'mixed content' where an element contains text as well as other elements. I'll update the answer to just return all text nodes within the element. I'm impressed by the performance difference! – Charles Mager Jun 18 '15 at 14:10

If you're prepared to consider a completely different way of doing it, download Saxon-EE 9.6, get an evaluation license, and run the following streaming XSLT 3.0 code:

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template name="main">
  <xsl:stream href="input.xml">
    <xsl:for-each select="*/*">
       <xsl:value-of select="*!normalize-space()" separator=","/>
       <xsl:text>&#xa;</xsl:text>
    </xsl:for-each>
  </xsl:stream>
</xsl:template>

</xsl:stylesheet>
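
In case it's useful, here is roughly how that stylesheet might be invoked from C# with the Saxon .NET API (my sketch, not part of the answer: the file paths are placeholders, and the `Processor`/`Load30`/`CallTemplate` usage is my reading of the Saxon 9.6 API, so check it against the Saxon documentation):

// Requires a reference to the Saxon-EE .NET assembly (Saxon.Api namespace).
using System;
using Saxon.Api;

class RunStylesheet
{
    static void Main()
    {
        var processor = new Processor(true);             // true = enable the licensed (EE) features needed for streaming
        var compiler = processor.NewXsltCompiler();
        var executable = compiler.Compile(new Uri("file:///C:/temp/tocsv.xsl"));   // placeholder path to the stylesheet above
        var transformer = executable.Load30();

        // The stylesheet opens input.xml itself via xsl:stream, so we only
        // call the named template and serialize its output to a file.
        var serializer = processor.NewSerializer();
        serializer.SetOutputFile(@"C:\temp\output.csv");  // placeholder output path
        transformer.CallTemplate(new QName("main"), serializer);
    }
}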
Michael Kay

It freezes because of `File.ReadAllText(p);`

Do not read the complete file into memory. (This will first start the machine swapping, then grind it to a halt because no more memory is available.)

Use a chunking approach: read line by line, convert line by line, write line by line.

Use a lower-level XML reader class, not `XmlDocument`.
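
A rough sketch of that chunking idea, assuming (as in the sample input) that each record starts with `<XML>` and ends with `</XML>` and may span several lines; the variable names are mine, and `p`/`FILENAME1` are the input and output paths from the question:

// Needs System.IO, System.Linq, System.Text, System.Text.RegularExpressions and System.Xml.Linq.
using (var reader = new StreamReader(p))
using (var writer = new StreamWriter(FILENAME1))
{
    var record = new StringBuilder();
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        record.Append(line).Append(' ');
        if (!line.Contains("</XML>"))
            continue;                                // record spans multiple lines, keep reading

        // Parse just this one <XML> record, emit one CSV line, then discard it
        var element = XElement.Parse(record.ToString());
        var values = element.DescendantNodes()
                            .OfType<XText>()
                            .Select(t => Regex.Replace(t.Value, "\\s+", " "));
        writer.WriteLine(string.Join(",", values));
        record.Clear();
    }
}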

DrKoch
  • Each input (from `<XML>` to `</XML>`) can span either one line or multiple lines. I'm not sure what to use and how... – sparta93 Jun 18 '15 at 13:14
  • See http://stackoverflow.com/questions/606304/the-most-efficient-way-to-parse-xml for some details. – DrKoch Jun 18 '15 at 13:16

There are two options. First, to hide the program freeze, use a BackgroundWorker. Second, read your text file string by string, using any reader for it (an XML reader or a plain text/file reader). You can combine both.
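
A minimal sketch of the BackgroundWorker part; it only keeps the UI responsive, the conversion itself still needs the streaming changes from the other answers, and `ConvertFile`, `inputPath` and `outputPath` are hypothetical names:

// Needs System.ComponentModel (and System.Windows.Forms for the MessageBox).
var worker = new BackgroundWorker();

worker.DoWork += (s, e) =>
{
    // Runs on a thread-pool thread, so the UI thread is not blocked
    ConvertFile(inputPath, outputPath);   // hypothetical method wrapping the XML -> CSV conversion
};

worker.RunWorkerCompleted += (s, e) =>
{
    // Raised back on the UI thread once DoWork has finished
    MessageBox.Show(e.Error == null ? "Conversion finished." : e.Error.Message);
};

worker.RunWorkerAsync();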

Oxoron
  • This does not contain anywhere near enough detail. *Use any reader* is also totally wrong: [each reader has pros and cons](http://stackoverflow.com/a/606332/542251), and if you're looking for efficiency you need to choose your reader carefully. – Liam Jun 18 '15 at 13:10