I have a file with 4.6 million lines, each of which is a self-contained XML string. I wrote a simple app to shred the XML into a pipe-delimited file. The app slows down over time, eventually writing out only a fraction of the lines per minute that it managed at the start.

Can anyone review my code and give suggestions to speed it up?

string line;
string OutputFilepath = System.IO.Path.GetDirectoryName(txtSourceFile.Text);
string NewFileName = OutputFilepath + string.Format(@"\Results{0}.txt", DateTime.Now.Ticks);

using (System.IO.StreamWriter OutputFile = new System.IO.StreamWriter(NewFileName, true))
using (System.IO.StreamReader file = new System.IO.StreamReader(txtSourceFile.Text))
{
    XElement Stream;
    while ((line = file.ReadLine()) != null)
    {
        //Remove carriage return line feeds.
        line = line.Replace("\r", "");
        line = line.Replace("\n", "");

        //Create pipe delimited file.
        Stream = XElement.Parse(line);
        string PipeDelimited =
            (from el in Stream.Element("QUERY").Elements("ITEM")
             select
                 String.Format("{0}|{1}|{2}|{3}|{4}|{5}|{6}|{7}|{8}|{9}|{10}|{11}|{12}|{13}|{14}|{15}|{16}|{17}|{18}",
                     Text = "",
                     Text = "",
                     Text = "",
                     Text = "",
                     (string)el.Attribute("unparsedname"),
                     Text = "",
                     (string)el.Attribute("addr1"),
                     Text = "",
                     (string)el.Attribute("city"),
                     (string)el.Attribute("state"),
                     (string)el.Attribute("postalcode"),
                     new RegionInfo((string)el.Attribute("countrycodeISO2")).ThreeLetterISORegionName,
                     Text = "",
                     Text = "01/01/" + (string)el.Attribute("dobyear"),
                     Text = "",
                     Text = "",
                     Text = "",
                     Text = "",
                     Text = "A"
                 )
            ).Single();
        {
            OutputFile.WriteLine(PipeDelimited);
        }
    }
    file.Close();
}
Aaron Hurst
  • Does output order matter? – Travis Acton Jan 26 '18 at 18:45
  • Output order does not matter, it just cannot repeat. – Aaron Hurst Jan 26 '18 at 18:46
  • How long are the lines returned by `file.ReadLine()`? Are any large enough to go on the [large object heap](https://stackoverflow.com/q/8951836)? Is it really a well-formed XML file or a large number of XML files concatenated together? – dbc Jan 26 '18 at 19:00
  • How large is the data being read? Could you read into an object, close your reader stream, then open your writer stream and write them out? Your speed issue could be related to having so much in the buffer of each stream at the same time. Also, splitting them will allow you to find out if the speed issue is caused by reading, writing, or doing both at the same time. – user7396598 Jan 26 '18 at 19:06
  • If performance tanks, I/O is usually the culprit. Have you opened Task Manager to see whether your computer is pinned near the end of the process? You might consider separating your I/O functions from your CPU functions. Check out the answer and its explanation on this question: https://stackoverflow.com/questions/20928705/read-and-process-files-in-parallel-c-sharp – Travis Acton Jan 26 '18 at 19:11
  • Have you profiled it? Can you provide a [mcve] -- i.e. a small sample of the XML in question? – dbc Jan 26 '18 at 19:14
  • Are the files being read and written local files or network files? – Cinchoo Jan 26 '18 at 19:35
  • Hints to get started: (1) fix your naming convention, this is confusing for most developers; (2) get rid of that LINQ query, there's absolutely no reason for it; (3) a single line of text won't contain any end-of-line characters, get rid of the Replace; (4) the `Text = ""` should rather be just `""`; (5) replace `String.Format` with `StringBuilder`; (6) learn to use a profiler. – Ondrej Tucny Jan 26 '18 at 21:26
  • It might be helpful if you gathered some telemetry and posted it, for instance the CPU and memory usage graphs from Task Manager on Windows. Also, if you are doing this in Visual Studio, use the profiler to see which line is actually slowing things down. – Superbest Jan 26 '18 at 23:04
  • Also, codereview.stackexchange.com is a more suitable place for general code review. As for this case, simple workaround: Break the 4.6 mil file into smaller files (100k each), and then run it on one file at a time, and when done join all the results together? Also, try writing to a file instead of the console. – Superbest Jan 26 '18 at 23:06
  • How much space do you have on your disk? The slowdown doesn't appear to be in the application. Open Task Manager and check for a memory leak while the application is running. If the memory usage doesn't increase over time, then the issue is with the drive. – jdweng Jan 27 '18 at 09:01
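
The hints in the comments above (drop the LINQ query expression and the `Replace` calls, use plain `""` instead of `Text = ""`, build the record with a `StringBuilder`, and gather some timing telemetry) could be sketched roughly as follows. This is only an illustration under assumptions: the class name `PipeShredder`, the methods `ToPipeDelimited` and `ConvertLines`, the `reportEvery` parameter, and the command-line shape of `Main` are all hypothetical and not from the original post. It is not claimed to remove the slowdown; the per-batch timing is only meant to show when throughput drops.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class PipeShredder
{
    // Converts one line of XML into one pipe-delimited record
    // (19 fields, in the same positions as in the question).
    public static string ToPipeDelimited(string xmlLine)
    {
        // Single() keeps the question's behaviour: exactly one ITEM per line, otherwise throw.
        XElement item = XElement.Parse(xmlLine).Element("QUERY").Elements("ITEM").Single();

        string[] fields =
        {
            "", "", "", "",
            (string)item.Attribute("unparsedname"),
            "",
            (string)item.Attribute("addr1"),
            "",
            (string)item.Attribute("city"),
            (string)item.Attribute("state"),
            (string)item.Attribute("postalcode"),
            new RegionInfo((string)item.Attribute("countrycodeISO2")).ThreeLetterISORegionName,
            "",
            "01/01/" + (string)item.Attribute("dobyear"),
            "", "", "", "",
            "A"
        };

        var record = new StringBuilder();
        for (int i = 0; i < fields.Length; i++)
        {
            if (i > 0) record.Append('|');
            record.Append(fields[i]); // a null string appends nothing, as with String.Format
        }
        return record.ToString();
    }

    // Streams lines from input to output and reports the time taken per batch
    // on stderr, so a drop in throughput becomes visible. Returns the line count.
    public static long ConvertLines(TextReader input, TextWriter output, int reportEvery = 100000)
    {
        long count = 0;
        var clock = Stopwatch.StartNew();
        string line;
        while ((line = input.ReadLine()) != null)
        {
            // ReadLine() already strips the line terminator, so no Replace is needed.
            output.WriteLine(ToPipeDelimited(line));
            if (++count % reportEvery == 0)
            {
                Console.Error.WriteLine($"{count} lines; last batch took {clock.ElapsedMilliseconds} ms");
                clock.Restart();
            }
        }
        return count;
    }

    // Hypothetical command-line shape: args[0] = source file, args[1] = output file.
    public static void Main(string[] args)
    {
        using (var reader = new StreamReader(args[0]))
        using (var writer = new StreamWriter(args[1]))
        {
            ConvertLines(reader, writer);
        }
    }
}
```

If the batch times printed on stderr stay flat, the conversion loop itself is probably not the bottleneck, and the disk or memory checks suggested in the comments would be the next thing to look at. `ConvertLines` takes a `TextReader` and a `TextWriter` so that reading and writing can also be timed separately, for example by writing to `TextWriter.Null`.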

0 Answers