4

I'm merging text files (.itf) located in a folder, applying some logic along the way. When I compile it as a 32-bit console application (.NET 4.6), everything works fine, except that I get OutOfMemory exceptions when the folders contain a lot of data. Compiling to 64-bit would solve that problem, but the 64-bit process runs much slower than the 32-bit one (more than 15 times slower).

I tried it with BufferedStream and ReadAllLines, but both perform very poorly. The profiler tells me that these methods use 99% of the time. I don't know where the problem is...

Here's the code:

private static void readData(Dictionary<string, Topic> topics)
{
    foreach (string file in Directory.EnumerateFiles(Path, "*.itf"))
    {
        Topic currentTopic = null;
        Table currentTable = null;
        Object currentObject = null;
        using (var fs = File.Open(file, FileMode.Open))
        {
            using (var bs = new BufferedStream(fs))
            {
                using (var sr = new StreamReader(bs, Encoding.Default))
                {
                    string line;
                    while ((line = sr.ReadLine()) != null)
                    {
                        if (line.IndexOf("ETOP") > -1)
                        {
                            currentTopic = null;
                        }
                        else if (line.IndexOf("ETAB") > -1)
                        {
                            currentTable = null;
                        }
                        else if (line.IndexOf("ELIN") > -1)
                        {
                            currentObject = null;
                        }
                        else if (line.IndexOf("MTID") > -1)
                        {
                            MTID = line.Replace("MTID ", "");
                        }
                        else if (line.IndexOf("MODL") > -1)
                        {
                            MODL = line.Replace("MODL ", "");
                        }
                        else if (line.IndexOf("TOPI") > -1)
                        {
                            var name = line.Replace("TOPI ", "");
                            if (topics.ContainsKey(name))
                            {
                                currentTopic = topics[name];
                            }
                            else
                            {
                                var topic = new Topic(name);
                                currentTopic = topic;
                                topics.Add(name, topic);
                            }
                        }
                        else if (line.IndexOf("TABL") > -1)
                        {
                            var name = line.Replace("TABL ", "");
                            if (currentTopic.Tables.ContainsKey(name))
                            {
                                currentTable = currentTopic.Tables[name];
                            }
                            else
                            {
                                var table = new Table(name);
                                currentTable = table;
                                currentTopic.Tables.Add(name, table);
                            }
                        }
                        else if (line.IndexOf("OBJE") > -1)
                        {
                            if (currentTable.Name != "Metadata" || currentTable.Objects.Count == 0)
                            {
                                var shortLine = line.Replace("OBJE ", "");
                                var obje = new Object(shortLine.Substring(shortLine.IndexOf(" ")));
                                currentObject = obje;
                                currentTable.Objects.Add(obje);
                            }
                        }
                        else if (currentTopic != null && currentTable != null && currentObject != null)
                        {
                            currentObject.Data.Add(line);
                        }
                    }
                }
            }
        }
    }
}
Thomas Ayoub
  • 29,063
  • 15
  • 95
  • 142
Chris
  • 234
  • 1
  • 11
  • So where is this `ReadAllLines` that the profiler says is slowing things down? Also, your bottleneck is likely due to `string.IndexOf`. Tip: Invest in creating a proper lexer/parser. – leppie Sep 30 '15 at 08:38
  • I wonder if the amount of string allocations (all these calls to `.Replace` create new strings) are the culprit - a real profiler might tell, but I wonder if a mechanism that takes the whole file as a stream and reads character by character without ever reparsing/manipulating the line would be the better solution here. – Michael Stum Sep 30 '15 at 08:38
  • The code example shows the `BufferedStream` version. I have also one with `ReadAllLines`. In 32bit the profiler indeed says that the `Replace` and `IndexOf` methods consume a lot of time. However, I'm wondering why the 64bit version is so much slower. – Chris Sep 30 '15 at 08:49
  • I'm not sure why the 64-bit version is slower than the 32-bit one. Anyway... each time you call `line.IndexOf`, it's going to read that line from the start again. That's very time consuming. I suggest you implement your own method for finding the index. – M.kazem Akhgary Sep 30 '15 at 08:49
  • @M.kazemAkhgary I'm not sure that you can do a lot better than the Microsoft version of IndexOf... – Thomas Ayoub Sep 30 '15 at 08:52
  • Can you post a sample of the text file? – jdweng Sep 30 '15 at 08:57
  • @Thomas as the search elements are of size `4`, you can iterate over the string only once, taking characters `4` by `4` and checking them against the search elements, then return an index plus a value which tells what was found. – M.kazem Akhgary Sep 30 '15 at 09:00
  • What tells you that the file is not like this: `SOMEGARBADGETHATISNOTAMULTIPLEOF4MODL` ? – Thomas Ayoub Sep 30 '15 at 09:02
  • How do you know 64-bit is slower than a non-working 32-bit version? The 64-bit version stores much more data in memory which might cause pagefile swapping and/or purge the disk cache. – adrianm Sep 30 '15 at 09:06
  • What's the matter? It's just a string. You can use a `dictionary` if the order of indexes matters. @Thomas – M.kazem Akhgary Sep 30 '15 at 09:11
  • @M.kazemAkhgary the matter is that the parser is not that easy to implement. Furthermore the OP is using it as `Contains` – Thomas Ayoub Sep 30 '15 at 09:12
  • well it is not easy for everyone! @Thomas – M.kazem Akhgary Sep 30 '15 at 09:15

4 Answers

4

The biggest problem with your program is that, when you let it run in 64-bit mode, it can read a lot more files. Which is nice; a 64-bit process has a thousand times more address space than a 32-bit process, so running out of it is excessively unlikely.

But you do not get a thousand times more RAM.

The universal principle of "there is no free lunch" is at work. Having enough RAM matters a great deal in a program like this. First and foremost, it is used by the file system cache, that magical operating system feature that makes it look like reading files from a disk is very cheap. It is not cheap at all, it is one of the slowest things you can do in a program, but the cache is very good at hiding it. You'll invoke it when you run your program more than once: the second, and subsequent, times you won't read from the disk at all. That's a pretty dangerous feature, and very hard to avoid when you test your program; it gives you very unrealistic assumptions about how efficient the program is.

The problem with a 64-bit process is that it easily makes the file system cache ineffective: since it can read a lot more files, it overwhelms the cache and gets old file data evicted. Now the second time you run your program it will not be fast anymore. The files you read will no longer be in the cache and must be read from the disk. You'll now see the real perf of your program, the way it will behave in production. That's a good thing, even though you don't like it very much :)

The secondary problem with RAM is the lesser one: if you allocate a lot of memory to store the file data, then you force the operating system to find the RAM to store it. That can cause a lot of hard page faults, incurred when it must unmap memory used by another process, or yours, to free up the RAM that you need. A generic problem called "thrashing". Page faults are something you can see in Task Manager; use View > Select Columns to add the column.

Given that the file system cache is the most likely source of the slowdown, a simple test you can do is rebooting your machine, which ensures that the cache cannot hold any of the file data, then running the 32-bit version. The prediction is that it will also be slow, and that BufferedStream and ReadAllLines will be the bottlenecks. Like they should be.

One final note, even though your program doesn't match the pattern, you cannot make strong assumptions about .NET 4.6 perf problems yet. Not until this very nasty bug gets fixed.

Hans Passant
  • 922,412
  • 146
  • 1,693
  • 2,536
1

A few tips:

  • Why do you use File.Open, then a BufferedStream, then a StreamReader, when you can do the job with just a StreamReader, which is already buffered?
  • You should reorder your conditions so that the ones that occur most often come first.
  • Consider reading all lines and then using Parallel.ForEach
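
The first two tips can be sketched together. This is only a sketch, not the OP's code: it assumes the four-letter tags always appear at the very start of a line (which the `Replace("TOPI ", "")` calls in the question already rely on), so `StartsWith` plus `Substring` can replace the full-line `IndexOf`/`Replace` scans:

```csharp
// Sketch, assuming tags like "TOPI " always start the line.
// StreamReader buffers internally, so no BufferedStream is needed.
using (var sr = new StreamReader(file, Encoding.Default))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        if (line.StartsWith("TOPI "))
        {
            var name = line.Substring(5);   // drop only the "TOPI " prefix
            // ... look up or create the Topic as in the original code
        }
        else if (line.StartsWith("TABL "))
        {
            var name = line.Substring(5);
            // ... look up or create the Table
        }
        // ... remaining tags, ordered by how often they actually occur
    }
}
```

`StartsWith` stops at the first mismatching character, whereas `IndexOf` scans the entire line (including the long data payload) before giving up, so on lines that are mostly data this avoids a lot of wasted work.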
Thomas Ayoub
  • 29,063
  • 15
  • 95
  • 142
  • Thanks for your hints, I implemented them. The parallelism, though, does not work in my case: I have to parse the files in sequence because of the model of the content. – Chris Sep 30 '15 at 11:41
1

I could solve it. It seems there is a bug in the .NET compiler: removing the code-optimization checkbox in VS2015 led to a huge performance increase. Now it runs with performance similar to the 32-bit version. My final version with some optimizations:

private static void readData(ref Dictionary<string, Topic> topics)
    {
        Regex rgxOBJE = new Regex("OBJE [0-9]+ ", RegexOptions.IgnoreCase | RegexOptions.Compiled);
        Regex rgxTABL = new Regex("TABL ", RegexOptions.IgnoreCase | RegexOptions.Compiled);
        Regex rgxTOPI = new Regex("TOPI ", RegexOptions.IgnoreCase | RegexOptions.Compiled);
        Regex rgxMTID = new Regex("MTID ", RegexOptions.IgnoreCase | RegexOptions.Compiled);
        Regex rgxMODL = new Regex("MODL ", RegexOptions.IgnoreCase | RegexOptions.Compiled);
        foreach (string file in Directory.EnumerateFiles(Path, "*.itf"))
        {
            if (file.IndexOf("itf_merger_result") == -1)
            {
                Topic currentTopic = null;
                Table currentTable = null;
                Object currentObject = null;
                using (var sr = new StreamReader(file, Encoding.Default))
                {
                    Stopwatch sw = new Stopwatch();
                    sw.Start();
                    Console.WriteLine(file + " read, parsing ...");
                    string line;
                    while ((line = sr.ReadLine()) != null)
                    {
                        if (line.IndexOf("OBJE") > -1)
                        {
                            if (currentTable.Name != "Metadata" || currentTable.Objects.Count == 0)
                            {
                                var obje = new Object(rgxOBJE.Replace(line, ""));
                                currentObject = obje;
                                currentTable.Objects.Add(obje);
                            }
                        }
                        else if (line.IndexOf("TABL") > -1)
                        {
                            var name = rgxTABL.Replace(line, "");
                            if (currentTopic.Tables.ContainsKey(name))
                            {
                                currentTable = currentTopic.Tables[name];
                            }
                            else
                            {
                                var table = new Table(name);
                                currentTable = table;
                                currentTopic.Tables.Add(name, table);
                            }
                        }
                        else if (line.IndexOf("TOPI") > -1)
                        {
                            var name = rgxTOPI.Replace(line, "");
                            if (topics.ContainsKey(name))
                            {
                                currentTopic = topics[name];
                            }
                            else
                            {
                                var topic = new Topic(name);
                                currentTopic = topic;
                                topics.Add(name, topic);
                            }
                        }
                        else if (line.IndexOf("ETOP") > -1)
                        {
                            currentTopic = null;
                        }
                        else if (line.IndexOf("ETAB") > -1)
                        {
                            currentTable = null;
                        }
                        else if (line.IndexOf("ELIN") > -1)
                        {
                            currentObject = null;
                        }
                        else if (currentTopic != null && currentTable != null && currentObject != null)
                        {
                            currentObject.Data.Add(line);
                        }
                        else if (line.IndexOf("MTID") > -1)
                        {
                            MTID = rgxMTID.Replace(line, "");
                        }
                        else if (line.IndexOf("MODL") > -1)
                        {
                            MODL = rgxMODL.Replace(line, "");
                        }
                    }
                    sw.Stop();
                    Console.WriteLine(file + " parsed in {0}s", sw.ElapsedMilliseconds / 1000.0);
                }
            }
        }
    }
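
For reference, the "code optimization" checkbox mentioned above corresponds to the `<Optimize>` MSBuild property, so the same change can be made directly in the `.csproj` file (fragment only; the `Condition` attribute varies per project and configuration):

```xml
<PropertyGroup Condition=" '$(Configuration)|$(Platform)' == 'Release|AnyCPU' ">
  <Optimize>false</Optimize>
</PropertyGroup>
```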
Chris
  • 234
  • 1
  • 11
0

Removing the code-optimization checkbox should typically result in performance slowdowns, not speedups. There may be an issue in the VS 2015 product. Please provide a stand-alone repro case with an input set for your program that demonstrates the performance problem, and report it at: http://connect.microsoft.com/