
I have a list of files (whose sizes range from 3 KB to more than 500 MB) that I need to parse.

To do this faster, I would like to use Parallel.ForEach to iterate over my list of files.

I know I can use:

Parallel.ForEach(files, new ParallelOptions { MaxDegreeOfParallelism = 2 }, file =>
{
    //Do stuff
});

This makes sure only two files are processed at the same time. However, when both files are 500 MB or more, I get an out-of-memory exception.

Do you know if there is a way in C# to limit the parallelism with a condition rather than a fixed number? Ideally, I would like to process as many files as possible as long as the total size of the files currently being processed stays below 1 GB (otherwise, wait until some of the in-flight files are done).

I was also thinking of ordering my list of files by size and pairing the first one with the last one in a Parallel.ForEach loop (assuming their combined size is below 1 GB). But once again, I am not sure:

  1. If this is possible
  2. What would be the syntax

As far as I understand, Parallel.ForEach iterates through the list in the given order (in which case it would be impossible to control how my list is traversed...)
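For illustration, here is a rough, untested sketch of the pairing idea I have in mind (requires System.Linq; `files` stands for my List&lt;FileInfo&gt;, and the parse body is a placeholder):

var ordered = files.OrderBy(f => f.Length).ToList();
var pairs = new List<FileInfo[]>();
for (int i = 0, j = ordered.Count - 1; i <= j; i++, j--)
{
    // Pair the smallest remaining file with the largest remaining one
    pairs.Add(i == j ? new[] { ordered[i] } : new[] { ordered[i], ordered[j] });
}

Parallel.ForEach(pairs, new ParallelOptions { MaxDegreeOfParallelism = 2 }, pair =>
{
    foreach (var file in pair)
    {
        // Parse 'file' here
    }
});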

Any advice on how you would do this is appreciated.


Edit1:

Here is the code I use to read my files. I need to start reading from a specific node, "RootElt", which is why I don't use File.ReadAllText():

using (XmlReader reader = XmlReader.Create(fi.FullName))
{
    if (reader.ReadToDescendant("RootElt"))
        return reader.ReadOuterXml();
}
return string.Empty;

NB: I was initially using XDocument and simply doing doc.Load(), but that caused an out-of-memory exception (even when processing the files one by one), which is not the case with the XmlReader solution.
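A possible variation (an untested sketch on my side, not the code I actually run): deserialize straight from the reader's subtree, so the "RootElt" fragment never has to be materialized as one big string. MyReport is a placeholder for the real type:

using (XmlReader reader = XmlReader.Create(fi.FullName))
{
    if (reader.ReadToDescendant("RootElt"))
    {
        using (XmlReader subtree = reader.ReadSubtree())
        {
            var serializer = new XmlSerializer(typeof(MyReport)); // MyReport = placeholder type
            return (MyReport)serializer.Deserialize(subtree);
        }
    }
}
return null;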

Once read, I call my deserialize method:

private T Deserialize<T>(string xml)
{
    using (TextReader reader = new StringReader(xml))
    {
        XmlSerializer serializer = new XmlSerializer(typeof(T));
        var report = serializer.Deserialize(reader);
        return (T)report;
    }
}
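One change suggested in the comments (a small, untested sketch, not the code as it currently stands; uses System.Collections.Concurrent) is to cache a single XmlSerializer per type, so repeated and possibly concurrent calls don't keep generating new serialization assemblies:

private static readonly ConcurrentDictionary<Type, XmlSerializer> SerializerCache =
    new ConcurrentDictionary<Type, XmlSerializer>();

private T Deserialize<T>(string xml)
{
    // Reuse one serializer per type instead of constructing a new one on every call
    var serializer = SerializerCache.GetOrAdd(typeof(T), t => new XmlSerializer(t));
    using (TextReader reader = new StringReader(xml))
    {
        return (T)serializer.Deserialize(reader);
    }
}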
    How did you parse your files? – Pavel Anikhouski Dec 30 '19 at 16:14
    We need a [mcve]. What are you doing that is producing out-of-memory-exceptions? If your code is leaking memory, that needs to be addressed. –  Dec 30 '19 at 16:15
  • The `XmlReader` and `XmlSerializer` classes are known to leak memory when misused. Please share some code with us. And can you confirm that your application is running as 64-bit? –  Dec 30 '19 at 16:17
  • Sure, please see my edit1 on my initial post – WilliamW Dec 30 '19 at 16:19
    When you create an `XmlSerializer`, it creates a cached assembly for the passed-in type (if it hasn't already). If you do that in a multithreaded environment, you could very easily be producing multiple copies of the same dynamic assembly, which are *not* garbage collected. I strongly suspect this is the source of your memory leak. –  Dec 30 '19 at 16:25
  • Yes I just saw it [here](https://stackoverflow.com/questions/23897145/memory-leak-using-streamreader-and-xmlserializer) after you mentioned XmlSerializer can memory leak... – WilliamW Dec 30 '19 at 16:26
  • Second, you're using a `XmlReader` to give you the text content of the file. Just use `File.ReadAllText` instead. I don't think this is a leak, but you're reading a file into memory, then converting that to a string, thus doubling the needed memory. `File.ReadAllText` will read a single copy into memory. –  Dec 30 '19 at 16:29
    It might be enough to simply add a `lock` around where you create the serializer. That should be sufficient to ensure you aren't creating multiple copies of the dynamic serialization assemblies. However, in my opinion, the I/O code isn't really parallelizable. You might do better with the [Task Parallel Library: Dataflow](https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/dataflow-task-parallel-library) –  Dec 30 '19 at 16:35
  • I will first try to cache the instance of xmlserializer as you mentioned it is leaking. And let you know if that works fine once done! – WilliamW Dec 30 '19 at 16:36
  • You might also consider using `XDocument` instead. It's easier to work with, generally. –  Dec 30 '19 at 16:39
  • Have you considered [task parallelism](https://www.tutorialspoint.com/data-parallelism-vs-task-parallelism) instead of data parallelism? In other words, having one thread reading the xml files one by one, and another thread processing selected parts of the files. This is usually implemented with the producer-consumer pattern, either with the [`BlockingCollection`](https://learn.microsoft.com/en-us/dotnet/api/system.collections.concurrent.blockingcollection-1) class or the [TPL Dataflow](https://www.nuget.org/packages/Microsoft.Tpl.Dataflow) library (a rough sketch follows after these comments). – Theodor Zoulias Dec 30 '19 at 18:09
  • Indeed, that can help me. I will have a look at the doc first as I am not very familiar with TPL Dataflow library. Thank you – WilliamW Dec 31 '19 at 07:55
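A rough, untested sketch of the producer-consumer pattern suggested in the comment above, using BlockingCollection (every name here is a placeholder: `files` is the List&lt;FileInfo&gt; to parse, `ReadRootElt` is a hypothetical helper wrapping the XmlReader snippet from Edit1, and `Report` stands for the real type):

var queue = new BlockingCollection<string>(boundedCapacity: 2);

var producer = Task.Run(() =>
{
    foreach (var fi in files)
        queue.Add(ReadRootElt(fi));    // read one file at a time; blocks when the queue is full
    queue.CompleteAdding();
});

var consumer = Task.Run(() =>
{
    foreach (var xml in queue.GetConsumingEnumerable())
    {
        var report = Deserialize<Report>(xml); // 'Report' is a placeholder type
        // Process 'report' here
    }
});

Task.WaitAll(producer, consumer);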

1 Answer


You can use the following multithreaded construction:

public class FileProcessor
{
    private const long TotalSizeMax = 1073741824; // 1 GB
    private static readonly object _sizeLock = new object(); // lock target must be a reference type
    private static long _totalSizeCurrent;

    public void ProcFiles(IList<FileInfo> fiList)
    {
        var totalFiles = fiList.Count;
        var index = 0;
        while (totalFiles > index)
        {
            var fi = fiList[index];
            Monitor.Enter(_sizeLock);
            var totalCandidate = _totalSizeCurrent + fi.Length;
            if (totalCandidate > TotalSizeMax)
            {
                // Budget exceeded: release the lock, wait, then retry the same file
                Monitor.Exit(_sizeLock);
                Task.Delay(2000).Wait(); // delay 2 seconds
                continue;
            }
            _totalSizeCurrent = totalCandidate; // reserve this file's size
            Monitor.Exit(_sizeLock);
            Task.Run(() =>
            {
                // Start parse FileInfo fi
                //...

                // End parse: release the reserved size
                Monitor.Enter(_sizeLock);
                _totalSizeCurrent -= fi.Length;
                Monitor.Exit(_sizeLock);
            });

            index++;
        }
    }
}
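A hypothetical usage sketch (not part of the answer as posted; the directory path is made up):

var files = new DirectoryInfo(@"C:\data\xml").GetFiles("*.xml");
new FileProcessor().ProcFiles(files);

Note that ProcFiles returns as soon as the last file has been handed to a task, so the final parses may still be running when it exits.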
    That's (I guess) what I was asking for. I managed to bypass my issue by reading only the elements and tags of interest (filtering the content instead of reading everything in the XML). I will accept this answer as it addresses my initial question. Thank you – WilliamW Dec 31 '19 at 13:52