0

I've built a program that

  1. takes in a list of record data from a file
  2. parses and cleans up each record in a parsing object
  3. outputs it to an output file

So far this has worked on a single thread, but considering the fact that records can exceed 1 million in some cases, we want to implement this in a multi threading context. Multi threading is new to me in .Net, and I've given it a shot but its not working. Below I will provide more details and code:

Main Class (simplified):

public class MainClass
{
    parseObject[] parseObjects;
    Thread[] threads;
    List<InputLineItem> inputList = new List<InputLineItem>();
    FileUtils fileUtils = new FileUtils();

    public GenParseUtilsThreaded(int threadCount)
    {
        this.threadCount = threadCount;
        Init();
    }

    public void Init()
    {    
        inputList = fileUtils.GetInputList();
        parseObjects = new parseObject[threadCount - 1];
        threads = new Thread[threadCount - 1];
        InitParseObjects();
        Parse();       
     }

    private void InitParseObjects()
    {
//using a ref of fileUtils to use as my lock expression
        parseObjects[0] = new ParseObject(ref fileUtils); 
        parseObjects[0].InitValues();
        for (int i = 1; i < threadCount - 1; i++)
        {
            parseObjects[i] = new parseObject(ref fileUtils);
            parseObjects[i].InitValues();
        }
    }

    private void InitThreads()
    {
        for (int i = 0; i < threadCount - 1; i++)
        {
            Thread t = new Thread(new ThreadStart(parseObjects[0].CleanupAndParseInput));
            threads[i] = t;
        }
    }
    public void Parse()
    {
        try
        {   
            InitThreads();
            int objectIndex = 0;

            foreach (InputLineItem inputLineItem in inputList)
            {
                parseObjects[0].inputLineItem = inputLineItem;
                threads[objectIndex].Start();
                objectIndex++;
                if (objectIndex == threadCount)
                {
                    objectIndex = 0;
                    InitThreads(); //do i need to re-init the threads after I've already used them all once?
                }
            }

        }
        catch (Exception e)
        {
            Console.WriteLine("(286) The following error occured: " + e);
        }

    }


}
}

And my Parse object class (also simplified):

public class ParseObject
    {
        public ParserLibrary parser { get; set; }
        public FileUtils fileUtils { get; set; }
        public InputLineItem inputLineItem { get; set; }

        public ParseObject( ref FileUtils fileUtils)
        {
            this.fileUtils = fileUtils;
        }
       
        public void InitValues()
        {
            //relevant config of parser library object occurs here
        }  

        public void CleanupFields()
        {
            parser.Clean(inputLineItem.nameValue);
            inputLineItem.nameValue = GetCleanupUpValueFromParser();  
        }

        private string GetCleanupFieldValue()
        {
            //code to extract cleanup up value from parses
        }

        public void CleanupAndParseInput()
        {
            CleanupFields();
            ParseInput();
        }

        public void ParseInput()
        {

            try
            {
                parser.Parse(InputLineItem.NameValue);
            }
            catch (Exception e)
            {
            }
        
            try
            {
                lock (fileUtils)
                {
                    WriteOutputToFile(inputLineItem);
                }
            }
            catch (Exception e)
            {
                Console.WriteLine("(414) Failed to write to output: " + e);
            }
        }

        public void WriteOutputToFile(InputLineItem inputLineItem)
        {
          //writes updated value to output file
        }
    }

The error I get is when trying to run the Parse function, I get this message:

An unhandled exception of type 'System.AccessViolationException' occurred in GenParse.NET.dll
Attempted to read or write protected memory. This is often an indication that other memory is corrupt.

That being said, I feel like there's a whole lot more that I'm doing wrong here aside from what is causing that error.

I also have further questions:

  1. Do I create multiple parse objects and iteratively feed them to each thread as I'm attempting to do, or should I use one Parse object that gets shared or cloned across each thread?
  2. If, outside the thread, I change a value in the object that I'm passing to the thread, will that change reflect in the object passed to the thread? i.e, is the object passed by value or reference?
  3. Is there a more efficient way for each record to be assigned to a thread and its parse object than I am currently doing with the objectIndex iterator?

THANKS!

Glenncito
  • 902
  • 1
  • 10
  • 23
  • I'm going to guess that (1) ParserLibrary is implemented at least partially using unmanaged code, and (2) it's not threadsafe. – Matthew Watson Oct 23 '20 at 13:35
  • @MatthewWatson 1) Yep - its a C LIbrary. 2) What makes it not threadsafe, and how would I make it threadsafe? I saw a post that indicates that ParserLibrary should be the expression in the Lock function in order to achieve this. is that correct? – Glenncito Oct 23 '20 at 13:41
  • 1
    Anything that accesses `ParserLibrary` should be in a `lock` statement, yes. There are an unbounded number of ways that something could be not threadsafe, so it's not possible to say exactly why without studying the source code! – Matthew Watson Oct 23 '20 at 13:43
  • I only removed irrelevant code from the code I included so that it still paints the full picture. I didn't include the library because I didn't think it would be relevant to, and also I wouldn't be allowed to share it :) – Glenncito Oct 23 '20 at 13:58

1 Answers1

2

Do I create multiple parse objects and iteratively feed them to each thread as I'm attempting to do, or should I use one Parse object that gets shared or cloned across each thread?

You initialize each thread with new ThreadStart(parseObjects[0].CleanupAndParseInput) so all threads will share the same parse object. It is a fairly safe bet that the parse objects are not threadsafe. So each thread should have a separate object. Note that this might not be sufficient, if the parse library uses any global fields it might be non-threadsafe even when using separate objects.

If, outside the thread, I change a value in the object that I'm passing to the thread, will that change reflect in the object passed to the thread? i.e, is the object passed by value or reference?

Objects (i.e. classes) are passed by reference. But any changes to an object are not guaranteed to be visible in other threads unless a memoryBarrier is issued. Most synchronization code (like lock) will issue memory barriers. Keep in mind that any non-atomic operation is unsafe if a field is written an read concurrently.

Is there a more efficient way for each record to be assigned to a thread and its parse object than I am currently doing with the objectIndex iterator?

Using manual threads in this way is very old-school. The modern, easier, and probably faster way is to use a parallel-for loop. This will try to be smart about how many threads it will use and try to adapt chunk sizes to keep the synchronization overhead low.

        var items = new List<int>();

        ParseObject LocalInit()
        {
            // Do initalization, This is run once for each thread used
            return new ParseObject();
        }

        ParseObject ThreadMain(int value, ParallelLoopState state, ParseObject threadLocalObject)
        {
            // Do whatever you need to do
            // This is run on multiple threads
            return threadLocalObject;
        }

        void LocalFinally(ParseObject obj)
        {
            // Do Cleanup for each thread
        }

        Parallel.ForEach(items, LocalInit, ThreadMain, LocalFinally);

As a final note, I would advice against using multithreading unless you are familiar with the potential dangers and pitfalls it involves, at least for any project where the result is important. There are many ways to screw up and make a program that will work 99.9% of the time, and silently corrupt data the remaining 0.1% of the time.

JonasH
  • 28,608
  • 2
  • 10
  • 23
  • Hi @JonasH, Thank you taking the time to give this answer your caution is noted, though I would still like to test the approach you've suggested. I have 2 further questions: 1) Am I correct in my understanding that the Parallel For-loop approach you've suggested does not require me to still create the thread object at all? 2) Currently I have an array of ParseObjects, as many as the desired threadcount so that each thread has its own due to thread safety concerns. Would I need to do something similar in a parallel for loop, or can I use a singular object? – Glenncito Oct 26 '20 at 17:08
  • 1
    1) Yes, Parallel.For will use threadpool threads so no need to create any threads manually. 2) Use a LocalInit/LocalFinally to create thread-specific objects, that is what it is for. Check the example. – JonasH Oct 27 '20 at 07:29