
I have to process roughly 170,000 files and would like to use multiple threads. The file names are sequential, following a Year-Number format, and the files are sorted by year into folders (but they could all be in the same folder). Different years have different file counts. The files are small, only a few (10 < size < 20) KB each.

The order in which they are processed doesn't matter, since the output of the processing task is going to be stored in a SQL database. What would be the best way to do this, without opening the same file twice?

CodesInChaos
rukinhas
    Where and what is the question? – lboshuizen Jan 14 '13 at 16:12
  • You have asked no question. – Servy Jan 14 '13 at 16:12
  • What have you tried? Use `Directory.GetFiles` to get a list of the files, and then use "parallel for" or whatever you like to iterate over them. – CodesInChaos Jan 14 '13 at 16:12
  • I have tried creating a new thread and passing it the starting file name and the number of files I'd like that thread to handle... but it isn't very dynamic... is there an easier way? – rukinhas Jan 14 '13 at 16:15
  • I've asked: what would be the best way to do this, without opening the same file twice? – rukinhas Jan 14 '13 at 16:16
  • Why do you want to use multiple threads? Multiple threads generally lend themselves to CPU bound tasks. If you have multiple threads trying to open and read multiple files won't you just cause your hard disk to thrash? – Daniel Kelley Jan 14 '13 at 16:31
  • I would like to speed up the job... since the files are small and I'll have to write the output of the processing to a network database, I thought I could get a speed gain using threads! – rukinhas Jan 14 '13 at 16:43
  • @rukinhas More threads doesn't necessarily mean more speed. However, it seems you have a number of answers so I guess you are in a good position to put that to the test. – Daniel Kelley Jan 14 '13 at 16:53
  • Why does it need to be multi-threaded? – Sam I am says Reinstate Monica Jan 14 '13 at 19:21

5 Answers


One possible solution would be to use the producer/consumer design pattern.

Your producer would get the list of files and feed a producer/consumer queue. Your consumers would take a file (or the file path) from the queue and process it (insert it into your database). With that approach, every file would be processed only once.

The producer/consumer queue problem is described in the C# producer/consumer SO question.
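A minimal sketch of this pattern using `BlockingCollection<T>`; the folder path `C:\data` and the choice of 4 consumers are placeholders, and `ProcessFile` stands in for the real parsing and database insert:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class ProducerConsumerDemo
{
    static void Main()
    {
        // Bounded queue so the producer cannot run far ahead of the consumers.
        using (var queue = new BlockingCollection<string>(boundedCapacity: 1000))
        {
            // Producer: enumerate the files once, so each path enters the queue exactly once.
            Task producer = Task.Run(() =>
            {
                foreach (string path in Directory.EnumerateFiles(@"C:\data", "*", SearchOption.AllDirectories))
                    queue.Add(path);
                queue.CompleteAdding();
            });

            // Consumers: GetConsumingEnumerable hands each path to exactly one consumer.
            var consumers = new Task[4];
            for (int i = 0; i < consumers.Length; i++)
            {
                consumers[i] = Task.Run(() =>
                {
                    foreach (string path in queue.GetConsumingEnumerable())
                        ProcessFile(path);
                });
            }

            Task.WaitAll(consumers);
            producer.Wait();
        }
    }

    static void ProcessFile(string path)
    {
        // placeholder for the real parsing / SQL insert
    }
}
```

Because the queue is consumed, not iterated, two consumers can never see the same path, which answers the "without opening the same file twice" requirement directly.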

EDIT

However, the task might get complicated, e.g.:

  • What happens if one of the existing files changes? Do you have to update the database with the new file content? If so, you need a mechanism of "markers" saying that a file has changed (the file's last-update date could work in some cases).
  • What happens if new files are added during the process? Etc.
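One simple marker, sketched below, is the file's last-write timestamp. `ChangeTracker` is a hypothetical helper, not part of the original answer, and the `seen` dictionary stands in for whatever table you would persist the markers to:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class ChangeTracker
{
    // Returns true when the file is new or has changed since the recorded marker.
    // "seen" would be loaded from / saved back to the database in a real job.
    public static bool NeedsProcessing(string path, IDictionary<string, DateTime> seen)
    {
        DateTime lastWrite = File.GetLastWriteTimeUtc(path);
        DateTime recorded;
        if (seen.TryGetValue(path, out recorded) && recorded == lastWrite)
            return false;           // unchanged since the last run
        seen[path] = lastWrite;     // record the new marker
        return true;
    }
}
```

As the answer notes, the last-write date only works "in some cases": it can be reset without the content changing, so a content hash is the safer (but slower) alternative.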
Tom
  • Hello, thanks for your answer. The files won't change. If new files are added I can run the whole job again, or I was thinking of saving the names of the files already processed so that I wouldn't have to do them again. Thanks for the help – rukinhas Jan 14 '13 at 16:30
  • @rukinhas: In that case the producer/consumer design pattern should work fine for you. Please bear in mind that the number of processing threads (consumers) should not be too high (it is not possible to give you an optimal number, as it depends on hardware, OS, etc.). Otherwise you can lose some performance to e.g. context switching. – Tom Jan 14 '13 at 16:46

I'd say 1 thread per year. Each "year thread" reads the files that start with that year number and reads them sequentially. As for going to the database, I'd suggest you either:

  • If everything goes into a single table, remove the indexes so no index locking occurs, and recreate the indexes afterwards
  • If you can't remove the indexes, at least use row locking, and a wait period for transactions before timing out (two or more threads may be inserting at the same time)

Another solution would be for the threads to generate the insert statements into a file and then execute that file to do the inserts, or you could use a bulk-insert tool. But this depends on the table structure and your DBMS.
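For SQL Server specifically, `SqlBulkCopy` plays the role of the bulk-insert tool. A sketch under assumed names: the table `dbo.Results` and its two columns are placeholders to adjust to your schema:

```csharp
using System.Data;
using System.Data.SqlClient;

static class ResultWriter
{
    // Collect processed rows in a DataTable, then push them in one bulk
    // operation instead of issuing ~170,000 individual INSERT statements.
    public static DataTable BuildTable()
    {
        var table = new DataTable();
        table.Columns.Add("FileName", typeof(string));
        table.Columns.Add("Payload", typeof(string));
        return table;
    }

    public static void BulkInsert(DataTable rows, string connectionString)
    {
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.Results";
            bulk.BatchSize = 5000;   // commit in batches to keep lock times short
            bulk.WriteToServer(rows);
        }
    }
}
```

Each worker thread can buffer its results into its own `DataTable` and flush periodically, which keeps the threads from contending over single-row inserts on the network database.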

Carlos Grappa
  • 2,351
  • 15
  • 18
  • I have only three years, starting in 2010 and ending in 2012, so only 3 threads seems too few to me... but since this is IO-based, maybe I'd hit an IO bottleneck if more threads were added... Thanks for the help – rukinhas Jan 14 '13 at 16:35

Here is a little example:

public static class FilesProcessor
{
    private static List<FileProcessor> m_FileProcessors;

    public static void Start()
    {
        m_FileProcessors = new List<FileProcessor>();

        for (Int32 year = 2005; year < DateTime.Now.Year; ++year)
            InstanciateFileProcessor(year);

        while (!FinishedLoading())
            Application.DoEvents();
    }

    public static void Stop()
    {
        foreach (FileProcessor processor in m_FileProcessors)
            processor.Stop();

        m_FileProcessors.Clear();
        m_FileProcessors = null;
    }

    private static Boolean FinishedLoading()
    {
        foreach (FileProcessor processor in m_FileProcessors)
        {
            if (processor.IsAlive() && !processor.FinishedLoading())
                return false;
        }

        return true;
    }

    private static void InstanciateFileProcessor(Int32 year)
    {
        FileProcessor processor = new FileProcessor(year);
        processor.Start();

        m_FileProcessors.Add(processor);
    }
}

Then the FileProcessor class:

public sealed class FileProcessor
{
    private Int32 m_Year;
    private Thread m_Thread;

    public Boolean IsAlive()
    {
        return ((m_Thread != null) && m_Thread.IsAlive);
    }

    public Boolean FinishedLoading()
    {
        return ((m_Thread == null) || m_Thread.Join(10));
    }

    public FileProcessor(Int32 year)
    {
        m_Year = year;

        m_Thread = new Thread(Load);
        m_Thread.Name = "Background File Processor";
    }

    public void Start()
    {
        if (m_Thread != null)
            m_Thread.Start();
    }

    public void Stop()
    {
        if ((m_Thread != null) && m_Thread.IsAlive)
            m_Thread.Abort();
    }

    private void Load()
    {
        // Browse the Year folder...
        // Get and read all files one by one...
    }
}
Tommaso Belluzzo
  • 23,232
  • 8
  • 74
  • 98

I can see two possible approaches here.

First, split your problem into two: 1 - work out what to process; 2 - do the processing. Part 1 probably has to run on its own, so you end up with a 100% accurate list of what needs processing. Then you can implement fancy (or not-so-fancy) logic for splitting the list and introducing multiple threads.

Second, do something similar to what @CarlosGrappa suggests. So essentially you create each thread with its own "pre-programmed" filter. It could be the year, as Carlos suggests. Or you could create 24 threads, one for each hour of the file timestamp. Or 60 threads, each looking at a particular minute past the hour. It can basically be anything that gives you a definite criterion for (a) splitting the load as evenly as possible, and (b) guaranteeing that a data file is processed once and only once.

Clearly the second of these approaches would run more quickly, but you'd have to put some extra thought into how you split the files up. With the first method, once you've got the full list, you could basically throw 100, 1000, or 10,000 files at a time at your processors without being overly smart about it.
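The first method's batching step can be sketched as below; `Batcher` is a hypothetical helper, and a batch size of 100, 1000, or 10,000 is passed in by the caller:

```csharp
using System;
using System.Collections.Generic;

static class Batcher
{
    // Split the complete file list into fixed-size batches; each file lands in
    // exactly one batch, so no file can be processed twice.
    public static IEnumerable<List<string>> Chunk(IList<string> files, int batchSize)
    {
        for (int i = 0; i < files.Count; i += batchSize)
        {
            int count = Math.Min(batchSize, files.Count - i);
            var batch = new List<string>(count);
            for (int j = i; j < i + count; j++)
                batch.Add(files[j]);
            yield return batch;
        }
    }
}
```

Each batch can then be queued to a worker thread or thread-pool task; the even, file-count-based split avoids the uneven loads that a per-year split can produce.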

PeteH
  • 2,394
  • 1
  • 20
  • 29

What's wrong with using .NET's Parallel class?

Just pass a collection to the parallel foreach loop; .NET does all the assigning for you. You can also pass in a custom partitioner to get chunk partitioning. Chunk partitioning causes the threads to keep asking for more work. Without it, all the work is pre-allocated, which hurts performance when some tasks take longer than others (some threads can end up idle while one thread still has work to do).

http://msdn.microsoft.com/en-us/library/dd460720.aspx
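A minimal sketch of that loop with an on-demand (load-balancing) partitioner; the folder path is a placeholder and `ProcessFile` stands in for your parsing and database insert:

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class ParallelDemo
{
    static void Main()
    {
        string[] files = Directory.GetFiles(@"C:\data", "*", SearchOption.AllDirectories);

        // loadBalance: true hands files out in chunks on demand, so fast threads
        // keep pulling new work instead of idling while slow threads finish.
        Parallel.ForEach(Partitioner.Create(files, loadBalance: true), ProcessFile);
    }

    static void ProcessFile(string path)
    {
        // parse the file and write the result to the database
    }
}
```

`Parallel.ForEach` guarantees each element is handed to exactly one iteration, so the "no file opened twice" requirement is met without any bookkeeping.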

bjoern
  • 1,009
  • 3
  • 15
  • 31
  • I was not aware of that method! I'll have to read up on it and run some tests, but it seems to be a good solution to my problem. – rukinhas Jan 14 '13 at 18:47
  • This is by far the easiest solution. As long as you use .NET's parallel tools and collections inside the loop you should be fine. I just had to copy millions of files to certain locations based on some logic, and I processed them in parallel using Parallel.ForEach. – bjoern Jan 14 '13 at 20:29