I have many large files. They all need to be parsed, and that takes quite a long time. So I came up with an idea: while one thread reads a file (the hard disk's read speed is the bottleneck here), another thread parses the required information from the lines; while parsing takes place, the reading thread moves on to the next file, and so on.

I could simply create two threads: one reads all the files with File.ReadAllLines and the other parses the returned arrays. However, this would consume too much memory, so I need to limit the number of files held in memory at once, to 5 for example.

Another problem I'm having is waiting for the line-reading step to complete. The parsing thread needs to know whether an array is ready to be parsed.

The question is: which approach should I follow? Is there an example of this (I couldn't find any)? Or is there a better idea?

gunakkoc
  • What does the parsing code look like? What are you doing with the data you parse? – Steve Dec 11 '12 at 12:09
  • I am parsing variables by checking each line of a program's output file. The lines are not all read just once, because if certain conditions are met I have to go back to the nth line. I store each piece of information in a variable and display them in DataGridViews. – gunakkoc Dec 11 '12 at 12:17
  • OK, perhaps you could post your sequential code so we have a better idea of what you are trying to do. Also, are some of the individual files very large, or could you read an entire file into a string array and process it in memory, rather than reading line by line? – Steve Dec 11 '12 at 12:25

3 Answers

If you are sure that parsing is ALWAYS faster than reading, you can do it very simply: thread A (which exists just so the UI thread is not blocked) reads a file, then starts a new thread B and passes the file content to it (using Tasks instead of threads makes this easier). Put that in a loop and you are done. Since parsing is faster, the second thread/task will have finished before thread A starts the next one. So you will have at most two threads running and two files in memory at the same time.

"waiting for the obtaining-lines process to be completed. The parsing thread should know whether there is a ready to parse array."

I'm not sure I understand this correctly, but it would be solved by the above "solution", because you start a new thread/task when, and only when, the file has been read completely.
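
As a rough illustration of this simple variant, here is a minimal C# sketch. The class name ReadThenParse and the helper Parse are hypothetical stand-ins (the real parser is not shown in the question), and the sketch assumes, as stated above, that parsing is faster than reading; it does nothing to cap memory if that assumption fails.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

static class ReadThenParse
{
    // Hypothetical stand-in for the real parser: counts non-empty lines.
    static int Parse(string[] lines)
    {
        return lines.Count(l => l.Length > 0);
    }

    public static int Run(IEnumerable<string> paths)
    {
        var parseTasks = new List<Task<int>>();
        foreach (string path in paths)
        {
            // The reading loop does the disk I/O...
            string[] lines = File.ReadAllLines(path);
            // ...and hands the result to a task, so parsing overlaps
            // with reading the next file.
            parseTasks.Add(Task.Run(() => Parse(lines)));
        }
        // Block until every parse task has finished, then add up the results.
        Task.WaitAll(parseTasks.ToArray());
        return parseTasks.Sum(t => t.Result);
    }

    static void Main(string[] args)
    {
        Console.WriteLine("Non-empty lines: {0}", Run(args));
    }
}
```

Run blocks its caller, so in a WinForms application it would have to be called from a background thread or task rather than from the UI thread.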

UPDATE: If processing is NOT (always) faster than reading, you could do it like this, for example:

' Requires at the top of the file: Imports System.IO and Imports System.Threading.Tasks
Private MaxTasks As Integer = 4

' Returns a Task (instead of being an Async Sub) so the caller can await it
' and exceptions are not lost.
Private Async Function ReadAndProcess(ByVal FileList As List(Of String)) As Task

    Dim ProcessTasks As New List(Of Task)

    For Each fi In FileList
        Dim tmp = fi
        Console.WriteLine("Reading {0}", tmp)
        ' Read on a thread-pool thread so the UI thread stays responsive.
        Dim FileContent = Await Task.Run(Of Byte())(Function() As Byte()
                                                        Return File.ReadAllBytes(tmp)
                                                    End Function)
        If ProcessTasks.Count >= MaxTasks Then
            Console.WriteLine("I have to wait!")
            ' Wait until one of the running processing tasks has finished.
            Dim NextReady = Await Task.WhenAny(ProcessTasks)
            ProcessTasks.Remove(NextReady)
        End If

        Console.WriteLine("I can start a new process-task!")
        ProcessTasks.Add(Task.Run(Sub()
                                      Console.WriteLine("Processing {0}", tmp)
                                      ' Dummy work standing in for the real parsing.
                                      Dim l As Long
                                      For Each b In FileContent
                                          l += b
                                      Next
                                      System.Threading.Thread.Sleep(2000)
                                      Console.WriteLine("Done with {0}", tmp)
                                  End Sub))
    Next

    ' Wait for the processing tasks that are still running.
    Await Task.WhenAll(ProcessTasks)

End Function

Private Async Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click

    Dim ofd As New OpenFileDialog
    ofd.Multiselect = True
    If ofd.ShowDialog = Windows.Forms.DialogResult.OK AndAlso ofd.FileNames.Length >= 1 Then
        Await ReadAndProcess(ofd.FileNames.ToList)
    End If

End Sub

The idea (which, as usual, could be implemented in four dozen ways in .NET) is simply that you schedule new processing tasks until you reach your self-imposed limit. Once it is reached, you "wait" until a task finishes and then start a new one.

UPDATE 2: With the TPL Dataflow library it might look like this:

Private Sub Doit()

    Dim ABProcess As New ActionBlock(Of Tuple(Of String, Byte()))(Sub(tp)
                                                                      Console.WriteLine("Processing {0}", tp.Item1)
                                                                      Dim l As Long
                                                                      For Each el In tp.Item2
                                                                          l += el
                                                                      Next
                                                                      System.Threading.Thread.Sleep(1000)
                                                                      Console.WriteLine("Done with {0}", tp.Item1)
                                                                  End Sub, New ExecutionDataflowBlockOptions With {.MaxDegreeOfParallelism = 4, .BoundedCapacity = 4})

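    ' Async Function (not Async Sub) so the block receives a Task it can track;
    ' with Async Sub the block would treat the item as done at the first Await.
    Dim ABRead As New ActionBlock(Of String())(Async Function(sarr)
                                                   For Each s In sarr
                                                       Console.WriteLine("Reading {0}", s)
                                                       Dim t = New Tuple(Of String, Byte())(s, File.ReadAllBytes(s))
                                                       ' SendAsync waits while ABProcess is at its BoundedCapacity.
                                                       Dim taken = Await ABProcess.SendAsync(t)
                                                       Console.WriteLine("Output taken = {0}", taken)
                                                   Next
                                                   Console.WriteLine("All reading done")
                                               End Function)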

    Dim ofd As New OpenFileDialog
    ofd.Multiselect = True
    If ofd.ShowDialog = Windows.Forms.DialogResult.OK Then
        ABRead.Post(ofd.FileNames)
    End If

End Sub

Which version is "nicer" might be a matter of personal taste ;) Personally, I might prefer the "manual" version, because the new TPL Dataflow blocks are sometimes VERY black-boxish.

igrimpe

I have a very similar application and use BlockingCollection.

BlockingCollection Overview

In my case parsing is faster than reading, but the problem I had was that the files are not all the same size, so the reader could end up waiting on the parser.
With BlockingCollection, a queue size of 8 absorbed the size variance and kept memory in check.
You could also make the parse step parallel if parsing is the slower part.
If you are reading from a single disk head, then parallel reads would not help.

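// NonBlockingConsumer and NonBlockingProducer are assumed to be defined elsewhere (not shown here).
static void Main(string[] args)
{
    // A blocking collection that can hold no more than 5 items at a time.
    BlockingCollection<string[]> fileCollection = new BlockingCollection<string[]>(5);

    // Start one producer and one consumer.
    Task consumer = Task.Factory.StartNew(() => NonBlockingConsumer(fileCollection));  // parse - can use parallel
    Task producer = Task.Factory.StartNew(() => NonBlockingProducer(fileCollection));  // read

    // Keep the process alive until both tasks have finished.
    Task.WaitAll(consumer, producer);
}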
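
The snippet above does not show the producer and consumer bodies. The following is a minimal, self-contained sketch of how they might look. The names FilePipeline, Produce, Consume and Run are hypothetical, and the "parser" is only a stand-in that counts lines, since the real parsing logic is not given in the question.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

static class FilePipeline
{
    // Hypothetical producer: reads each file and queues its lines.
    // Add() blocks while the collection is full, which caps memory use.
    static void Produce(string[] paths, BlockingCollection<string[]> queue)
    {
        try
        {
            foreach (string path in paths)
                queue.Add(File.ReadAllLines(path));
        }
        finally
        {
            // Tell the consumer that nothing more is coming, even on failure.
            queue.CompleteAdding();
        }
    }

    // Hypothetical consumer: stand-in "parser" that just counts lines.
    static int Consume(BlockingCollection<string[]> queue)
    {
        int total = 0;
        // GetConsumingEnumerable blocks until an item is ready and ends
        // once CompleteAdding has been called and the queue is empty.
        foreach (string[] lines in queue.GetConsumingEnumerable())
            total += lines.Length;
        return total;
    }

    public static int Run(string[] paths, int maxQueuedFiles)
    {
        using (var queue = new BlockingCollection<string[]>(maxQueuedFiles))
        {
            Task producer = Task.Run(() => Produce(paths, queue));
            Task<int> consumer = Task.Run(() => Consume(queue));
            Task.WaitAll(producer, consumer);
            return consumer.Result;
        }
    }

    static void Main(string[] args)
    {
        Console.WriteLine("Parsed {0} lines", Run(args, 5));
    }
}
```

The bounded capacity answers both points of the question: the reader stalls in Add() once the limit is reached, and the parser simply blocks in the enumeration until a file is ready. Note that the limit applies to queued files only; the file currently being read and the one currently being parsed sit outside the queue, so up to maxQueuedFiles + 2 files can be in memory at once.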

What is the nature of the parsing?
Are you parsing one line at a time?
I would look at parsing lines in parallel before parsing files in parallel.
Parallel.ForEach Method
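
A minimal sketch of line-level parallelism with Parallel.ForEach, assuming each line can be parsed independently of the others. That assumption may not hold for the asker's files, since the comments mention sometimes jumping back to the nth line. LineParser, ParseLine and ParseAll are hypothetical names, and ParseLine is only a placeholder for the real per-line work.

```csharp
using System;
using System.Threading.Tasks;

static class LineParser
{
    // Hypothetical stand-in for the real per-line parsing.
    static int ParseLine(string line)
    {
        return line.Trim().Length;
    }

    public static int[] ParseAll(string[] lines)
    {
        var results = new int[lines.Length];
        // This overload passes the element's index, so each result lands in
        // its original position even though lines are processed out of order.
        Parallel.ForEach(lines, (line, state, index) =>
        {
            results[index] = ParseLine(line);
        });
        return results;
    }

    static void Main()
    {
        int[] parsed = ParseAll(new[] { " a ", "bb", "" });
        Console.WriteLine(string.Join(",", parsed));
    }
}
```

Each iteration writes to a different array slot, so no locking is needed. For very cheap per-line work the scheduling overhead can outweigh the gain, so it is worth measuring before adopting this.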

paparazzo

You might get an improvement in performance by splitting the tasks, depending on the processing required. The whole process sounds similar to a producer/consumer setup.

You might want to check out Blocking Queue for queueing the work. The idea would be to have the reading thread enqueue items, and the processing thread(s) dequeue and process them.

Kami