Multithreading with Task reading PDF files using C#

Question

I'm trying to extract text from a very large PDF file (91914 pages) using iTextSharp. Extraction takes a long time, so I want to use Task or threads to speed up the process.

I have something like this:

var processTask = new List<Task>();
using (PdfReader reader = new PdfReader(fileInfo.FullName))
{

   for (int startpage = 1; 
         startpage <= reader.NumberOfPages;
         startpage = startpage + num + 1)
   {
     processTask.Add(Task.Run(() => ProccesSinglePDF(
       reader, 
       sourcePath + "PDFs\\" + (object)startpage + ".pdf",
       startpage,
       startpage + num,
       new FileInfo(
            sourcePath + "PDFs\\" + (object)startpage + ".pdf"), searchText)));
   }

   foreach (var task in processTask)
   {
      await task;
   }
}

The ProcessSinglePDF method searches for the text I'm looking for and makes some calls to the database (getting data and updating some values). It does not seem to be working correctly: it finishes very quickly and doesn't process all the pages (I know this because I added a Console.WriteLine(startpage) to check).

I know absolutely nothing about reading PDFs or iTextSharp specifically, but generally "readers" are not expected to be used in parallel from multiple threads. While waiting for more specific feedback on your question, please read the documentation to confirm that the reader is meant to be used the way you are using it. Also re-read the [mre] guidance for posting code and make sure what you have is enough to understand the problem (I'm pretty sure it is not, as there is no way to know how the code (mis)uses the reader) — Alexei Levenkov, Oct 16 '22 at 02:47
Most likely your code is susceptible to this problem: [Captured variable in a loop in C#](https://stackoverflow.com/questions/271440/captured-variable-in-a-loop-in-c-sharp). — Theodor Zoulias, Oct 16 '22 at 03:13
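
To illustrate the captured-variable comment above: in a C# `for` loop the loop variable is a single variable shared by all iterations, so a lambda passed to Task.Run may read `startpage` after the loop has already advanced it. The following is a minimal sketch, not the asker's actual code; the class name `CaptureDemo`, the method `RunAsync`, and the local `first` are hypothetical names introduced for illustration. It copies the loop variable into a per-iteration local before capturing it.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class CaptureDemo
{
    // Returns the start page that each task actually observed.
    // totalPages and num mirror reader.NumberOfPages and num in the question.
    public static async Task<int[]> RunAsync(int totalPages, int num)
    {
        var tasks = new List<Task<int>>();
        for (int startpage = 1; startpage <= totalPages; startpage = startpage + num + 1)
        {
            // Per-iteration copy: each lambda captures its own 'first',
            // not the shared loop variable 'startpage'.
            int first = startpage;
            tasks.Add(Task.Run(() => first));
        }
        // Task.WhenAll returns the results in the order the tasks were added.
        return await Task.WhenAll(tasks);
    }
}
```

If the lambda captured `startpage` directly, the values seen by the tasks would depend on timing and could include a value past the last page, which would be consistent with the pages being skipped in the question.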

score 3 · Accepted Answer · answered Oct 16 '22 at 07:54

A few observations, as I am not familiar with iTextSharp.

From the GitHub repo at https://github.com/schourode/iTextSharp-LGPL/blob/master/src/core/iTextSharp/text/pdf/PdfReader.cs it seems that this class is not thread-safe by design (many fields, no locks).

You are sharing a single instance among many threads (this seems to be your intent, though Task.Run does not necessarily spawn a new thread).

PdfReader does expose an alternative constructor, which you should use to give each of your threads its own PdfReader instance (copied from the repo above):

    /** Creates an independent duplicate.
    * @param reader the <CODE>PdfReader</CODE> to duplicate
    */    
    public PdfReader(PdfReader reader) {

Create a new task for each range of pages, giving each task its own PdfReader instance, and then await all of the tasks with Task.WhenAll.
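
The one-task-per-page-range pattern can be sketched as follows. This is a rough sketch under stated assumptions, not a definitive implementation: the class `PageRanges` and the methods `Split` and `ProcessAllAsync` are hypothetical names, and the per-range work is passed in as a delegate so the sketch runs without iTextSharp. In real use the delegate would be the place to create its own reader with the copy constructor quoted above, `new PdfReader(sharedReader)`, and to dispose it when the range is done.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PageRanges
{
    // Splits pages 1..totalPages into inclusive (First, Last) ranges
    // of at most chunkSize pages each.
    public static List<(int First, int Last)> Split(int totalPages, int chunkSize)
    {
        var ranges = new List<(int First, int Last)>();
        for (int first = 1; first <= totalPages; first += chunkSize)
        {
            ranges.Add((first, Math.Min(first + chunkSize - 1, totalPages)));
        }
        return ranges;
    }

    // Starts one task per range and returns a task that completes
    // when every range has been processed.
    public static Task<T[]> ProcessAllAsync<T>(
        int totalPages, int chunkSize, Func<int, int, T> processRange)
    {
        var tasks = new List<Task<T>>();
        foreach (var (first, last) in Split(totalPages, chunkSize))
        {
            // foreach variables are fresh on every iteration, so they are safe to capture.
            // Assumption: processRange creates and disposes its own PdfReader copy.
            tasks.Add(Task.Run(() => processRange(first, last)));
        }
        return Task.WhenAll(tasks);
    }
}
```

A call such as `await PageRanges.ProcessAllAsync(reader.NumberOfPages, 1000, (first, last) => ...)` would then replace the manual loop over `processTask`; the chunk size of 1000 is an arbitrary illustrative value. Whether duplicating a reader for a file of this size is cheap enough per chunk is not established here and would need measuring.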

Thanks, @MaLio, I will check the repo and follow your recommendation! — Elmer A. Chacon, Oct 17 '22 at 00:58
