I'm trying to extract text from a very large PDF file (91,914 pages) using iTextSharp. This takes a long time, so I want to use tasks or threads to speed up the process.
I have something like this:
var processTask = new List<Task>();
using (PdfReader reader = new PdfReader(fileInfo.FullName))
{
    for (int startpage = 1;
         startpage <= reader.NumberOfPages;
         startpage = startpage + num + 1)
    {
        processTask.Add(Task.Run(() => ProccesSinglePDF(
            reader,
            sourcePath + "PDFs\\" + (object)startpage + ".pdf",
            startpage,
            startpage + num,
            new FileInfo(
                sourcePath + "PDFs\\" + (object)startpage + ".pdf"),
            searchText)));
    }
    foreach (var task in processTask)
    {
        await task;
    }
}
The ProccesSinglePDF method searches each page range for the text I'm looking for and makes some database calls (reading data and updating some values). It doesn't seem to work correctly: the whole run finishes very quickly and not all pages are processed. I know this because I added a Console.WriteLine(startpage) inside the method to confirm.
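
One thing that may be worth checking (a guess, not a confirmed diagnosis): in C#, a lambda created inside a for loop captures the loop variable itself, not its value at that moment, so a task that starts late can see a later value of startpage. The sketch below isolates that behavior without iTextSharp. TotalPages and Num are hypothetical stand-ins for reader.NumberOfPages and num, and the lambdas are invoked after the loop to mimic a task that runs late.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CaptureDemo
{
    // Hypothetical stand-ins for reader.NumberOfPages and num.
    private const int TotalPages = 10;
    private const int Num = 3;

    // Each lambda captures the single shared loop variable.
    public static List<int> SharedVariable()
    {
        var actions = new List<Func<int>>();
        for (int startpage = 1; startpage <= TotalPages; startpage = startpage + Num + 1)
        {
            actions.Add(() => startpage);
        }
        // Invoked only after the loop has finished, like a delayed task.
        return actions.Select(a => a()).ToList();
    }

    // Each lambda captures its own per-iteration copy.
    public static List<int> LocalCopy()
    {
        var actions = new List<Func<int>>();
        for (int startpage = 1; startpage <= TotalPages; startpage = startpage + Num + 1)
        {
            int first = startpage; // fresh variable on every iteration
            actions.Add(() => first);
        }
        return actions.Select(a => a()).ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", SharedVariable())); // 13, 13, 13
        Console.WriteLine(string.Join(", ", LocalCopy()));      // 1, 5, 9
    }
}
```

If this is what is happening in the loop above, copying startpage into a local declared inside the loop body before calling Task.Run would give each task its own page range. Whether a single PdfReader can safely be shared across tasks is a separate question that this sketch does not address.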