There are some tools which allow to extract the whole text portion of a PDF file in order to full text index the PDF.
What I need is a way to search for certain strings and, if thery were found in the PDF file, return the page number?
There are some tools which allow to extract the whole text portion of a PDF file in order to full text index the PDF.
What I need is a way to search for certain strings and, if thery were found in the PDF file, return the page number?
This example uses the library included with Adobe Reader, and comes from http://www.dotnetspider.com/resources/5040-Get-PDF-Page-Number.aspx:
using Acrobat;
using AFORMAUTLib;
private void pdfRandD(string fPath)
{
AcroPDDocClass objPages = new AcroPDDocClass();
objPages.Open(fPath);
long TotalPDFPages = objPages.GetNumPages();
objPages.Close();
AcroAVDocClass avDoc = new AcroAVDocClass();
avDoc.Open(fPath, "Title");
IAFormApp formApp = new AFormAppClass();
IFields myFields = (IFields)formApp.Fields;
string searchWord = "Search String";
string k = "";
StreamWriter sw = new
StreamWriter(@"D:\KCG_FileChecker_Inputs\MAC\pdf\0230_525490_23_cha17.txt", false);
for (int p = 0; p < TotalPDFPages; p++)
{
int numWords = int.Parse(myFields.ExecuteThisJavascript("event.value=this.getPageNumWords(" + p + ");"));
k = "";
for (int i = 0; i < numWords; i++)
{
string chkWord = myFields.ExecuteThisJavascript("event.value=this.getPageNthWord(" + p + "," + i + ", true);");
k = k + " " + chkWord;
}
if(k.Trim().Contains(searchWord))
{
int pNum = int.Parse(myFields.ExecuteThisJavascript("event.value=this.getPageLabel(" + p + ",true);"));
sw.WriteLine("The Word " + searchWord + " is exists in " + pNum);
}
}
sw.Close();
MessageBox.Show("Process completed");
}
You can use Docotic.Pdf library to search for text in PDF files.
Following sample shows how to find specified strings in a PDF file and corresponding page numbers:
static void searchForTextStrings()
{
string path = "";
string[] stringsToFind = new string[] { };
using (PdfDocument pdf = new PdfDocument(path))
{
for (int i = 0; i < pdf.Pages.Count; i++)
{
string pageText = pdf.Pages[i].GetText();
foreach (string s in stringsToFind)
{
int index = pageText.IndexOf(s, 0, StringComparison.CurrentCultureIgnoreCase);
if (index != -1)
Console.WriteLine("'{0}' found on page {1}", s, i);
}
}
}
}
A case-sensitive search can be conducted if you remove third argument of IndexOf method.
Disclaimer: I work for Bit Miracle, vendor of the library.