
I want to search for particular text in a PDF file. If the PDF contains images as well as paragraphs, I want to search the text in both the images and the paragraphs, and show the results in the view. How can I achieve this?

I have the following code from another source, but I don't know whether it searches text inside images or not.

    // Requires: using System; using System.Collections.Generic; using System.Text;
    //           using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
    string file = Server.MapPath("~/images/OoPdfFormExample.pdf");
    if (System.IO.File.Exists(file))
    {
        string searchText = txtSearh.Text.Trim();
        StringBuilder pdfText = new StringBuilder();
        PdfReader pdfReader = new PdfReader(file);
        try
        {
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // Extracts only content stored as text; text inside images is not returned.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
                // GetTextFromPage already returns a .NET string, so no re-encoding is needed.
                pdfText.AppendLine(currentText);
            }
        }
        finally
        {
            pdfReader.Close();
        }

        // Split on any whitespace so words separated by line breaks are not glued together.
        string[] words = pdfText.ToString().Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        List<string> matchedWord = new List<string>();
        foreach (string item in words)
        {
            // Case-insensitive substring match against each word.
            if (item.IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                matchedWord.Add(item);
            }
        }
    }

Can somebody help?

PK-1825
  • Is your PDF file a scanned image? – Allanckw Dec 26 '17 at 05:00
  • Yes, it's a scanned PDF image. – PK-1825 Dec 26 '17 at 05:11
  • Possible duplicate of [How to find text from pdf image?](https://stackoverflow.com/questions/12577752/how-to-find-text-from-pdf-image) – TheGeneral Dec 26 '17 at 05:17
  • I have an image inside the PDF file that contains some text. I want to search for particular text in that image as well as in the regular text of the PDF. – PK-1825 Dec 26 '17 at 05:25
  • Text extraction from an image requires OCR. Tesseract, mentioned in the duplicate, is such an OCR engine. – Amedee Van Gasse Dec 26 '17 at 05:53
  • iTextSharp only extracts text from text-based PDF files; you need OCR for image-to-text extraction. – Allanckw Dec 26 '17 at 08:14
  • iText (the company) is looking into OCR technology. A proof of concept has been developed. But currently we are looking into which OCR vendor suits our software offering best. The underlying idea would be to have iText (the library) extract the images from a PDF, then feed them into some OCR software, and finally let iText handle the coordinate transformation (from image space to pdf space). – Joris Schellekens Dec 26 '17 at 11:03
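
The approach described in the comments above (pull the images out of the PDF, then run OCR on them) could be sketched roughly as follows. This is only a sketch under assumptions: it assumes iTextSharp 5.x and the `Tesseract` NuGet wrapper with English trained data in a `tessdata` folder. `ImageCollector` and `PdfOcrSearch` are made-up names for illustration. Some image encodings may not be readable by either library, so real code would need error handling around `GetImage()` and `Pix.LoadFromMemory`. It does not do the coordinate transformation back to PDF space.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using Tesseract; // assumption: the "Tesseract" NuGet wrapper is installed

// Hypothetical listener that collects the bytes of every image drawn on a page.
class ImageCollector : IRenderListener
{
    public readonly List<byte[]> Images = new List<byte[]>();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        PdfImageObject image = renderInfo.GetImage();
        if (image != null)
        {
            Images.Add(image.GetImageAsBytes());
        }
    }
}

static class PdfOcrSearch
{
    // Returns the 1-based page numbers whose embedded images contain searchText.
    public static List<int> FindPages(string pdfPath, string tessdataPath, string searchText)
    {
        var hits = new List<int>();
        var reader = new PdfReader(pdfPath);
        try
        {
            var parser = new PdfReaderContentParser(reader);
            using (var engine = new TesseractEngine(tessdataPath, "eng", EngineMode.Default))
            {
                for (int pageNo = 1; pageNo <= reader.NumberOfPages; pageNo++)
                {
                    var collector = new ImageCollector();
                    parser.ProcessContent(pageNo, collector);

                    foreach (byte[] bytes in collector.Images)
                    {
                        using (var pix = Pix.LoadFromMemory(bytes))
                        using (var ocrPage = engine.Process(pix))
                        {
                            string text = ocrPage.GetText();
                            if (text.IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0)
                            {
                                hits.Add(pageNo);
                                break; // one hit per page is enough
                            }
                        }
                    }
                }
            }
        }
        finally
        {
            reader.Close();
        }
        return hits;
    }
}
```

The result could then be combined with the matches from `PdfTextExtractor`, so that both the regular text and the text inside images are covered by one search.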

0 Answers