How to find text from pdf image?

Question

I am developing a C# application in which I am converting a PDF document to an image and then rendering that image in a custom viewer.

I've hit a brick wall trying to search for specific words in the generated image, and I was wondering what the best way to go about this would be. Should I find the x,y location of the searched word?

I have tried the iTextSharp and Aspose libraries to extract text from the PDF and then find the word in that text, but I want to find the text in the image itself. — urz shah, Sep 25 '12 at 07:44

score 9 · Accepted Answer · edited May 23 '17 at 10:30

You can use Tesseract OCR in console mode to recognize the text in the image.

I don't know of an SDK that does this for PDF directly.

But if you want to get the coordinates and value of every word, you can use the following fairly simple code (thanks to nguyenq for the hOCR hint):

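using System.Collections.Generic;
using System.Diagnostics;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public void Recognize(Bitmap bitmap)
{
    // Tesseract reads its input from disk, so save the rendered page first.
    bitmap.Save("temp.png", ImageFormat.Png);

    // The "hocr" config makes Tesseract write word bounding boxes to an HTML file.
    var startInfo = new ProcessStartInfo("tesseract.exe", "temp.png temp hocr");
    startInfo.WindowStyle = ProcessWindowStyle.Hidden;
    using (var process = Process.Start(startInfo))
    {
        process.WaitForExit();
    }

    // Some Tesseract versions name the output temp.hocr instead of temp.html.
    var words = GetWords(File.ReadAllText("temp.html"));

    // Further actions with words
}

public Dictionary<Rectangle, string> GetWords(string tesseractHtml)
{
    var xml = XDocument.Parse(tesseractHtml);

    var rectsWords = new Dictionary<Rectangle, string>();

    // Compare local names so the query still matches if the file declares an XHTML namespace.
    // The string cast returns null (instead of throwing) for spans without a class attribute.
    var ocrWords = xml.Descendants()
        .Where(element => element.Name.LocalName == "span" && (string)element.Attribute("class") == "ocr_word")
        .ToList();
    foreach (var ocrWord in ocrWords)
    {
        // The title looks like "bbox left top right bottom"; ignore anything after a ';'.
        var strs = ocrWord.Attribute("title").Value.Split(';')[0].Split(' ');
        int left = int.Parse(strs[1]);
        int top = int.Parse(strs[2]);
        int width = int.Parse(strs[3]) - left + 1;
        int height = int.Parse(strs[4]) - top + 1;
        rectsWords.Add(new Rectangle(left, top, width, height), ocrWord.Value);
    }

    return rectsWords;
}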
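
To get from this dictionary to the location of a searched word, a lookup along the following lines might be enough. This is only a sketch: `WordSearch`, `FindWord` and `Normalize` are hypothetical names that are not part of Tesseract or of the code above, and the punctuation trimming is an assumption about what counts as a match.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;

public static class WordSearch
{
    // Hypothetical helper: returns the bounding box of every OCR word that
    // equals the search term, ignoring case and surrounding punctuation.
    public static List<Rectangle> FindWord(Dictionary<Rectangle, string> rectsWords, string term)
    {
        return rectsWords
            .Where(pair => string.Equals(Normalize(pair.Value), term, StringComparison.OrdinalIgnoreCase))
            .Select(pair => pair.Key)
            .ToList();
    }

    // Strips whitespace and common punctuation so "Hello," still matches "hello".
    private static string Normalize(string word)
    {
        return word.Trim().Trim('.', ',', ';', ':', '!', '?', '"', '\'', '(', ')');
    }
}
```

Calling `WordSearch.FindWord(GetWords(File.ReadAllText("temp.html")), "invoice")` would then return the rectangles to highlight. The rectangles are in the pixel coordinates of the image that was passed to Tesseract, so they need scaling if the viewer shows the page at a different zoom.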

I do not want to make things complex. Is there any simple way or SDK that gives me the coordinates of any word in the PDF image? — urz shah, Sep 25 '12 at 07:20
@urz - This IS a simple SDK. OCR, or text extraction, is not very complex but it is not very simple either. You will have to put some level of effort or work in to get this problem solved. — Kieren Johnstone, Sep 25 '12 at 07:38
Yes, but it's an exe, not a DLL, because Tesseract is a C++ library and it's easier to use it in console mode, as shown in my code. — Ivan Kochurkin, Sep 25 '12 at 07:49

RSB · Answer 2 · 2012-09-25T07:20:24.293

score 2

Use iTextSharp. Make sure the PDF is searchable.

Then use this code:

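using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static string GetTextFromAllPages(string pdfPath)
{
    var reader = new PdfReader(pdfPath);
    try
    {
        var output = new StringWriter();
        // iTextSharp page numbers start at 1.
        for (int i = 1; i <= reader.NumberOfPages; i++)
            output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));
        return output.ToString();
    }
    finally
    {
        reader.Close(); // release the underlying file
    }
}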
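
If you keep the text of each page separately instead of joining it into one string, finding the pages that mention a word is a short loop. A minimal sketch, assuming `pageTexts` holds the per-page results of `PdfTextExtractor.GetTextFromPage`; `PageSearch` and `FindPagesContaining` are hypothetical names, not part of iTextSharp:

```csharp
using System;
using System.Collections.Generic;

public static class PageSearch
{
    // Hypothetical helper: pageTexts[0] is the text of page 1, and so on.
    // Returns the 1-based numbers of the pages whose text contains the term.
    public static List<int> FindPagesContaining(IList<string> pageTexts, string term)
    {
        var pages = new List<int>();
        for (int i = 0; i < pageTexts.Count; i++)
        {
            // Case-insensitive substring match.
            if (pageTexts[i].IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
                pages.Add(i + 1);
        }
        return pages;
    }
}
```

This only tells you which page a word is on. It gives no x,y position within the page.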

edited Sep 25 '12 at 07:20

answered Sep 25 '12 at 07:14


I do not want to make things complex. Is there any simple way or SDK that gives me the coordinates of any word in the PDF image? – urz shah Sep 25 '12 at 07:21
@urzshah - I can't think how this isn't incredibly simple. 5 lines of code is not complex. You may need to rethink what you mean by 'simple' because it doesn't get much easier than this. – Kieren Johnstone Sep 25 '12 at 07:39
I want to find text in the PDF image, not in the actual PDF. The code above seems to extract text from the PDF, not from the PDF image. – urz shah Sep 25 '12 at 07:49
