I'm working with a PDF in Hebrew that contains diacritical marks. I want to extract all the words together with their coordinates. I tried iTextSharp and PDFClown, and neither gave me what I want.

With PDFClown, letters/characters are missing; with iTextSharp, I don't get the word coordinates.

Is there a way to do it? (I'm looking for a free framework or code.)

EDIT:

PDFClown Code:

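    // Open the PDF and extract the text strings of the first page,
    // grouped by the area (RectangleF?) they were found in.
    File file = new File(PDFFilePath);
    TextExtractor te = new TextExtractor();
    IDictionary<RectangleF?, IList<ITextString>> strs = te.Extract(file.Document.Pages[0].Contents);

    List<string> correctText = new List<string>();
    foreach (var key in strs.Keys)
    {
        foreach (var value in strs[key])
        {
            // Hebrew is right-to-left, so reverse the extracted character order.
            string reversedText = new string(value.Text.Reverse().ToArray());
            // RemoveDiacritics is my own helper (not shown).
            string cleanText = RemoveDiacritics(reversedText);
            correctText.Add(cleanText);
        }
    }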
Alex K
  • As you didn't explain how exactly you tried it using iTextSharp or PDFClown, it's difficult to tell what you did wrong. – mkl Sep 26 '15 at 17:15
  • I added the code for PDFClown. As for iTextSharp, I don't have the code, but if you know how to do it, please tell me. – Alex K Sep 26 '15 at 17:21
  • That is quite unfortunate; after all, there you seem to have gotten all the words, merely not the positions, and adding those is not too difficult. In the context of PDFClown, can you share an example PDF and point out which letters were missing? – mkl Sep 26 '15 at 17:24
  • http://www.filedropper.com/test23 There are multiple missing letters. For example, for the line מָתֵמָטִיקָה לְבֵית-הַסֵּפֶר הַיְּסוֹדִי I get: מָ תֵ מָ טִ יקָ ה לְ בֵ ית-הַ ֵּ פֶ ר הַ ְּ סֹדִ י And for a single word, for example הַיְּסוֹדִי, I get: הַ ְּ סֹדִ י – Alex K Sep 26 '15 at 17:34
  • Ok, I'll look into that tomorrow in office. – mkl Sep 27 '15 at 08:08

1 Answer

You aren't showing how you are trying to extract text using iText(Sharp). I am assuming that you are following the official documentation and that your code looks like this:

    public string ExtractText(byte[] src) {
        PdfReader reader = new PdfReader(src);
        // Custom listener that receives every text snippet on the page.
        MyTextRenderListener listener = new MyTextRenderListener();
        PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
        // Process the content stream of page 1 using that page's resources.
        PdfDictionary pageDic = reader.GetPageN(1);
        PdfDictionary resourcesDic = pageDic.GetAsDict(PdfName.RESOURCES);
        processor.ProcessContent(
           ContentByteUtils.GetContentBytesForPage(reader, 1), resourcesDic);
        return listener.Text.ToString();
    }

If your code doesn't look like this, that already explains the first thing you're doing wrong.

In this method, there is one class that isn't part of iTextSharp: MyTextRenderListener. This is a class you have to write yourself; it could look like this, for instance:

    public class MyTextRenderListener : IRenderListener {
        public StringBuilder Text { get; set; }

        public MyTextRenderListener() {
            Text = new StringBuilder();
        }
        // Mark the start of a text block with "<".
        public void BeginTextBlock() {
            Text.Append("<");
        }
        // Mark the end of a text block with ">" and a line break.
        public void EndTextBlock() {
            Text.AppendLine(">");
        }
        // Images are ignored.
        public void RenderImage(ImageRenderInfo renderInfo) {
        }
        // Write each text snippet followed by the start point of its baseline.
        public void RenderText(TextRenderInfo renderInfo) {
            Text.Append("<");
            Text.Append(renderInfo.GetText());
            LineSegment segment = renderInfo.GetBaseline();
            Vector start = segment.GetStartPoint();
            Text.Append("| x=");
            Text.Append(start[Vector.I1]);
            Text.Append("; y=");
            Text.Append(start[Vector.I2]);
            Text.Append(">");
        }
    }

When you run this code and look at what's inside Text, you'll notice that a PDF document doesn't store words. Instead, it stores text blocks. In our special IRenderListener, we indicate the start and the end of text blocks using < and >. Inside these text blocks, you'll find text snippets. We mark text snippets like this: <text snippet| x=36.0000; y=806.0000> where the x and y values give you the coordinates of the start of the baseline (as opposed to the ascent and descent positions). You can also get the end position of the baseline (and of the ascent/descent lines).

Now how do you distill words out of all of this? The problem with the text snippets you get is that they don't correspond to words. See for instance this file: hello_reverse.pdf

When you open it in Adobe Reader, you read "Hello World Hello People." You'd hope you'd find four words in the content stream, wouldn't you? In reality, this is what you'll find:

    <>
    <<ld><Wor><llo><He>>
    <<Hello People>>

To distill the words "World" and "Hello" from the first line, you need to do plenty of math. Instead of getting the baseline of the TextRenderInfo object passed to the RenderText() method of your render listener, you have to use its GetCharacterRenderInfos() method. This returns a list of TextRenderInfo objects, one per character, each giving you more information about that character (including its position). You then need to compose the words from those individual characters.

This is explained in mkl's answer to this question: Retrieve the respective coordinates of all words on the page with itextsharp
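
To give an idea of the kind of math involved, here is a minimal sketch of the grouping step only. It is not iText code: CharBox, WordBox, WordGrouper, lineTolerance and maxGap are hypothetical names and values made up for this illustration. It assumes you have already copied, for every character returned by GetCharacterRenderInfos(), the text and the start and end points of GetBaseline() into a CharBox.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container: one character and its baseline, in page units.
public class CharBox {
    public string Text;
    public float StartX, EndX, Y;
    public CharBox(string text, float startX, float endX, float y) {
        Text = text; StartX = startX; EndX = endX; Y = y;
    }
}

// Hypothetical result type: a word and the baseline span it covers.
public class WordBox {
    public string Text;
    public float StartX, EndX, Y;
}

public static class WordGrouper {
    // lineTolerance and maxGap are assumed tuning values, not iTextSharp settings.
    public static List<WordBox> Group(IEnumerable<CharBox> chars,
                                      float lineTolerance = 2f,
                                      float maxGap = 1.5f) {
        // 1. Cluster characters into lines, top of the page first
        //    (PDF y coordinates grow upward).
        var lines = new List<List<CharBox>>();
        foreach (var c in chars.OrderByDescending(ch => ch.Y)) {
            if (lines.Count == 0 ||
                Math.Abs(lines[lines.Count - 1][0].Y - c.Y) > lineTolerance) {
                lines.Add(new List<CharBox>());
            }
            lines[lines.Count - 1].Add(c);
        }

        // 2. Within each line, walk left to right and cut a word at every
        //    whitespace character or horizontal gap wider than maxGap.
        //    LINQ's OrderBy is stable, so characters that share both Y and
        //    StartX stay in their input order.
        var words = new List<WordBox>();
        foreach (var line in lines) {
            WordBox current = null;
            foreach (var c in line.OrderBy(ch => ch.StartX)) {
                bool isSpace = string.IsNullOrWhiteSpace(c.Text);
                bool gap = current != null && c.StartX - current.EndX > maxGap;
                if (current != null && (isSpace || gap)) {
                    words.Add(current);
                    current = null;
                }
                if (isSpace) continue;
                if (current == null) {
                    current = new WordBox {
                        Text = "", StartX = c.StartX, EndX = c.EndX, Y = c.Y
                    };
                }
                current.Text += c.Text;
                current.EndX = Math.Max(current.EndX, c.EndX);
            }
            if (current != null) words.Add(current);
        }
        return words;
    }
}
```

The sketch produces words in visual left-to-right order, so for a right-to-left script such as Hebrew the characters of each word would still have to be reversed. The thresholds would probably need tuning per document, for instance relative to the font size.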

We've done similar projects. One of them is described here: https://www.youtube.com/watch?v=lZnbhnU4m3Y

You'll need to do quite some coding to get it right. One word about PdfClown: your text is probably stored as Unicode in your PDF. To retrieve the correct characters, the parser needs to examine the mapping between the glyphs stored in the font and the corresponding Unicode characters. If PdfClown returns the wrong characters, it isn't performing this task correctly. PdfClown is a one-man project, so you'll have to ask its developer to fix this (if he has the time).

As you can tell from the video, iText could help you out, but iText is a company with subsidiaries in the US, Belgium and Singapore. It is a company with many employees and to keep that company running, we need to make money (that's how we pay our employees). Hence you shouldn't expect that we help you for free. Surely you can understand this as you wouldn't want to work for free either, would you?

Bruno Lowagie
  • What can I do when two chars/letters have the same coordinates? Their start positions are the same. – Alex K Sep 28 '15 at 17:20
  • Are those characters by any chance making ligatures? Also: each char has metrics such as "advance" and "bounding box". These metrics can play an important role too. – Bruno Lowagie Sep 29 '15 at 06:19
  • I think so (ligatures). Why do you ask? The bounding boxes are the same and I couldn't find "advance" (just to remind you, I'm working with .NET C#). – Alex K Sep 29 '15 at 08:29
  • If ligatures are at play, the normal thing would be to replace two different characters by a single character in which the ligature is made, e.g. replace `et` by `&`. But sometimes a ligature is made by adding two characters that overlap. Anyway: it's hard to comment on this without knowing what exactly you're talking about. – Bruno Lowagie Sep 29 '15 at 08:55
  • I'm trying to read all the text from this PDF, filedropper.com/test23, and get the coordinates of each word. I managed to get all the letters, but some of them have the same coordinates and I can't assemble a word because I don't know where to put the letters with the same coordinates. I found the class HebrewProcessor. Is it somehow related to my solution? – Alex K Sep 29 '15 at 09:34