
I'm trying to extract text from a rectangle with iTextSharp, and it works fine for almost all sections of the document, except for some specific areas. These areas are simple bold, all-caps titles and simple content in a slightly smaller font than the rest of the document (both uppercase). In these areas I get an anagram of the selected text instead of the correct words.

For example, the word "RELEASE" is read as "ERLEASE", "VOYAGE" becomes "EGAYVO", and the sentence "FURTHER CHARGES" becomes "FHTRU E R CHAGR E S".

The odd thing is that if I extract the full page with a SimpleTextExtractionStrategy, I obtain the correct text.

The PDF's font is plain Arial, and the strategy I used for the extraction is taken from Stack Overflow (rect is passed in as an argument):

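    ' Assumes Imports iTextSharp.text.pdf and iTextSharp.text.pdf.parser
    _pdfRd = New PdfReader(_pdfPath)
    Dim output As String()
    Dim nrPag As Integer = 1
    ' Keep only the text chunks that fall inside rect
    Dim filter As RenderFilter = New RegionTextRenderFilter(rect)
    Dim locStrategy As New LocationTextExtractionStrategy()
    ' FilteredTextRenderListener (not FilteredRenderListener) implements ITextExtractionStrategy
    Dim strategy As New FilteredTextRenderListener(locStrategy, filter)
    output = PdfTextExtractor.GetTextFromPage(_pdfRd, nrPag, strategy).Split(ControlChars.Lf)
    _pdfRd.Close()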

I tried other documents and it works very well; I'm not able to reproduce this issue with different documents.

I was worried about my code, so I tried this strategy too: http://www.schiffhauer.com/read-text-in-a-pdf-in-c-with-itextsharp/ but the result is the same.

Am I missing something in the read process, or is it a problem related to my PDF?

UPDATE: If I select a single letter of a faulty word, the output is an empty string. The same happens if I select several letters together; I get an (anagram) output only if I select the whole word. It's really odd. For example, given the words "CARGO RELEASE", if I select only "GO" or any other substring with a rectangle I get nothing, but if I select "CARGO" I obtain "GRACO ERLESAE", even though I selected only the first word and not the second.

  • My first guess is that the text is not 100% on the same height, which the default `LocationTextExtractionStrategy` is somewhat vulnerable to. So a bit of text which is placed higher will appear at the beginning of the output. Could you upload your input document somewhere? – blagae Jan 12 '16 at 11:14
  • I'm sorry but I can't upload the PDF; maybe I can upload a screenshot of it without sensitive data. Tell me if that would be helpful – Mattia Biggi Jan 12 '16 at 11:28
  • It is most likely that there's something with your PDF. Not necessarily something wrong, but something that the default implementation breaks on. Since you can't share the document, I can only suggest that you copy-paste the source code of `LocationTextExtractionStrategy` into your project, use that local strategy object, and modify the code (e.g. set DUMP_STATE to true) so you can get more info. – blagae Jan 12 '16 at 12:14
  • *I'm sorry but i can't upload the pdf* - you are dealing with an issue which seems specific to your very PDF, even only to specific sections of it. Thus, you can't seriously expect help without providing a sample document to reproduce the issue. That being said, if @blagae's assumption that the issue is about text not 100% on the same height is indeed your problem, the `HorizontalTextExtractionStrategy` from [this answer](http://stackoverflow.com/a/33697745/1729265) might help. – mkl Jan 12 '16 at 17:23
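
To illustrate the guess made in the comments above, here is a minimal sketch of how a location-based ordering can turn "RELEASE" into "ERLEASE". It is a simplified model, not the library's actual code: the `Chunk` type, the `Reassemble` function and the coordinates are all hypothetical, and the assumption is that chunks are ordered first by a truncated baseline height and then by horizontal position.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module BaselineOrderDemo
    ' Hypothetical stand-in for a text chunk: one glyph plus the start of its baseline.
    Class Chunk
        Public ReadOnly Text As String
        Public ReadOnly X As Single
        Public ReadOnly Y As Single

        Public Sub New(chunkText As String, startX As Single, baselineY As Single)
            Text = chunkText
            X = startX
            Y = baselineY
        End Sub
    End Class

    ' Assumed ordering: higher (truncated) baseline first, then left to right.
    Function Reassemble(chunks As IEnumerable(Of Chunk)) As String
        Dim ordered = chunks.OrderByDescending(Function(c) Math.Floor(c.Y)).ThenBy(Function(c) c.X)
        Return String.Concat(ordered.Select(Function(c) c.Text))
    End Function

    Sub Main()
        ' The second letter sits a fraction of a unit higher than the others,
        ' which is enough to cross the truncation boundary.
        Dim word As New List(Of Chunk) From {
            New Chunk("R", 10.0F, 699.9F),
            New Chunk("E", 20.0F, 700.0F),
            New Chunk("L", 30.0F, 699.9F),
            New Chunk("E", 40.0F, 699.9F),
            New Chunk("A", 50.0F, 699.9F),
            New Chunk("S", 60.0F, 699.9F),
            New Chunk("E", 70.0F, 699.9F)
        }
        Console.WriteLine(Reassemble(word)) ' prints ERLEASE
    End Sub
End Module
```

If the real PDF positions each glyph of the affected titles individually with slightly different baselines, an ordering of this kind would explain the anagrams; dumping the chunk coordinates, as suggested above, is the way to confirm it.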

1 Answer


Have you tried customizing the working SimpleTextExtractionStrategy so that it processes only the rectangle instead of the full page?

You can find the full code in the GitHub project here: https://github.com/itext/itextsharp/blob/75f05dd7d87797b86c44649f5f96df2d90d730e8/src/extras/itextsharp.tests/iTextSharp/text/pdf/parser/SimpleTextExtractionStrategyTest.cs
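
One possible way to do this, sketched here rather than taken from the linked test, is to keep the region filter from the question and swap the wrapped strategy for a `SimpleTextExtractionStrategy`. The `ExtractRegion` name and its parameters are hypothetical, and the sketch assumes iTextSharp 5.x with `rect` given as an `iTextSharp.text.Rectangle`.

```vb
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Module RegionExtraction
    ' Hypothetical helper: returns the text that SimpleTextExtractionStrategy
    ' produces from the chunks accepted by the region filter on one page.
    Function ExtractRegion(pdfPath As String,
                           pageNumber As Integer,
                           rect As iTextSharp.text.Rectangle) As String
        Dim reader As New PdfReader(pdfPath)
        Try
            ' Same filter as in the question: only chunks inside rect pass through.
            Dim filter As RenderFilter = New RegionTextRenderFilter(rect)
            ' Wrap the simple strategy, which keeps content-stream order,
            ' instead of the location-based one, which reorders by position.
            Dim strategy As New FilteredTextRenderListener(New SimpleTextExtractionStrategy(), filter)
            Return PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy)
        Finally
            reader.Close()
        End Try
    End Function
End Module
```

Since the simple strategy gives correct text for the full page, it may also give correct text for the region, but that depends on how the filter treats the chunks in this particular PDF.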

Stingi