
I'm trying to extract text from a rotated PDF page: the page contains a "/Rotate 90" entry. This means the page is rotated when displayed, but it does not seem to be rotated when extracting text with PdfTextExtractor and LocationTextExtractionStrategy. I followed the example by Mr. Lowagie on this link.

I tried rotating the area instead of the page, but it seems to extract the whole text block as one piece instead of only the selected area.

I'm using iText 5.5.12 with Java 1.8

How can I rotate the page for extraction?

Update

The code I use is like this:

    PdfReader reader = null;
    try {
        reader = new PdfReader("C:\\Temp\\rotated.pdf");
        Rectangle rect = new Rectangle(480, 484, 576, 525);
        final Rectangle pageRect = reader.getPageSize(1);
        RenderFilter regionFilter = new RegionTextRenderFilter(rect);
        TextExtractionStrategy strategy = new FilteredTextRenderListener(
                new LocationTextExtractionStrategy(), regionFilter);
        System.out.println(">>" + PdfTextExtractor.getTextFromPage(reader, 1, strategy).trim());
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (reader != null) {
            reader.close();
        }
    }

I can't find a way to upload an example PDF here, so I attached this image taken from GIMP with the selected area. The PDF was created with the LibreOffice export function and then manually edited to add the /Rotate entry.

The given coordinates assume the zero point is at the lower-right corner.

The program output is an empty string.

enter image description here

fante76
  • Can you show the code which failed you and the PDF in question? – mkl Dec 05 '17 at 15:02
  • *"I can't find a way to upload here an example PDF."* - Usually people use public googledrive or dropbox shares and post the URL here; other ad-free file sharing services might also be acceptable. – mkl Dec 05 '17 at 17:24
  • I'm afraid, though, you will have to calculate the rotation into your coordinates yourself: the parsing framework explicitly uses the original coordinates from the page content stream and does not calculate the **Rotate** value into it. – mkl Dec 05 '17 at 17:27
  • I already applied the rotation to the area, but it seems to extract the whole block of text even if I select only the word "text": the result is "This is a text". It's as if iText finds the chunk of text but cannot apply the filter because of the rotation. I tried with PDFBox and it correctly applies the rotation, so I think it's an iText omission. – fante76 Dec 06 '17 at 08:09
  • This might be due to iText by default extracting text by the chunks used as parameters of the PDF text drawing instructions while PDFBox always cuts down these chunks to the individual glyphs. Thus, when filtering for regions with iText, you get all the chunks intersecting the region *including the parts exceeding it!* One can tell iText to also split the chunks using classes like the `TextRenderInfoSplitter` as explained [in this answer](https://stackoverflow.com/a/21023311/1729265). Beware: As you have not shared your PDF yet, I cannot tell whether this really solves your issue. – mkl Dec 06 '17 at 09:23
  • The information from the link you suggested solved my problem. Anyway, iText should consider bringing those corrections into its code. Thank you very much – fante76 Dec 06 '17 at 10:37
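
For readers who need to calculate the rotation into the coordinates themselves, as suggested in the comments, here is a minimal sketch in plain Java. It is not iText code: `RotatedRegion`, `toUnrotated` and `mapPoint` are hypothetical names introduced only for illustration. It assumes that /Rotate is a clockwise multiple of 90 degrees, that the page box has its lower-left corner at (0, 0), and that the region was measured on the displayed (rotated) page with the origin at its lower-left corner. The width and height passed in are those of the unrotated page.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of iText): maps a region from displayed to unrotated page coordinates. */
final class RotatedRegion {

    /**
     * Maps a rectangle given in displayed (rotated) coordinates back to the
     * unrotated coordinate system used by the page content stream.
     *
     * @param rotation   the page's /Rotate value in degrees (clockwise, multiple of 90)
     * @param pageWidth  width of the unrotated page
     * @param pageHeight height of the unrotated page
     * @return {llx, lly, urx, ury} in unrotated coordinates
     */
    static float[] toUnrotated(float llx, float lly, float urx, float ury,
                               int rotation, float pageWidth, float pageHeight) {
        float[] a = mapPoint(llx, lly, rotation, pageWidth, pageHeight);
        float[] b = mapPoint(urx, ury, rotation, pageWidth, pageHeight);
        // The corners may swap roles after the mapping, so normalize them.
        return new float[] {
            Math.min(a[0], b[0]), Math.min(a[1], b[1]),
            Math.max(a[0], b[0]), Math.max(a[1], b[1])
        };
    }

    /** Maps a single displayed point (x, y) to unrotated coordinates. */
    static float[] mapPoint(float x, float y, int rotation, float w, float h) {
        int r = ((rotation % 360) + 360) % 360; // normalize, e.g. -90 becomes 270
        switch (r) {
            case 0:   return new float[] {x, y};
            case 90:  return new float[] {w - y, x};
            case 180: return new float[] {w - x, h - y};
            case 270: return new float[] {y, h - x};
            default:
                throw new IllegalArgumentException("Rotation must be a multiple of 90: " + rotation);
        }
    }

    public static void main(String[] args) {
        // Illustrative numbers only: unrotated page 200 x 100, /Rotate 90.
        float[] region = toUnrotated(10f, 20f, 30f, 50f, 90, 200f, 100f);
        System.out.println(Arrays.toString(region));
    }
}
```

The four resulting values could then be used to build the `Rectangle` handed to `RegionTextRenderFilter` in the question's code. This only addresses the coordinate mapping; as noted in the comments, the filter still works on whole text chunks unless they are split into glyphs first.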
