
I'm trying to extract the text that lies inside Square annotations of a PDF. I use the code below, which is based on PDFBox:

try {
    PDDocument document = null;
    try {
        document = PDDocument.load(new File("//Users//" + usr + "//Desktop//BoldTest2 2.pdf"));
        List allPages = document.getDocumentCatalog().getAllPages();
        for (int i = 0; i < allPages.size(); i++) {
            PDPage page = (PDPage) allPages.get(i);
            // Fonts declared in the page resources (not used further below).
            Map<String, PDFont> pageFonts = page.getResources().getFonts();
            List<PDAnnotation> la = page.getAnnotations();
            for (int f = 0; f < la.size(); f++) {
                PDAnnotation pdfAnnot = la.get(f);
                PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                stripper.setSortByPosition(true);
                PDRectangle rect = pdfAnnot.getRectangle();

                float x = 0;
                float y = 0;
                float width = 0;
                float height = 0;
                int rotation = page.findRotation();

                // Only unrotated pages are handled; otherwise the region stays empty.
                if (rotation == 0) {
                    x = rect.getLowerLeftX();
                    y = rect.getUpperRightY() - 2;
                    width = rect.getWidth();
                    height = rect.getHeight();
                    // Flip y: annotation rectangles are measured from the bottom of
                    // the page, the region is measured from the top.
                    PDRectangle pageSize = page.findMediaBox();
                    y = pageSize.getHeight() - y;
                }
                Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
                stripper.addRegion(Integer.toString(f), awtRect);
                stripper.extractRegions(page);
                // Own helper class, not used further below.
                PrintTextLocation2 prt = new PrintTextLocation2();
                // Null-safe comparison: keep only the text of Square annotations.
                if ("Square".equals(pdfAnnot.getSubtype())) {
                    testTxt = testTxt + "\n " + stripper.getTextForRegion(Integer.toString(f));
                }
            }
        }
    } catch (Exception ex) {
        ex.printStackTrace(); // do not swallow errors silently
    } finally {
        if (document != null) {
            document.close();
        }
    }
} catch (Exception ex) {
    ex.printStackTrace(); // close() itself may fail
}

With this code I only get the plain text of the PDF. How can I also get the font information, such as bold and italic, together with the text? Advice or references are highly appreciated.
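The coordinate flip inside the loop can be isolated into a small helper, which makes it easier to check on its own. This is only a minimal sketch, assuming an unrotated page whose media box starts at (0, 0); the class and method names (`AnnotationRegions`, `toTopLeftOrigin`) are made up for illustration and are not part of PDFBox.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper, not part of PDFBox. */
final class AnnotationRegions {
    private AnnotationRegions() {}

    /**
     * Converts a rectangle given in PDF user space (origin at the lower-left
     * corner of the page, y growing upwards) into a rectangle whose origin is
     * the top-left corner of the page, y growing downwards.
     * Assumes rotation 0 and a media box that starts at (0, 0).
     */
    static Rectangle2D.Float toTopLeftOrigin(float lowerLeftX, float upperRightY,
                                             float width, float height,
                                             float pageHeight) {
        // The top edge of the rectangle, measured from the top of the page.
        float top = pageHeight - upperRightY;
        return new Rectangle2D.Float(lowerLeftX, top, width, height);
    }
}
```

In the loop this would be called with `rect.getLowerLeftX()`, `rect.getUpperRightY()`, `rect.getWidth()`, `rect.getHeight()` and the media box height; the extra offset of 2 used above is left out here.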

Yehia Awad
chinna_82
  • Have a look at [this answer](http://stackoverflow.com/questions/20878170/how-to-determine-artificial-bold-style-artificial-italic-style-and-artificial-o/20924898#20924898) to see the *general procedure* (deriving from `PDFTextStripper` and overriding `writeString`) and the current issue with it. The `TextPosition` instances given to that method contain some information about the font and the rest of the current state while drawing the text. Whether you have to derive the style information from the font itself or from some graphics state depends on how the style is generated. – mkl Jan 06 '14 at 09:29

1 Answer


The PDFTextStripper, which PDFTextStripperByArea extends, normalizes the text, i.e. it removes the formatting (cf. its JavaDoc comment):

* This class will take a pdf document and strip out all of the text and ignore the
* formatting and such.

If you look at the source, you will see that the font information is available in this class, but it is normalized out before printing:

protected void writePage() throws IOException
{
    [...]
        List<TextPosition> line = new ArrayList<TextPosition>();
        [...]
            if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))
            {
                writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);
                line.clear();
                [...]
            }
    [...]
}

At this point the TextPosition instances in the list still carry the formatting information. A solution therefore has to redefine one of the existing methods so that this information is used before normalization discards it. I am listing a few options below:

  • private List normalize(List line, boolean isRtlDominant, boolean hasRtl)

If you want your own normalize method, you cannot simply override it, because it is private. You can instead copy the whole PDFTextStripper class into your project and change the code of the copy. Let's call this new class MyPDFTextStripper and define the new method there as required. Similarly, copy PDFTextStripperByArea as MyPDFTextStripperByArea and let it extend MyPDFTextStripper.

  • protected void writePage()

If you just need a new writePage method, you can simply extend PDFTextStripper and override this method, since it is protected. Then create MyPDFTextStripperByArea as described above.

  • writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant)

Another solution might be to override the writeLine method, store the pre-normalization information (the TextPosition list) in some variable, and use it when the line is written.
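Whichever option you pick, the remaining step is to turn the per-character font information into style markers in the output. The sketch below shows only that step and deliberately does not use PDFBox: `Glyph` is a hypothetical stand-in for the text and font name you would read from each TextPosition, and `StyledRuns` is a made-up name. It also assumes that the style is visible in the font name (e.g. "Helvetica-Bold"), which is only a heuristic; artificial bold or italic produced through the graphics state would not be detected this way.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for the data taken from one TextPosition. */
final class Glyph {
    final String text;
    final String fontName;

    Glyph(String text, String fontName) {
        this.text = text;
        this.fontName = fontName;
    }
}

/** Hypothetical helper, not part of PDFBox. */
final class StyledRuns {
    private StyledRuns() {}

    /** Heuristic: the font name mentions "bold". */
    static boolean isBold(String fontName) {
        return fontName != null && fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    /** Heuristic: the font name mentions "italic" or "oblique". */
    static boolean isItalic(String fontName) {
        if (fontName == null) {
            return false;
        }
        String n = fontName.toLowerCase(Locale.ROOT);
        return n.contains("italic") || n.contains("oblique");
    }

    /** Wraps bold runs in <b>...</b> and italic runs in <i>...</i>, properly nested. */
    static String markUp(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        boolean bold = false;
        boolean italic = false;
        for (Glyph g : line) {
            boolean b = isBold(g.fontName);
            boolean i = isItalic(g.fontName);
            // <i> is always the inner tag, so close it whenever bold changes too.
            if (italic && (!i || b != bold)) { out.append("</i>"); italic = false; }
            if (bold && !b) { out.append("</b>"); bold = false; }
            if (!bold && b) { out.append("<b>"); bold = true; }
            if (!italic && i) { out.append("<i>"); italic = true; }
            out.append(g.text);
        }
        if (italic) { out.append("</i>"); }
        if (bold) { out.append("</b>"); }
        return out.toString();
    }

    /** Keeps only the text whose font looks bold. */
    static String onlyBold(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        for (Glyph g : line) {
            if (isBold(g.fontName)) {
                out.append(g.text);
            }
        }
        return out.toString();
    }
}
```

In an overridden method you would build the `Glyph` list from the TextPosition instances of the current line, presumably from the character and the font's base font name, and write the result of `markUp` instead of the normalized line. `onlyBold` shows the same idea used as a filter.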

Hope this helps.

Salil
  • Depending on which of the mentioned approaches you take, you may be subject to the PDFBox issue [PDFBOX-1804](https://issues.apache.org/jira/browse/PDFBOX-1804) which presumably will be resolved as of version 1.8.4. – mkl Jan 07 '14 at 10:18
  • Do you have any idea how to keep, let's say, only the "bold" feature of the text? – Darius Miliauskas Jan 15 '15 at 03:07