PdfBox extract text with same font-family from pdf

Question

I need to extract a block of text from a PDF. What distinguishes this text is that it all uses the same font family. Any ideas? Cheers

Edit: Let me ask the question another way: how can I extract just the bold text from a PDF page?

You can derive your own text extracting class from `PDFTextStripper` and therein filter the data to be added to the extracted text. Depending on your source PDF, though, the actual issue might be to *recognize* bold written text. Sometimes it is easy if an actual bold font is used which announces its boldness. Sometimes, though, fonts don't tell, and sometimes mechanisms are used to emulate boldness, e.g. double drawing with a small offset or drawing using a larger stroke value. I'm not sure whether PDFBox recognizes all these techniques out of the box. — mkl, Sep 19 '13 at 07:12
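To make the "recognize bold" part of the comment above concrete, here is a minimal sketch of the decision logic only. `BoldHeuristic`, `Run`, `looksBold` and `boldText` are hypothetical names, not PDFBox classes, and the sketch is plain Java so it runs without PDFBox. The assumption is that, inside a `PDFTextStripper` subclass (for example in an overridden `writeString(String, List<TextPosition>)` in PDFBox 2.x), the three inputs would come from `TextPosition.getFont()` and its font descriptor (`getFontName()`, `getFontWeight()`, `isForceBold()`). It only covers fonts that announce their boldness; emulated bold (double drawing, thicker stroke) is not detected.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of PDFBox: decides boldness from font metadata only. */
final class BoldHeuristic {
    private BoldHeuristic() {}

    /** One run of text plus the font metadata it was drawn with (illustrative type). */
    static final class Run {
        final String text;
        final String fontName;   // may be null, e.g. "ABCDEF+Arial-BoldMT"
        final float fontWeight;  // 0 if the PDF does not state a weight
        final boolean forceBold; // ForceBold flag of the font descriptor

        Run(String text, String fontName, float fontWeight, boolean forceBold) {
            this.text = text;
            this.fontName = fontName;
            this.fontWeight = fontWeight;
            this.forceBold = forceBold;
        }
    }

    /** True if the font announces itself as bold via flag, weight (>= 700) or name. */
    static boolean looksBold(String fontName, float fontWeight, boolean forceBold) {
        if (forceBold || fontWeight >= 700f) {
            return true;
        }
        if (fontName == null) {
            return false;
        }
        // Also matches names such as "SemiBold" or "DemiBold"; tighten if that is unwanted.
        return fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    /** Concatenates the text of all runs whose font looks bold, in order. */
    static String boldText(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            if (looksBold(r.fontName, r.fontWeight, r.forceBold)) {
                sb.append(r.text);
            }
        }
        return sb.toString();
    }
}
```

In a real stripper you would probably call something like `looksBold` per `TextPosition` and only pass the matching characters on to the output. Whether the font name alone is reliable depends on how the PDF was produced, so treat this as a starting point rather than a complete solution.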

score 0 · Answer 1 · answered Jan 14 '14 at 21:57

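// Imports for the enclosing class (PDFBox 1.x package names):
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

// Extracts all text from a PDF; returns null if the file is missing or unreadable.
public String pdftoText(String fileName) {
    File f = new File(fileName);
    if (!f.isFile()) {
        System.out.println("File does not exist.");
        return null;
    }
    PDDocument pdDoc = null;
    try {
        PDFParser parser = new PDFParser(new FileInputStream(f));
        parser.parse();
        COSDocument cosDoc = parser.getDocument();
        pdDoc = new PDDocument(cosDoc);
        PDFTextStripper pdfStripper = new PDFTextStripper();
        return pdfStripper.getText(pdDoc);
    } catch (IOException ex) {
        Logger.getLogger(PDFTextParser.class.getName()).log(Level.SEVERE, null, ex);
        return null;
    } finally {
        // Closing the PDDocument also closes the COSDocument it wraps.
        if (pdDoc != null) {
            try { pdDoc.close(); } catch (IOException ignored) { }
        }
    }
}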

Before running, add pdfbox.jar to your project's classpath.

This does not at all check font characteristics like **Bold** as requested by the op, does it? — mkl, Jan 15 '14 at 05:45
http://stackoverflow.com/questions/19770987/how-to-extract-bold-text-from-pdf-using-pdfbox — Vahap Gencdal, Jan 15 '14 at 07:10
