
I need to extract a block of text from a PDF. The text I want shares one characteristic: it all uses the same font family. Any ideas? Cheers

Edit: Let me ask the question another way: how can I extract just the bold text from a PDF page?

user1677293
  • You can derive your own text extracting class from `PDFTextStripper` and therein filter the data to be added to the extracted text. Depending on your source PDF, though, the actual issue might be to *recognize* bold written text. Sometimes it is easy if an actual bold font is used which announces its boldness. Sometimes, though, fonts don't tell, and sometimes mechanisms are used to emulate boldness, e.g. double drawing with a small offset or drawing using a larger stroke value. I'm not sure whether PDFBox recognizes all these techniques out of the box. – mkl Sep 19 '13 at 07:12
  • Have you found a solution for this? – chinna_82 Jan 02 '14 at 08:24
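The "easy" case from the comment above, where an actual bold font announces its boldness, can be sketched as a check on the font name. This is only a minimal sketch under stated assumptions: `BoldFontHeuristic`, `stripSubsetTag` and `looksBold` are hypothetical names, not part of PDFBox, and the check does nothing for emulated bold (double drawing or a thicker stroke). It also assumes embedded subset fonts carry the usual six-uppercase-letter tag followed by `+` in front of the real name, which is removed first so the tag itself cannot trigger a false match.

```java
import java.util.Locale;

public class BoldFontHeuristic {

    // Removes a subset tag such as "ABCDEF+" from the front of a font name, if present.
    static String stripSubsetTag(String fontName) {
        if (fontName.length() > 7 && fontName.charAt(6) == '+'
                && fontName.substring(0, 6).chars().allMatch(c -> c >= 'A' && c <= 'Z')) {
            return fontName.substring(7);
        }
        return fontName;
    }

    // Heuristic (an assumption, not a guarantee): the font is treated as bold
    // if its name, without the subset tag, contains "bold" in any letter case.
    static boolean looksBold(String fontName) {
        if (fontName == null) {
            return false;
        }
        String name = stripSubsetTag(fontName).toLowerCase(Locale.ROOT);
        return name.contains("bold");
    }

    public static void main(String[] args) {
        System.out.println(looksBold("ABCDEF+Arial-BoldMT")); // prints true
        System.out.println(looksBold("Times-Roman"));         // prints false
    }
}
```

In a class derived from `PDFTextStripper`, a check like this would be applied to the font name of each text position before the text is added to the output. How that name is obtained depends on the PDFBox version, so it is left out of the sketch.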

1 Answer

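import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

// Returns all text of the PDF (PDFBox 1.x API), or null on failure.
public String pdftoText(String fileName) {
    File f = new File(fileName);
    if (!f.isFile()) {
        System.out.println("File does not exist.");
        return null;
    }
    PDDocument pdDoc = null;
    try {
        PDFParser parser = new PDFParser(new FileInputStream(f));
        parser.parse();
        COSDocument cosDoc = parser.getDocument();
        pdDoc = new PDDocument(cosDoc);
        PDFTextStripper pdfStripper = new PDFTextStripper();
        return pdfStripper.getText(pdDoc);
    } catch (IOException ex) {
        Logger.getLogger(PDFTextParser.class.getName()).log(Level.SEVERE, null, ex);
        return null;
    } finally {
        if (pdDoc != null) {
            try {
                pdDoc.close(); // also closes the underlying COSDocument
            } catch (IOException ignored) {
                // nothing useful to do if closing fails
            }
        }
    }
}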

Before running, add pdfbox.jar to your project.

Vahap Gencdal