I am new to Apache PDFBox. Below is my code to extract all the text from a simple resume. It's working fine, and now I want to get the text together with its style information (font, bold, etc.), as well as the images. How do I do this?

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PdfExtract {

    public static void main(String[] args) throws IOException {
        // Load the PDF document from disk
        PDDocument pdf = PDDocument.load(new File("/home/praveen/Downloa/sampleresume.pdf"));
        try {
            // Extract the plain text of all pages
            PDFTextStripper stripper = new PDFTextStripper();
            String plainText = stripper.getText(pdf);
            System.out.println(plainText);
        } finally {
            // Release the resources held by the document
            pdf.close();
        }
    }
}
War10ck
praveen
  • You'll find some information on how to extract style information alongside the plain text in [this answer](http://stackoverflow.com/questions/20878170/how-to-determine-artificial-bold-style-artificial-italic-style-and-artificial-o/20924898#20924898). While that answer focuses on artificial styles (e.g. bold by drawing characters twice), you can use it to get the general idea: The regular font information is also contained in the `TextPosition` objects talked about there. – mkl Jan 29 '14 at 07:45
  • Images are not directly covered by the PDFTextStripper (it's a text stripper after all...) but may be analogously extracted using some other stripper based on the same base class `PDFStreamEngine` if you need the images including their position. If you merely need the non-inlined images without position, there are multiple SO answers showing how to get those images from the resources. How to extract "etc" depends on the very nature of that etc. – mkl Jan 29 '14 at 07:49
  • Can you give some sample code to do this? – praveen Jan 29 '14 at 08:35
  • Have a look at the answer I pointed to in my first comment. – mkl Jan 29 '14 at 12:07
  • I have seen it, but I don't understand it since I am new to this. Please explain here. – praveen Jan 30 '14 at 08:57
  • What are you not understanding? Please ask specific questions. [That answer](http://stackoverflow.com/questions/20878170/how-to-determine-artificial-bold-style-artificial-italic-style-and-artificial-o/20924898#20924898) shows how to use a replacement for the `PDFTextStripper` you already use for extracting information to deduce some artificial styles. [This other answer](http://stackoverflow.com/questions/21430341/identifying-the-text-based-on-the-output-in-pdf-using-pdfbox/21453780#21453780) shows how to extend it to also extract text color. Comparing those you should see a pattern... – mkl Jan 30 '14 at 10:40
  • Thanks for the idea, I got it now. – praveen Jan 31 '14 at 11:45
  • @War10ck Why did you edit this question only to make the code buggy? – mkl Mar 03 '14 at 15:45
  • @mkl Because it was an invalid edit. The OP entered the code as he/she had it. The edit that was made could have been the problem, or part of the problem that the OP was having. If you think the code was buggy, it should be fixed as an answer. Fixing the code in the question will invalidate it for future visitors. They won't be able to see what was wrong initially. If the OP made a typo, then he/she should say so or edit the code themselves. You can't assume however, that it is a typo. – War10ck Mar 03 '14 at 16:28
  • @War10ck *The OP entered the code as he/she had it* - No. He said his code was working fine but he wanted to improve it. The code he originally posted was not compilable. Thus, the edit made the code in the question match the code the OP actually used. The edit, therefore, was **not** invalid. – mkl Mar 03 '14 at 19:54
  • @mkl If that is the way you view it, then by all means edit the correction. I wasn't trying to cause an edit battle here. All I'm saying is the code as it appears now, is how the OP entered it. Now, he/she may have made a typo or edit mistake when they pasted it in, but what they pasted in was, as you stated, ***not*** compilable. There was no indication that the edit _"made the code in the question match the code the OP actually used."_. We can't draw this conclusion. From what was posted, the code didn't work, and therefore the OP's statement was false. – War10ck Mar 03 '14 at 20:07
  • @War10ck *I wasn't trying to cause an edit battle here* - neither was I. I just wondered about the motivation for the edit. Now I understand it, even though I don't share it. At best the OP should correct the code to match his actual code. – mkl Mar 03 '14 at 21:03
  • @mkl Agreed. My apologies for the confusion. – War10ck Mar 03 '14 at 21:06
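
Since the comments only point to other answers, here is a rough sketch of the post-processing step they hint at: once each piece of text is paired with the name of the font it was drawn in (in PDFBox that information hangs off the `TextPosition` objects, e.g. via `TextPosition.getFont()`), consecutive characters in the same font can be merged into runs and tagged as bold. Everything in it is hypothetical and stdlib-only: `FontRuns`, `Glyph`, `stripSubsetPrefix`, `looksBold` and `groupByFont` are illustrative names, not PDFBox API, and the bold check is a plain name heuristic that will miss artificial bold and fonts whose names don't say "bold".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of PDFBox: groups extracted text by font name.
final class FontRuns {

    /** One extracted piece of text plus the name of the font it was drawn with. */
    static final class Glyph {
        final String text;
        final String fontName;

        Glyph(String text, String fontName) {
            this.text = text;
            this.fontName = fontName;
        }
    }

    /** Strips a font subset tag such as "ABCDEF+" (six uppercase letters and a plus sign). */
    static String stripSubsetPrefix(String fontName) {
        if (fontName != null && fontName.matches("[A-Z]{6}\\+.+")) {
            return fontName.substring(7);
        }
        return fontName;
    }

    /** Heuristic only (an assumption): treats a font as bold if its name contains "bold". */
    static boolean looksBold(String fontName) {
        return fontName != null && fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    /** Formats the "[font] " or "[font, bold] " label that starts each run. */
    static String label(String fontName) {
        return "[" + fontName + (looksBold(fontName) ? ", bold" : "") + "] ";
    }

    /** Merges consecutive glyphs that share a font into one labelled run each. */
    static List<String> groupByFont(List<Glyph> glyphs) {
        List<String> runs = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        String currentFont = null;
        for (Glyph g : glyphs) {
            String font = g.fontName == null ? "unknown" : stripSubsetPrefix(g.fontName);
            // A font change closes the current run and starts a new one.
            if (currentFont != null && !currentFont.equals(font)) {
                runs.add(label(currentFont) + current);
                current.setLength(0);
            }
            currentFont = font;
            current.append(g.text);
        }
        if (currentFont != null) {
            runs.add(label(currentFont) + current);
        }
        return runs;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for what a text stripper would deliver.
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("J", "ABCDEF+Arial-BoldMT"),
                new Glyph("o", "ABCDEF+Arial-BoldMT"),
                new Glyph("h", "ArialMT"),
                new Glyph("i", "ArialMT"));
        for (String run : groupByFont(glyphs)) {
            System.out.println(run);
        }
        // prints:
        // [Arial-BoldMT, bold] Jo
        // [ArialMT] hi
    }
}
```

To use this with real data, the `Glyph` list would be filled from a `PDFTextStripper` subclass along the lines of the answers linked in the comments. The grouping is kept separate from PDFBox here so it can be tried out on its own.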
