Extracting text from PDF files using PDFBox

Question

I am using the following code to get the whole textual content of a PDF file with PDFBox:

    // Assumes `document` is a PDDocument that has been loaded elsewhere.
    private static void textExtraction() throws IOException
    {
        String encoding = null;
        String outputFile = "path";

        Writer output = new OutputStreamWriter(new FileOutputStream(outputFile));
        try {
            PDFTextStripper stripper = new PDFTextStripper(encoding);
            stripper.writeText(document, output);
        } finally {
            output.close(); // flush buffered text and release the file
        }
    }

This code works perfectly fine. But how can I extract text and also know where it is located? For example, I want to extract the text page by page and write each page to a different file, or search for a keyword and extract the passages in which it occurs, together with the position in the document where each one occurs.
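
To make the keyword part of this concrete, here is a rough sketch in plain Java of what "telling me where it happens" could look like once the text is available per page. It is not PDFBox code: `KeywordLocator`, `Hit`, and `find` are hypothetical names made up for illustration, and the sketch assumes the per-page strings have already been extracted (with PDFBox, probably by calling `setStartPage(p)` and `setEndPage(p)` on the `PDFTextStripper` before `getText(document)`).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: finds a keyword in per-page text and reports where it occurs. */
class KeywordLocator {

    /** One occurrence: 1-based page number, character offset in that page's text, and a snippet. */
    static final class Hit {
        final int page;
        final int offset;
        final String context;

        Hit(int page, int offset, String context) {
            this.page = page;
            this.offset = offset;
            this.context = context;
        }

        @Override
        public String toString() {
            return "page " + page + ", offset " + offset + ": " + context;
        }
    }

    /**
     * Case-insensitive search over page texts; pageTexts.get(0) is taken to be page 1.
     * contextChars is the number of characters kept on each side of a match.
     */
    static List<Hit> find(List<String> pageTexts, String keyword, int contextChars) {
        List<Hit> hits = new ArrayList<>();
        int len = keyword.length();
        if (len == 0) {
            return hits; // an empty keyword would match everywhere
        }
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i);
            int at = 0;
            while (at + len <= text.length()) {
                if (text.regionMatches(true, at, keyword, 0, len)) {
                    int start = Math.max(0, at - contextChars);
                    int end = Math.min(text.length(), at + len + contextChars);
                    hits.add(new Hit(i + 1, at, text.substring(start, end)));
                    at += len; // continue after this match
                } else {
                    at++;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in strings; in practice these would be the extracted page texts.
        List<String> pages = Arrays.asList(
                "PDFBox extracts text.", "Nothing here.", "Text again: text.");
        for (Hit hit : find(pages, "text", 10)) {
            System.out.println(hit);
        }
    }
}
```

The offsets here are positions in the extracted string, not coordinates on the page; getting on-page coordinates would need the stripper's lower-level text position data instead.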

see: http://www.programming-free.com/2012/11/simple-word-search-in-pdf-files-using.html , http://stackoverflow.com/questions/5527868/exact-phrase-search-using-lucene — nyxaria, Feb 01 '15 at 12:41
You might want to look [here] (http://stackoverflow.com/a/25290318/1729265) to see how you can meddle with the text eventually extracted by a TextStripper. This can e.g. be used for context related filtering. — mkl, Feb 01 '15 at 21:44
@mkl Did you mean to send two examples or just one? I can't tell, since your `[here]` link didn't work. — user3049183, Feb 02 '15 at 05:31
@user3049183 Only one. The mistake was the space character between ']' and '('. Without it, the ''here'' would have been linked to the URL that follows. — mkl, Feb 02 '15 at 09:19
@mkl Alright, that example was nice. At least I now have an idea of where to look for more detail on text inspection. But I have a question: does PDFBox detect titles, or should I just assume that text in a bold font could be a title? I have a feeling that things like `JKZAML+Arial-BoldMT` carry some hidden information. Or are they just random characters jammed together? I know `Arial` is a font and `Bold` is the style of the font, but what about the `MT` and the `JKZAML`? — user3049183, Feb 02 '15 at 10:23
*does PDFBox detect titles* - No. PDFs do not necessarily mark title texts as *titles*; often title text is merely text in a different font. If your PDF is properly tagged, you could look at the tags in question, but generally that is not the case... *things like JKZAML+Arial-BoldMT carry some hidden information. Or are they just random characters jammed together?* - Such six-random-letters-and-plus prefixes only indicate that the font in question is not embedded in the PDF as a whole but only as a subset. This does not single out a title font. — mkl, Feb 02 '15 at 12:12
@mkl so you mean that could be any six random letters for any pdf? — user3049183, Feb 02 '15 at 12:25
More exactly the specification calls that **a tag followed by a plus sign (+). The tag shall consist of exactly six uppercase letters; the choice of letters is arbitrary, but different subsets in the same PDF file shall have different tags.** — mkl, Feb 02 '15 at 13:15
@mkl Got it. How about the `MT` after the `Bold`? I tried different PDFs: some don't have any characters after `Bold` (or whatever the style is), and in some the name isn't even complete, e.g. just `medi` or `Regu` instead of the full style name. Why is that? It makes it confusing to write code which fits all cases. — user3049183, Feb 02 '15 at 15:18
*How about the MT after the Bold?* - That's the name of the font, [Arial-BoldMT](http://www.fontslog.com/arial-boldmt-otf-15234.htm)... *code which fits all cases* - probably impossible... There are conventions for font names, but they are not hard rules. Thus, there will always be someone naming a font in a way that does not match your expectations. — mkl, Feb 02 '15 at 15:51
@mkl so, how is it possible to detect titles in this case? just by detecting Bold? — user3049183, Feb 07 '15 at 02:25
Incorrect a representative collection of PDFs and find criteria to recognize their titles. There is no criterion recognizing all titles without error. — mkl, Feb 07 '15 at 07:56
"Incorrect" should have been "inspect"... Smart phone keyboards. — mkl, Feb 07 '15 at 09:34
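
The subset-tag rule quoted in the comments (exactly six uppercase letters followed by a plus sign) can be checked mechanically. Below is a minimal sketch in plain Java; `FontNames` and its methods are hypothetical names introduced for illustration, not part of PDFBox, and the bold check is only a naming-convention heuristic which, as the comments point out, will not hold for every font.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical helper for font names such as "JKZAML+Arial-BoldMT". */
class FontNames {

    // Subset tag: exactly six uppercase letters followed by '+', at the start of the name.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    /** True if the name starts with a subset tag, i.e. the font is embedded as a subset. */
    static boolean isSubset(String fontName) {
        return SUBSET_TAG.matcher(fontName).find();
    }

    /** Returns the name without the subset tag, e.g. "Arial-BoldMT". */
    static String baseName(String fontName) {
        return SUBSET_TAG.matcher(fontName).replaceFirst("");
    }

    /** Heuristic only: relies on the font name following the usual naming convention. */
    static boolean looksBold(String fontName) {
        return baseName(fontName).toLowerCase(Locale.ROOT).contains("bold");
    }

    public static void main(String[] args) {
        String name = "JKZAML+Arial-BoldMT";
        System.out.println(isSubset(name));
        System.out.println(baseName(name));
        System.out.println(looksBold(name));
    }
}
```

Stripping the tag before looking for "bold" matters because the tag letters are arbitrary: a name like `BOLDAB+Times-Roman` would otherwise be misread as a bold font.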
