I am using the following code to extract the full text content of a PDF file with PDFBox:

    private static void textExtraction() throws IOException {
        String encoding = null;
        String outputFile = "path";

        // `document` is a PDDocument that was opened elsewhere in the class.
        // try-with-resources flushes and closes the writer when extraction is done.
        try (Writer output = new OutputStreamWriter(new FileOutputStream(outputFile))) {
            PDFTextStripper stripper = new PDFTextStripper(encoding);
            stripper.writeText(document, output);
        }
    }

This code works fine. My question is: how can I extract text and also know where it is located? For example, I want to extract the text page by page and write each page to a separate file, or search for a keyword, extract the passages in which it occurs, and report where each occurrence is.
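
To make the keyword part of the question concrete, here is a minimal sketch in plain Java. It assumes the text of each page has already been extracted into its own string (with PDFBox, `PDFTextStripper` has `setStartPage` and `setEndPage`, which restrict `getText(document)` to a page range). The class `KeywordLocator`, its `Hit` type, and the `contextChars` parameter are hypothetical names used only for illustration; they are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: find every occurrence of a keyword in a list of per-page texts. */
class KeywordLocator {

    /** One occurrence: 1-based page number, 0-based offset in that page's text, nearby text. */
    static final class Hit {
        final int page;
        final int offset;
        final String context;

        Hit(int page, int offset, String context) {
            this.page = page;
            this.offset = offset;
            this.context = context;
        }
    }

    /**
     * Case-insensitive, non-overlapping search.
     * pageTexts.get(0) is assumed to hold the text of page 1, and so on.
     */
    static List<Hit> find(List<String> pageTexts, String keyword, int contextChars) {
        List<Hit> hits = new ArrayList<>();
        int n = keyword.length();
        if (n == 0) {
            return hits; // an empty keyword would match everywhere
        }
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i);
            int at = 0;
            while (at + n <= text.length()) {
                if (text.regionMatches(true, at, keyword, 0, n)) {
                    int start = Math.max(0, at - contextChars);
                    int end = Math.min(text.length(), at + n + contextChars);
                    hits.add(new Hit(i + 1, at, text.substring(start, end)));
                    at += n; // continue after this match
                } else {
                    at++;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in page texts; in practice these would come from the PDF.
        List<String> pages = Arrays.asList("PDFBox extracts text.", "No match here.");
        for (Hit h : find(pages, "pdfbox", 5)) {
            System.out.println("page " + h.page + ", offset " + h.offset + ": " + h.context);
        }
    }
}
```

Writing each page to its own file would follow the same per-page loop, with one output file per page string. Note that the offsets here refer to positions in the extracted text, not to coordinates on the PDF page.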

user3049183
  • see: http://www.programming-free.com/2012/11/simple-word-search-in-pdf-files-using.html , http://stackoverflow.com/questions/5527868/exact-phrase-search-using-lucene – nyxaria Feb 01 '15 at 12:41
  • You might want to look [here] (http://stackoverflow.com/a/25290318/1729265) to see how you can meddle with the text eventually extracted by a TextStripper. This can e.g. be used for context related filtering. – mkl Feb 01 '15 at 21:44
  • @mkl did you mean to send one example or two? I can't tell, since your `[here]` link didn't work. – user3049183 Feb 02 '15 at 05:31
  • @user3049183 Only one. The mistake was the space character between ']' and '('. Without it the ''here'' would have been linked to the URL thereafter. – mkl Feb 02 '15 at 09:19
  • @mkl alright, that example was nice. At least I got an idea of where I should look for more detail on text inspection. But there is a question: do the pdfbox detects titles, or should I just consider that a bold font could be a title? I have a feeling that things like `JKZAML+Arial-BoldMT` should carry some hidden information? or is it just some random characters which are jammed together? I do know `Arial` is a font and `Bold` is the style of the font, but what about the `MT` and also `JKZAML`? – user3049183 Feb 02 '15 at 10:23
  • *do the pdfbox detects titles* - No. PDFs do not necessarily mark title texts as *titles*, often title text is merely text in a different font. If your PDF is properly tagged, you could look at the tags in question, but generally that is not the case... *JKZAML+Arial-BoldMT should carry some hidden information? or is it just some random characters which are jammed together?* - Such six-random-letters-and-plus prefixes only indicate that the font in question is not embedded in the PDF as a whole but only as a subset. This does not single out a title font. – mkl Feb 02 '15 at 12:12
  • @mkl so you mean that could be any six random letters for any pdf? – user3049183 Feb 02 '15 at 12:25
  • More exactly the specification calls that **a tag followed by a plus sign (+). The tag shall consist of exactly six uppercase letters; the choice of letters is arbitrary, but different subsets in the same PDF file shall have different tags.** – mkl Feb 02 '15 at 13:15
  • @mkl got it. how about the `MT` after the `Bold`? I tried different PDFs: some don't have any characters after `Bold` or whatever the font style is, and some names are not even complete, e.g. just `medi` or `Regu` instead of the full style name. Why is that so? It makes it confusing to program a code which fits for all cases – user3049183 Feb 02 '15 at 15:18
  • *how about the MT after the Bold?* - That's the name of the font, [Arial-BoldMT](http://www.fontslog.com/arial-boldmt-otf-15234.htm)... *a code which fits for all cases* - probably impossible... there are conventions for font names, but they are not hard rules. Thus, there will always be someone naming a font in a way that does not match your expectations. – mkl Feb 02 '15 at 15:51
  • @mkl so, how is it possible to detect titles in this case? just by detecting Bold? – user3049183 Feb 07 '15 at 02:25
  • Incorrect a representative collection of PDFs and find criteria to recognize their titles. There is no criterion recognizing all titles without error. – mkl Feb 07 '15 at 07:56
  • "Incorrect" should have been "inspect"... Smart phone keyboards. – mkl Feb 07 '15 at 09:34
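
The subset-tag rule quoted in the comments (exactly six uppercase letters followed by a plus sign) can be checked with plain string handling. The sketch below assumes the font name is already available as a string such as `JKZAML+Arial-BoldMT`; the class `FontNames` and its methods are hypothetical helpers, not PDFBox API. `looksBold` is only a naming heuristic: as noted above, font naming conventions are not hard rules, so it can be wrong in both directions.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: helpers for font names that may carry a subset tag. */
class FontNames {

    // Subset tag: exactly six uppercase letters followed by '+', then the actual font name.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+(.+)$");

    /** True if the name starts with a subset tag such as "JKZAML+". */
    static boolean hasSubsetTag(String fontName) {
        return SUBSET_TAG.matcher(fontName).matches();
    }

    /** Returns the name without the subset tag, or the name unchanged if there is none. */
    static String baseName(String fontName) {
        Matcher m = SUBSET_TAG.matcher(fontName);
        return m.matches() ? m.group(1) : fontName;
    }

    /** Heuristic only: does the name (ignoring any subset tag) mention "bold"? */
    static boolean looksBold(String fontName) {
        return baseName(fontName).toLowerCase(Locale.ROOT).contains("bold");
    }

    public static void main(String[] args) {
        String name = "JKZAML+Arial-BoldMT";
        System.out.println(baseName(name) + " bold? " + looksBold(name));
    }
}
```

Stripping the tag before testing for "bold" avoids a false match when the arbitrary six letters of the tag happen to spell something meaningful.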

0 Answers