
I have written Java code that extracts data from a PDF at a URL using the PDFBox API, and I can successfully get the whole document as text. However, the PDF contains article metadata such as the title, author name, and embargo date, and I want to extract only those fields rather than the full text. Is there any way to get only selected data from a PDF using PDFBox?

    // Download the PDF over HTTP with basic authentication
    URL url = new URL("http://www.example.com");
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setRequestProperty("Authorization", "Basic " + encodedString);
    connection.connect();
    InputStream input = connection.getInputStream();
    FileOutputStream fos1 = new FileOutputStream("download.pdf");
    // (....perform writing operation )

    // Parse the downloaded file and extract all of its text
    File in = new File("download.pdf");
    PDFParser parser = new PDFParser(new FileInputStream(in));
    parser.parse();
    COSDocument cosDoc = parser.getDocument();
    PDFTextStripper pdfStripper = new PDFTextStripper();
    PDDocument pdDoc = new PDDocument(cosDoc);

    String parsedText = pdfStripper.getText(pdDoc);
kirti
Mayank
  • *title,author name and embargo date and i want to extract that* - how are those data marked? Obviously such data must be marked somehow for recognition and, therefore, dedicated extraction. – mkl Oct 31 '14 at 20:37
  • The title has the largest font size of any text in the PDF, and it is also bold. By "marked for recognition", do you mean the data should have some unique feature, the way an email address can be found by looking for @ or .com? – Mayank Nov 01 '14 at 04:13
  • So *title* can be recognized by searching all text drawn using the largest effective font size. That can be implemented. The up-to-now not mentioned *email* can be recognized by the '@' character already in the string you currently extract. Do you have comparable criteria for the other fields you search? – mkl Nov 01 '14 at 21:54
  • Thanks for the update. Yes, I can recognize the email using string methods. My problem is how to get the title, which is drawn in a large font, from the PDF. If you have any code for that, please share it; it would be very helpful. Thanks again. – Mayank Jan 06 '15 at 17:03
  • *large text font* - Have a look at the code in [this answer](http://stackoverflow.com/a/25290318/1729265). It shows how you can use the font name during extraction, and the `TextPosition` class from which it gets the font name also has a `getFontSize` method. The sample adds the font name to the output whenever it changes, but it could also divert all text in a specific font or font size to a different data sink. – mkl Jan 07 '15 at 08:39
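
The two criteria discussed in the comments (title = text drawn at the largest font size, email = text containing '@') can be sketched in plain Java as below. This is only an illustration under stated assumptions, not a tested PDFBox solution: `FieldExtractor`, `TextRun`, `largestFontText`, and `findEmails` are hypothetical names, and the `TextRun` list is assumed to have been filled elsewhere, for example from a `PDFTextStripper` subclass that records each piece of text together with the value of `TextPosition.getFontSize()`. The email pattern is deliberately loose and is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldExtractor {

    /** Hypothetical holder: one piece of text and the font size it was drawn with. */
    public static final class TextRun {
        final String text;
        final float fontSize;

        public TextRun(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /** Joins, in their original order, all runs drawn at the largest font size found. */
    public static String largestFontText(List<TextRun> runs) {
        // First pass: find the largest font size in the document
        float max = 0f;
        for (TextRun run : runs) {
            max = Math.max(max, run.fontSize);
        }
        // Second pass: keep only the runs drawn at that size
        StringBuilder title = new StringBuilder();
        for (TextRun run : runs) {
            // Small tolerance, because font sizes are floating-point values
            if (Math.abs(run.fontSize - max) < 0.01f) {
                if (title.length() > 0) {
                    title.append(' ');
                }
                title.append(run.text.trim());
            }
        }
        return title.toString();
    }

    // Loose email pattern (an assumption; it does not cover every valid address)
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Returns every substring of the extracted text that looks like an email address. */
    public static List<String> findEmails(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for text collected during extraction
        List<TextRun> runs = new ArrayList<>();
        runs.add(new TextRun("Sample Article", 18f));
        runs.add(new TextRun("Title", 18f));
        runs.add(new TextRun("Contact: jane.doe@example.com", 10f));

        System.out.println(largestFontText(runs));
        System.out.println(findEmails("Contact: jane.doe@example.com"));
    }
}
```

Keeping the selection logic separate from the PDF parsing means the font-size rule can be checked on made-up data before wiring it to real extraction. With real PDFs the reported font size may not match the visible size when the text is scaled, so the values may need to be adjusted before comparison.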

0 Answers