3

I need a Java library that can perform the following tasks: 1) convert PDF pages to images, 2) extract HTML text from PDF pages along with its location on the page, and 3) extract images from PDF pages.

I have already tried:

  1. PDFBox - it fails with the error "unsupported/disabled operation: BDC and EMC".
  2. icePDF - it works for tasks 1) and 3), but it is a paid library.
  3. PDFRenderer - it fails.
  4. BFO - it is a paid library, but it can perform tasks 1) and 3).

Can anyone suggest a better solution?

Neeraj
Yashpal Singla

3 Answers

0

Have you tried JOD Converter? It's a Java API to a self-booted Open Office Server.

To see whether it converts to/from the formats you want, just install Open Office, open a file, and try to "Save As" the format you need, to see if it's supported.

Stewart
0

I followed these steps to solve the issue in an Ubuntu environment:

Step 1) Used the pdftohtml tool to convert the PDF to HTML

Step 2) Used Jsoup to extract the text, with its styling and position, from the HTML produced in step 1)

Step 3) Used CutyCapt to generate a snapshot of the HTML (if required)
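The position lookup in step 2 can also be sketched without the HTML detour. This is a variation, not the exact setup described above: it assumes pdftohtml was run with its -xml option (for example `pdftohtml -xml input.pdf out`), which writes each text run as a `<text>` element with `top`, `left`, `width` and `height` attributes, and it reads those with the JDK's built-in DOM parser instead of Jsoup. The names `PdfTextPositions`, `TextBox` and `parse` are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Sketch: read text positions from the XML written by `pdftohtml -xml`. */
final class PdfTextPositions {

    /** One text run and its bounding box, in the units pdftohtml reports. */
    static final class TextBox {
        final int page, top, left, width, height;
        final String text;

        TextBox(int page, int top, int left, int width, int height, String text) {
            this.page = page;
            this.top = top;
            this.left = left;
            this.width = width;
            this.height = height;
            this.text = text;
        }
    }

    /** Parses the XML document and returns every text element with its position. */
    static List<TextBox> parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // pdftohtml output declares a pdf2xml.dtd DOCTYPE; skip loading that file.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            List<TextBox> boxes = new ArrayList<>();
            NodeList pages = doc.getElementsByTagName("page");
            for (int p = 0; p < pages.getLength(); p++) {
                Element page = (Element) pages.item(p);
                int pageNumber = Integer.parseInt(page.getAttribute("number"));
                NodeList texts = page.getElementsByTagName("text");
                for (int t = 0; t < texts.getLength(); t++) {
                    Element e = (Element) texts.item(t);
                    // getTextContent() flattens nested <b>/<i> markup, so styling is dropped here.
                    boxes.add(new TextBox(pageNumber, intAttr(e, "top"), intAttr(e, "left"),
                        intAttr(e, "width"), intAttr(e, "height"), e.getTextContent()));
                }
            }
            return boxes;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse pdftohtml XML", ex);
        }
    }

    private static int intAttr(Element e, String name) {
        return Integer.parseInt(e.getAttribute(name));
    }
}
```

Calling `PdfTextPositions.parse(...)` on the contents of the generated XML file gives one `TextBox` per text run. If the styling matters, the `<fontspec>` elements and the inline `<b>`/`<i>` tags would have to be read as well, which this sketch skips.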

We can also use the pdftoppm command to render PDF pages to images directly from the PDF; the related pdfimages command extracts the images embedded in it.
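Since the question asks for Java, here is a minimal sketch of driving these command-line tools from Java with `ProcessBuilder`. It assumes the Poppler tools (pdftohtml, pdftoppm, pdfimages) are installed and on the PATH. The class and method names are made up for illustration, and the option sets are just one reasonable choice, not necessarily the ones I used.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

/** Sketch: build and run Poppler command lines from Java. */
final class PopplerCommands {

    /** pdftohtml -c -noframes: one HTML file with absolutely positioned text. */
    static List<String> pdfToHtml(String pdf, String outputPrefix) {
        return Arrays.asList("pdftohtml", "-c", "-noframes", pdf, outputPrefix);
    }

    /** pdftoppm -png -r <dpi>: renders each page to a PNG file. */
    static List<String> renderPagesToPng(String pdf, String outputPrefix, int dpi) {
        return Arrays.asList("pdftoppm", "-png", "-r", String.valueOf(dpi), pdf, outputPrefix);
    }

    /** pdfimages -png: writes the images embedded in the PDF as PNG files. */
    static List<String> extractEmbeddedImages(String pdf, String outputPrefix) {
        return Arrays.asList("pdfimages", "-png", pdf, outputPrefix);
    }

    /** Runs the command, forwarding its output to this process, and returns the exit code. */
    static int run(List<String> command) {
        try {
            Process process = new ProcessBuilder(command).inheritIO().start();
            return process.waitFor();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for " + command.get(0), e);
        }
    }
}
```

For example, `PopplerCommands.run(PopplerCommands.renderPagesToPng("input.pdf", "page", 150))` should leave one PNG per page next to the given prefix, and a non-zero return value means the tool reported a failure.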

Yashpal Singla
-2

You can do all of those things with PDFBox, although there is no ready-made API call for getting text positions. Download the latest PDFBox, then follow the links below for each task.

  1. Convert PDF pages to images
  2. Extract images from PDF pages
  3. Extracting HTML text from PDF pages with its location on the page is a little different. The API does not hand you the position information directly, but you can still get all of it through PDFBox.

Please have a look at this link. There you can see the getTextPos() function: getTextPos().getXPosition() and getTextPos().getYPosition() will give you the X and Y coordinates.

bummi
Neeraj
  • I have already tried PDFBox, as you can see in my post, but it fails with the BDC and EMC error. If you can help me resolve that, that would be great. – Yashpal Singla Nov 06 '12 at 06:58
  • @singla: Please check the links above and try them out. I have accomplished these tasks with PDFBox. If you are getting errors, let me know. – Neeraj Nov 06 '12 at 08:14
  • @Singla: Converting pages to images and extracting images can both be done directly using the API. Download PDFBox and please check the links. – Neeraj Nov 06 '12 at 08:17