1

I want to convert a PDF file that contains a few images into XML using Java.

Is there an API that can do this, so that all the images and text in the PDF are converted into an XML file?

Please help.

pdrryy

4 Answers

2

Use pdftohtml.

It can be installed with brew install pdftohtml. This adds pdftohtml to your PATH.

To convert a PDF to XML, you can then run pdftohtml -xml your_file.pdf your_file.xml

After that, just use Java or any other language to execute this command.
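
For the last step, here is a minimal sketch of calling the command from Java with ProcessBuilder. It assumes pdftohtml is already installed and on the PATH; the class and method names (PdfToXml, buildCommand, convert) are made up for illustration.

```java
import java.io.IOException;
import java.util.List;

class PdfToXml {
    // Builds the command line from the answer: pdftohtml -xml <pdf> <xml>
    static List<String> buildCommand(String pdfPath, String xmlPath) {
        return List.of("pdftohtml", "-xml", pdfPath, xmlPath);
    }

    // Runs the command, forwarding its output to this process's console,
    // and returns the exit code of pdftohtml.
    static int convert(String pdfPath, String xmlPath)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdfPath, xmlPath))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PdfToXml <input.pdf> <output.xml>");
            return;
        }
        int exitCode = convert(args[0], args[1]);
        System.out.println("pdftohtml exited with code " + exitCode);
    }
}
```

Passing the arguments as a list rather than one concatenated string avoids quoting problems when a file name contains spaces.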

Flaviu
1

PDF is one of the worst formats to work with. It is designed for rendering 2D graphics and text documents. There are libraries that let you manipulate the objects in a PDF document, but they cannot tell you which paragraph an image relates to. You will not be able to extract the semantics easily.

XML, on the other hand, is designed to store text data in a well-structured manner, which means it carries implicit semantics. To convert from a format that has no semantics to one that does, you need to add your own logic to the conversion process; otherwise you will just end up with a mess in your XML, which defeats the whole purpose of using XML.

Since PDF documents differ so much from one another, it is almost impossible to automate this without human help.

If you are really determined to do it, I suggest you use a library to read the PDF into objects and start writing a converter from there. You will have to take care of page breaks, line breaks, page numbers, headers, images, graphics, tables, and much more yourself. Since XML is made mainly for text data, you will have to deal with graphics somehow if you want to store them in XML, e.g. by converting them into a Base64 string.
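
As a rough sketch of the output side of such a converter, the following uses only the JDK (DOM plus java.util.Base64) to write one page's text and one image into XML. It assumes the text and the raw image bytes have already been extracted by whatever PDF library you choose; the element names (document, page, text, image) and the class PageToXml are hypothetical, not any standard schema.

```java
import java.io.StringWriter;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class PageToXml {
    // Serializes one page (text plus one image) into an XML string.
    static String toXml(int pageNumber, String text, byte[] imageBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("document");
            doc.appendChild(root);

            Element page = doc.createElement("page");
            page.setAttribute("number", Integer.toString(pageNumber));
            root.appendChild(page);

            // The DOM serializer escapes characters such as < and & in text.
            Element textEl = doc.createElement("text");
            textEl.setTextContent(text);
            page.appendChild(textEl);

            // Binary image data is stored as a Base64 string.
            Element image = doc.createElement("image");
            image.setAttribute("encoding", "base64");
            image.setTextContent(Base64.getEncoder().encodeToString(imageBytes));
            page.appendChild(image);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not build XML", e);
        }
    }
}
```

Deciding which image belongs to which paragraph, and handling headers, tables, and the rest, is the custom logic you would still have to write yourself.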

gigadot
0

iText is a library that allows you to create and manipulate PDF documents. It is aimed at developers who want to enhance web and other applications with dynamic PDF document generation and/or manipulation. Developers can use iText to:

* Serve PDF to a browser
* Generate dynamic documents from XML files or databases
* Use PDF's many interactive features
* Add bookmarks, page numbers, watermarks, etc.
* Split, concatenate, and manipulate PDF pages
* Automate filling out of PDF forms
* Add digital signatures to a PDF file

iText is available in Java as well as in C#.

0

You could Base64 encode the entire PDF file's byte stream and serialize it into an XML document like "<pdf><![CDATA[BASE64ENCODEDPDFFILECONTENTS...]]></pdf>". =)
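
For what it's worth, a minimal sketch of that approach in plain Java could look like the following. The class and method names (PdfAsBase64Xml, wrap) are made up for illustration, and the file paths come from the command line.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

class PdfAsBase64Xml {
    // Wraps the Base64 form of the given bytes in a <pdf> element.
    // The Base64 alphabet never contains "]]>", so the CDATA section is safe.
    static String wrap(byte[] pdfBytes) {
        String encoded = Base64.getEncoder().encodeToString(pdfBytes);
        return "<pdf><![CDATA[" + encoded + "]]></pdf>";
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PdfAsBase64Xml <input.pdf> <output.xml>");
            return;
        }
        byte[] pdfBytes = Files.readAllBytes(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), wrap(pdfBytes).getBytes(StandardCharsets.UTF_8));
    }
}
```

Note that this only transports the PDF inside XML; the text and images are not individually accessible without decoding the PDF again.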

maerics