I use PDFBox to extract some information from a PDF, but how can I extract the information of every object? If one of them contains a stream, how can I decode the stream to display it?
Can I get the maximum font size from a PDF with PDFBox? I think that if I can get the font size of every object and sort them, I get the object which has the maximum font size.
-
no distractions, no chit-chat (read [help→tour](http://stackoverflow.com/tour)), thanks and phrases like "could you help me" are never part of a good question – Anthon Mar 23 '15 at 06:00
1 Answer
> I use PDFBox to extract some information from a PDF. But how can I extract every object's information? If one of them contains a stream, how can I decode the stream to display it?
If by every object you mean everything drawn as part of the page content, these objects are contained in the page content streams and in the XObject streams they reference. You can parse these streams using a parser derived from the `PDFStreamEngine` class.
That class already does most of the heavy lifting, like retrieving individual operations from the streams, managing a stack of graphics states, etc. You will have to supply some callbacks, though, for the operations drawing the objects you are interested in.
To get an idea of how to extend that class properly, have a look at some of the subclasses provided with PDFBox, e.g. `PDFTextStripper`, `PDFMarkedContentExtractor`, or `PageDrawer`.
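To make the division of labour concrete (decode the stream, retrieve the operations, keep a stack of states, call back for the operations of interest), here is a toy sketch in plain Java. It is not PDFBox code: `MiniStreamEngine` and `TextListener` are hypothetical names, it assumes a stream compressed with the FlateDecode filter only, and it understands just the `q`, `Q`, `Tf` and `Tj` operators. Real content streams also contain arrays, escaped and nested parentheses, and matrix operators, all of which PDFBox handles for you.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Toy content-stream interpreter; NOT PDFBox code and far from a complete parser. */
class MiniStreamEngine {
    /** Callback supplied by the caller, invoked for every Tj (show text) operation. */
    interface TextListener {
        void onText(String text, float fontSize);
    }

    // Stands in for the graphics state stack; only the font size is tracked here.
    private final Deque<Float> savedFontSizes = new ArrayDeque<>();
    private float fontSize = 0f;

    /** Undoes a FlateDecode filter (zlib data); other filters are not handled. */
    static String inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!inflater.finished()) {
            int n = inflater.inflate(buffer);
            if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                break; // truncated or unsupported data: stop instead of looping forever
            }
            out.write(buffer, 0, n);
        }
        inflater.end();
        return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    /** Walks the operations of a decoded content stream and fires the callback. */
    void process(String content, TextListener listener) {
        List<String> operands = new ArrayList<>();
        for (String token : tokenize(content)) {
            if (isOperand(token)) {
                operands.add(token);
                continue;
            }
            if (token.equals("q")) {
                savedFontSizes.push(fontSize);          // save state
            } else if (token.equals("Q") && !savedFontSizes.isEmpty()) {
                fontSize = savedFontSizes.pop();        // restore state
            } else if (token.equals("Tf") && operands.size() == 2) {
                fontSize = Float.parseFloat(operands.get(1));
            } else if (token.equals("Tj") && operands.size() == 1) {
                listener.onText(operands.get(0).substring(1), fontSize);
            }
            operands.clear(); // every operator consumes the operands before it
        }
    }

    /** Splits on whitespace; a literal string is kept as "(" plus its contents. */
    private static List<String> tokenize(String content) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < content.length()) {
            char c = content.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '(') {
                int end = content.indexOf(')', i);
                if (end < 0) {
                    end = content.length();
                }
                tokens.add(content.substring(i, end));
                i = end + 1;
            } else {
                int end = i;
                while (end < content.length() && content.charAt(end) != '('
                        && !Character.isWhitespace(content.charAt(end))) {
                    end++;
                }
                tokens.add(content.substring(i, end));
                i = end;
            }
        }
        return tokens;
    }

    private static boolean isOperand(String token) {
        char c = token.charAt(0);
        return c == '(' || c == '/' || c == '-' || c == '+' || c == '.' || Character.isDigit(c);
    }

    public static void main(String[] args) throws DataFormatException {
        String sample = "BT /F1 12 Tf (Hello) Tj ET q BT /F1 24 Tf (Big) Tj ET Q BT (Back) Tj ET";
        // Compress the sample so that there is a FlateDecode stream to decode.
        Deflater deflater = new Deflater();
        deflater.setInput(sample.getBytes(StandardCharsets.ISO_8859_1));
        deflater.finish();
        byte[] buffer = new byte[1024];
        byte[] compressed = Arrays.copyOf(buffer, deflater.deflate(buffer));
        deflater.end();

        new MiniStreamEngine().process(inflate(compressed),
                (text, size) -> System.out.println(text + " drawn with Tf size " + size));
    }
}
```

Running `main` reports `Hello` at 12.0, `Big` at 24.0 and `Back` at 12.0 again, because `Q` restores the state saved by `q`. With PDFBox you would not write any of this yourself; you would only supply the equivalent of the `TextListener` part in a subclass.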
> Can I get the maximum font size from a PDF with PDFBox? I think if I can get every object's font size and sort them, then I get the object which has the maximum font size?
Indeed, you can use the above-mentioned `PDFTextStripper` or, more exactly, a class derived from it. The text stripper as is mainly returns plain text, but you can override certain of its methods to get the text with additional information.
E.g. you can override `writeString(String text, List<TextPosition> textPositions)`. Its standard implementation only uses `text`, i.e. the extracted plain text, but you can inspect the `textPositions`, i.e. the text with extra information, among them position and size.
This answer shows how to override `PDFTextStripper.writeString` to get access to the font name. Similarly you can access the font size. Beware, there are two `TextPosition` methods for this, `getFontSize` and `getFontSizeInPt`, and you might actually need yet another kind of size.
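The bookkeeping such a `writeString` override needs can be sketched without PDFBox on the classpath. In the sketch below, `Glyph` is a hypothetical stand-in for `TextPosition` and `MaxFontSizeCollector` is an illustrative name, not a PDFBox class; in a real `PDFTextStripper` subclass the loop would read `getFontSize()` or `getFontSizeInPt()` from each `TextPosition` instead of a field. Note that no sorting is needed, a running maximum is enough.

```java
import java.util.Arrays;
import java.util.List;

/** Tracks the largest font size seen; mimics the shape of a writeString override. */
class MaxFontSizeCollector {
    /** Hypothetical stand-in for PDFBox's TextPosition: one glyph plus its size. */
    static final class Glyph {
        final String unicode;
        final float fontSizeInPt;

        Glyph(String unicode, float fontSizeInPt) {
            this.unicode = unicode;
            this.fontSizeInPt = fontSizeInPt;
        }
    }

    private float maxFontSize = 0f;
    private String textAtMax = "";

    /** Called once per extracted text chunk, like the method being overridden. */
    void writeString(String text, List<Glyph> glyphs) {
        for (Glyph glyph : glyphs) {
            if (glyph.fontSizeInPt > maxFontSize) {
                maxFontSize = glyph.fontSizeInPt;
                textAtMax = text; // remember which chunk carried the largest size
            }
        }
    }

    float getMaxFontSize() {
        return maxFontSize;
    }

    String getTextAtMax() {
        return textAtMax;
    }

    public static void main(String[] args) {
        MaxFontSizeCollector collector = new MaxFontSizeCollector();
        collector.writeString("body", Arrays.asList(new Glyph("b", 10f), new Glyph("o", 10f)));
        collector.writeString("Title", Arrays.asList(new Glyph("T", 24f), new Glyph("i", 24f)));
        collector.writeString("note", Arrays.asList(new Glyph("n", 8f)));
        // prints: Title at 24.0 pt
        System.out.println(collector.getTextAtMax() + " at " + collector.getMaxFontSize() + " pt");
    }
}
```

Which of the size values to feed into the comparison is the open question mentioned above; the collector itself does not care.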
EDIT
In a comment, the OP asked
> How can I get started with PDFStreamEngine?
As mentioned above, have a look at some subclasses provided with PDFBox. The most prominent surely is the `PDFTextStripper`.
The simplest `PDFTextStripper` use looks like this:
```java
PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition(true);
PDDocument document = PDDocument.load(PDF_DOCUMENT);
String text = stripper.getText(document);
document.close();
```
This only extracts the plain text of the document. For more specialized tasks look at these sample usages:
- ExtractTextByArea.java - PDFBox example on how to extract text from a specific area on the PDF document
- PrintTextLocations.java - PDFBox example on how to get some x/y coordinates of text
- Get font of each line using PDFBox - stackoverflow answer illustrating how to extract text with font information
- Identifying the text based on the output in PDF using PDFBOX - stackoverflow answer illustrating how to extract text with color information
- How to determine artificial bold style ,artificial italic style and artificial outline style of a text using PDFBOX - stackoverflow answer illustrating how to extract text identifying certain artificial styles
- PDF file extraction using PDFBOX for tabular data - stackoverflow answer illustrating how to extract text attempting to reflect the PDF file layout in the output
- How to check if a text is transparent with pdfbox - stackoverflow answer illustrating how to extract only text not covered by some image
More usage examples of `PDFStreamEngine` and other subclasses:
- PrintImageLocations.java - PDFBox example on how to get the x/y coordinates of image locations, based on `PDFStreamEngine` directly
- Get Visible Signature from a PDF using PDFBox? - stackoverflow answer illustrating how to draw annotations, especially signature visualizations, based on `PageDrawer`
> How can I obtain the TextPosition from a PDF?
As mentioned in my original answer, use a `PDFTextStripper` and override `writeString(String text, List<TextPosition> textPositions)`. Its standard implementation only uses `text`, i.e. the extracted plain text, but you can inspect the `textPositions`, i.e. the text with extra information, among them position and size.
-
How can I get start with PDFSteamEngine???How can I obtain the Textposition from a PDF??? – dock Mar 26 '15 at 02:59