42

I have been playing around with PDFBox and its PDFTextStripperByArea class.

I was able to find out whether the text is bold or italic, but I'm unable to get the underline information.

As far as I understand it, in PDF an underline is done by drawing lines. So in theory I should be able to get some sort of information about lines somewhere around the text. Given this information I could then find out whether the text is underlined or sits in a table.

Here is my code so far:

List<TextPosition> textPos = charactersByArticle.get(index);

for (TextPosition t : textPos)
{
    // PDFontDescriptor lives in org.apache.pdfbox.pdmodel.font; it may be null.
    PDFontDescriptor descriptor = t.getFont().getFontDescriptor();
    if (descriptor != null)
    {
        // BOLD_WEIGHT is a threshold constant defined elsewhere in my code.
        if (descriptor.getFontWeight() > BOLD_WEIGHT || descriptor.isForceBold())
        {
            isBold = true;
        }

        if (descriptor.isItalic())
        {
            isItalic = true;
        }
    }
}

I have tried playing around with the PDGraphicsState object, which is processed in the processEncodedText method of the PDFStreamEngine class, but found no information about lines there.

Any suggestions where this information could be retrieved from?

Drejc
  • 1
    Drejc: Good question. I was stuck on the same problem when I was working with PDFBox, converting PDF to HTML. I solved it by treating underlines as part of the background image, but I think this will not work in your case. If we could get the x,y coordinates of the lines, that would be really good. – Neeraj Dec 20 '12 at 06:50
  • 3
    I suggest you try out another PDF processing library. Underlining is not an attribute of a font such as font weight (bold) or shape (italics) but rather a graphical object put below the text. I've been reading the PDFBox API and it looks like you can get all the graphic objects. So you would have to write a program that calculates the coordinates of something that looks like a line and then see if it is below some of the text. That's rather tedious. But I have never used PDFBox before, so I'm not an expert. – alexkelbo Dec 26 '12 at 22:07
  • I'm aware of the fact that lines are not part of the text but graphical objects. Switching libraries is not an option, plus the alternatives don't provide this functionality either (as far as I could find out). I can tweak PDFBox to get to the graphics while extracting text, but it is a lot of work and a lot of trial and error. I was hoping someone had already done this. – Drejc Dec 27 '12 at 07:58
  • Have a look at TextOutputDev in the xpdf codebase. It handles the line/underline problem by using pseudo text segmentation. Results aren't 100% right, but they are quite good. – user18428 Feb 28 '13 at 08:09
  • Will certainly do this ... as soon as I get to it. – Drejc Feb 28 '13 at 12:10
  • 1
    @Neeraj As you offered a bounty, you might have some types of underlines in mind. I just looked at PDFs produced by MS Word, and they actually use slim filled rectangles instead of lines to underline or strike through. Maybe your PDFs use different techniques. Thus you (or Drejc) should post some sample documents in which you expect underlines to be recognized... (Be aware, though, that it remains an unreliable matter, because the rectangles are not linked to the words they underline in any way other than their position on the page.) – mkl Mar 05 '13 at 09:18
  • @mkl I am expecting a general way to find all the lines in a PDF. As Drejc said, all the lines are graphical objects, so we do not get any line information, including underlines and table lines. You can take any PDF that contains tables and underlines. As I said before, I faced this problem while working on a "PDF to HTML conversion" project for azzist.com. Since all lines are graphical objects, I was able to solve it by treating them as part of the background image. But now I am working on resume parsing, so these details are very important. – Neeraj Mar 06 '13 at 04:44
  • 2
    @Neeraj If you really want to recognize any *possible* way such lines can theoretically be created, you'd also have to consider lines which are part of some background bitmap image. Thus, you are back to image analysis, which is something you don't want to have to do. If, on the other hand, it suffices to recognize underlines as actually produced by sufficiently widespread software, things are easier, but one has to define which software or which PDFs to take into account. – mkl Mar 06 '13 at 05:27
  • @mkl But for PDFBox they are not simple lines, I think. While extracting text it seems able to detect the table structure, and it extracts text block by block. Another option is that it measures the distance between characters and forms some block structure from that. I don't know how it accomplishes this. I just want to know the blocks inside the PDF. Table structures are currently a headache when analysing PDFs. If we could get the table structure from the line information, it would be very helpful. – Neeraj Mar 06 '13 at 09:28
  • @Drejc https://github.com/rosslagerwall/poppler/blob/master/poppler/TextOutputDev.cc . – user18428 Mar 06 '13 at 10:25

5 Answers

5

Here is what I have found out so far:

PDFBox uses a resource file to bind PDF operators/instructions to certain classes, which then process the information.

If we take a look at the PDFTextStripper.properties resource file under:

pdfbox\src\main\resources\org\apache\pdfbox\resources\

we can see that for instance the BT operator is bound to the org.apache.pdfbox.util.operator.BeginText class and so on.

The PDFTextStripper under

pdfbox\src\main\java\org\apache\pdfbox\util\

takes this into account and processes the PDF with these classes.
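To make the consequence of such a binding table concrete, here is a minimal sketch of table-driven operator dispatch. It is not PDFBox code: the class name, the handler names and the three-line binding text are made up for illustration, and handlers are only looked up by name rather than instantiated. What it shows is that any operator missing from the table is dropped without an error.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;

public class OperatorDispatchSketch {

    // Hypothetical handler names this sketch knows how to "run".
    static final Set<String> KNOWN_HANDLERS =
            new HashSet<>(Arrays.asList("BeginText", "EndText", "ShowText"));

    // Stand-in for a text-only binding file: no path operators are listed.
    static final String TEXT_ONLY_BINDINGS = "BT=BeginText\nET=EndText\nTj=ShowText\n";

    /** Returns one entry per operator that has a binding; the rest are skipped. */
    static List<String> process(String bindings, List<String> operators) {
        Properties table = new Properties();
        try {
            table.load(new StringReader(bindings));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> handled = new ArrayList<>();
        for (String operator : operators) {
            String handler = table.getProperty(operator);
            if (handler == null || !KNOWN_HANDLERS.contains(handler)) {
                continue; // unbound operator: dropped without any error
            }
            handled.add(operator + " -> " + handler);
        }
        return handled;
    }

    public static void main(String[] args) {
        // "re" and "f" (rectangle, fill) are not in the table, so they vanish.
        List<String> operators = Arrays.asList("BT", "Tj", "ET", "re", "f");
        for (String line : process(TEXT_ONLY_BINDINGS, operators)) {
            System.out.println(line);
        }
    }
}
```

With a table like this, adding graphics support means adding bindings for the path operators plus handlers that record what was drawn.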

BUT all graphical objects are ignored, therefore no information of underline or table structure!

Now if we take a look at the PageDrawer.properties resource file, we can see that this one binds almost all available operators. It is used by the PageDrawer class under

pdfbox\src\main\java\org\apache\pdfbox\pdfviewer\

The "trick" now is to find out which graphical operators represent underlines and tables, and to use them in combination with PDFTextStripper.

That would mean reading the PDF file specification, which is currently way too much work.

If someone knows which operators are responsible for drawing underlines and table lines, please let me know.
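For orientation: in the PDF specification, paths are built with `m` (moveto), `l` (lineto) and `re` (rectangle), and only become visible once a painting operator such as `S` (stroke) or `f` (fill) follows. The sketch below scans a simplified content-stream fragment for exactly those operators. It is an illustration under heavy assumptions, not a PDF parser: tokens are split on whitespace, strings, arrays and inline images are not handled, the transformation matrix (`cm`) is ignored, and the class and method names are invented.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class PathOperatorSketch {

    /** Axis-aligned box; a horizontal or vertical segment has zero height or width. */
    static final class Box {
        final double x, y, width, height;

        Box(double x, double y, double width, double height) {
            // Normalise so that (x, y) is always the lower-left corner.
            this.x = Math.min(x, x + width);
            this.y = Math.min(y, y + height);
            this.width = Math.abs(width);
            this.height = Math.abs(height);
        }
    }

    /** Collects the bounding boxes of segments and rectangles that are painted. */
    static List<Box> paintedShapes(String content) {
        List<Box> painted = new ArrayList<>();
        List<Box> currentPath = new ArrayList<>();
        Deque<Double> operands = new ArrayDeque<>();
        double penX = 0, penY = 0;

        for (String token : content.trim().split("\\s+")) {
            switch (token) {
                case "m": // moveto: x y m
                    if (operands.size() >= 2) {
                        penY = operands.removeLast();
                        penX = operands.removeLast();
                    }
                    operands.clear();
                    break;
                case "l": // lineto: x y l
                    if (operands.size() >= 2) {
                        double y = operands.removeLast();
                        double x = operands.removeLast();
                        currentPath.add(new Box(penX, penY, x - penX, y - penY));
                        penX = x;
                        penY = y;
                    }
                    operands.clear();
                    break;
                case "re": // rectangle: x y width height re
                    if (operands.size() >= 4) {
                        double h = operands.removeLast();
                        double w = operands.removeLast();
                        double y = operands.removeLast();
                        double x = operands.removeLast();
                        currentPath.add(new Box(x, y, w, h));
                    }
                    operands.clear();
                    break;
                case "S": case "s": case "f": case "F": case "f*":
                case "B": case "B*": case "b": case "b*":
                    painted.addAll(currentPath); // the path is actually drawn
                    currentPath.clear();
                    operands.clear();
                    break;
                case "n": // path ended without painting, e.g. after clipping
                    currentPath.clear();
                    operands.clear();
                    break;
                default:
                    try {
                        operands.addLast(Double.parseDouble(token));
                    } catch (NumberFormatException e) {
                        operands.clear(); // some other operator or operand type
                    }
            }
        }
        return painted;
    }
}
```

Feeding it a fragment such as `72 700 m 200 700 l S 72 650 128 0.5 re f` yields one zero-height segment and one slim rectangle, which are the two shapes an underline or table rule is most likely to be drawn with.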

Drejc
  • PDF format is like this -- that's the problem. It represents graphical implementation, not semantics. Even reconstructing sentences is difficult -- words or even individual characters are positioned, the 'space' or 'newline' must be algorithmically reconstructed. In short, Adobe are a*holes. – Thomas W Mar 02 '13 at 01:47
  • However, you can accomplish your requirement -- if you are willing to put, say, 12-20 hours of work in. Copying PDFBox classes & extending the internals is often a good starting point. As well as detecting by position, underlines will typically be emitted immediately after their text I expect! – Thomas W Mar 02 '13 at 01:50
  • Also try constructing a trivial two-line PDF with underlined text, and see what you can hook coming back in! Should be easy to see it there. Enjoy. – Thomas W Mar 02 '13 at 01:52
  • The task I was performing was a read-in and transformation one. I have managed to get almost everything right except tables and underlines. Special characters (like accented ones) are also a pain, so I have some heuristics in place which act according to document type. The things which can't be done are now edited by hand (as a final visual check is always needed). I have abandoned the lines feature, as the benefit is too low for the time it would take. – Drejc Mar 04 '13 at 13:35
2

As you mention -- PDFBox uses resource files to bind PDF operators/instructions to visitors which will process the information.

You'd probably best start by copying PDFBox's existing visitor into your own source folder, and then adding to or extending the implementation from there.

My long-ago PostScript experience recalls 'moveto' and 'lineto' operators. Since PDF is roughly PS-based, you'll be looking for something similar.

http://learnpostscript.wordpress.com/category/lineto/

PDF format is a b*tch -- it's HTML, done wrong. It represents graphical implementation, not semantics. Even reconstructing sentences is difficult -- words or even individual characters are positioned, and the 'space' or 'newline' must be algorithmically reconstructed. In short, Adobe are a*holes. And Reader is a non-ergonomic, bug-riddled, insecure, bloated pig.

However, you can accomplish your requirement -- if you are willing to put, say, 12+ hours of work in. As well as detecting by position, underlines will typically be emitted in the PDF immediately after their text, so you can latch your detection onto PDF document order, not just page position.
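The position half of that check could look roughly like the sketch below. Everything in it is an assumption for illustration: the `Rect` type and method name are invented, the text box is taken to be (x, baseline y, width, font size), both boxes are assumed to share one coordinate space with y growing upward, and the 0.15 / 0.3 / 0.8 thresholds are guesses that would need tuning against real documents. Document order is not modelled here.

```java
public class UnderlineHeuristicSketch {

    /** Axis-aligned box with (x, y) as the lower-left corner. */
    static final class Rect {
        final double x, y, width, height;

        Rect(double x, double y, double width, double height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * text:  x, baseline y, width, and font size as height.
     * shape: a painted line segment or filled rectangle.
     */
    static boolean looksLikeUnderline(Rect text, Rect shape) {
        double fontSize = text.height;

        // Slim compared with the glyphs, so table cells and boxes are excluded.
        boolean thin = shape.height <= 0.15 * fontSize;

        // Top edge sits at or slightly below the baseline (excludes strike-through).
        double shapeTop = shape.y + shape.height;
        boolean justBelowBaseline =
                shapeTop <= text.y && shapeTop >= text.y - 0.3 * fontSize;

        // The shape spans most of the text horizontally.
        double overlap = Math.min(text.x + text.width, shape.x + shape.width)
                - Math.max(text.x, shape.x);
        boolean coversText = overlap >= 0.8 * text.width;

        return thin && justBelowBaseline && coversText;
    }

    public static void main(String[] args) {
        Rect word = new Rect(72, 700, 50, 12);
        Rect rule = new Rect(72, 698.5, 128, 0.5);
        System.out.println(looksLikeUnderline(word, rule));
    }
}
```

A shape that fails the "thin" test but has text on both sides is a better candidate for a table border, so the same boxes can feed a separate table heuristic.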

Also, try constructing a trivial two-line PDF with underlined text. Then see what you can make of it, parsing it back in! The underline should stick out like dog's bananas, and once you can detect that, you'll be well on the way.

PDFBox is not very good for extensibility, it's mainly just a big pile of algorithms. For this reason, just copy the PDFTextStripper source (and maybe have PageDrawer for reference) and prototype from there.

Hope this helps!

Thomas W
  • 1
    It does ... I have already played around with it. The PDFBox code is really messy, and I got the feeling some operations are performed multiple times for no obvious reason. Currently the effort involved in extracting a simple underline is way too much; porting the code to some other library is an option, but also out of scope. – Drejc Mar 04 '13 at 13:38
  • I don't think you'll get any more joy out of any other library.. count me surprised if you do. Even identifying _words_ in PDF is difficult. The format is ridiculously badly designed. – Thomas W Mar 14 '13 at 11:11
1

You can use iText to generate PDF reports.

With iText you can add lines easily.

Try the following:

document.add(new LineSeparator(0.5f, 50, null, 0, 198));

The code above draws a line in the PDF report; set the dimensions as you like.

Hope this helps.

Ravi
  • 1
    The aim is to read underlines not to create one. I have had a look at iText but could not find this feature either. Plus my code base is currently bound to PdfBox. – Drejc Feb 28 '13 at 12:09
1

As far as I understand PDFBox, there is no option for reading underlines. Maybe you can try itextpdf for this purpose.

Andrew Barber
Rahul Munjal
-3

According to the API, getFont() returns the font size.

You can use the getStyle() method, and it will return STYLE_UNDERLINE for an underlined font. Thus you can retrieve the underline style.

prium
  • 1
    There is no such property available in the font descriptor when extracting text using PDFStreamEngine. The underline (and table lines) need to be extracted as graphical objects and then somehow bound to the text flow. Simply reading the online docs won't do it. – Drejc Dec 27 '12 at 08:07