
I recently started working with PDFBox to extract text from a PDF. Along with the text, I also need to extract the check box values shown in the image below. I have tried different methods to find the check box element and extract its value.

[Image: Checkboximage]

After inspecting the PDF content with this tool, I found that the check box is not an image but some kind of vector graphics, represented by the content below.

ET
Q
q
BT
/F2 6 Tf
481.3 653.29 Td
(  ) Tj
ET
Q
q
1 1 1 rg
484.3 653.29 9 9 re
f
Q
q
0.87059 0.87059 0.87059 rg
485.05 661.54 m
492.55 661.54 l
493.3 662.29 l
484.3 662.29 l
485.05 661.54 l
f
Q
q
0.87059 0.87059 0.87059 rg
492.55 661.54 m
492.55 654.04 l
493.3 653.29 l
493.3 662.29 l
492.55 661.54 l
f
Q
q
0.87059 0.87059 0.87059 rg
492.55 654.04 m
485.05 654.04 l
484.3 653.29 l
493.3 653.29 l
492.55 654.04 l
f
Q
q
0.87059 0.87059 0.87059 rg
485.05 654.04 m
485.05 661.54 l
484.3 662.29 l
484.3 653.29 l
485.05 654.04 l
f
Q
q
BT
/F2 6 Tf
495.55 653.29 Td
(Yes) Tj
ET
Q
q
BT
/F2 6 Tf
504.88 653.29 Td
(  ) Tj
ET
Q
q
1 1 1 rg
507.88 653.29 9 9 re
f
Q
q
0.87059 0.87059 0.87059 rg
508.63 661.54 m
516.13 661.54 l
516.88 662.29 l
507.88 662.29 l
508.63 661.54 l
f
Q
q
0.87059 0.87059 0.87059 rg
516.13 661.54 m
516.13 654.04 l
516.88 653.29 l
516.88 662.29 l
516.13 661.54 l
f
Q
q
0.87059 0.87059 0.87059 rg
516.13 654.04 m
508.63 654.04 l
507.88 653.29 l
516.88 653.29 l
516.13 654.04 l
f
Q
q
0.87059 0.87059 0.87059 rg
508.63 654.04 m
508.63 661.54 l
507.88 662.29 l
507.88 653.29 l
508.63 654.04 l
f
Q
q
BT
/F2 6 Tf
519.13 653.29 Td
(No) Tj
ET
Q
q
BT
/F2 6 Tf
36.75 642.95 Td

I am not sure how to extract this from the PDF. I have seen the different parsers provided by PDFBox, but it looks like I need more information about how the PDF is constructed. Any pointers would be much appreciated.

Sariq Shaikh
  • If the boxes are in the page content stream and not in acroform then you'll have to do OCR. – Tilman Hausherr Sep 28 '20 at 03:45
  • If the only paths drawn in your pdfs are those that create ticked or un-ticked check boxes and if they always are drawn the same way, you can try and analyze them, too, during text extraction. But this really only makes sense under those preconditions. – mkl Sep 28 '20 at 05:01
  • Thank you, I was thinking about the above two options as a solution. Though I was wondering how PDF viewers are able to recognise it and render it properly? I tried a Java-based PDF viewer and it works fine, so how is it able to parse and render it properly if we cannot identify the details through APIs? – Sariq Shaikh Sep 28 '20 at 05:18
  • The content you posted are the rendering commands. The source code in PDFBox parses these and renders them. See also https://stackoverflow.com/questions/38931422/ for how to identify paths. – Tilman Hausherr Sep 28 '20 at 07:44
  • *"how it's able to parse and render it properly if we can not identify the details through APIs"* - Each of those **m - l - l - l - l -f** sequences appears to draw one edge of a box. Your excerpt does not yet contain the instructions for drawing the checkmark, merely the boxes. Similarly it is easy in code to recognize that some lines are drawn. In the rendering case the recognition that such lines form a check box (checked or not) happens in the brain of the viewing human, and that is the more difficult part to put into code. – mkl Sep 28 '20 at 08:18
  • Thanks @mkl and Tilman Hausherr, got the point about the graphics being drawn using these instructions. Now what do you suggest to extract these details? OCR is the last option I would like to keep, as it's comparatively slow and error-prone in my case. – Sariq Shaikh Sep 28 '20 at 08:21
  • Concerning the tool: There is a similar tool in the PDFBox portfolio, called [PDFDebugger](https://pdfbox.apache.org/2.0/commandline.html#pdfdebugger). – mkl Sep 28 '20 at 08:22
  • *"Now what do you suggest to extract these details."* - first of all, what is your input: Is it homogeneous and all check boxes and check marks are drawn identically (except their actual coordinates)? Or are there different instruction sequences to draw them? In the former case please share an example PDF with such check boxes and marks, probably we can give hints how to implement a recognition thereof. In the latter case, though, that approach makes little sense (unless you have a lot of time to implement it). OCR might be worth a try (if the rendered boxes at least *look* similar). – mkl Sep 28 '20 at 08:29
  • I think it is the first case, based on the content I am getting in RUPS. It's generated using the latest iTextSharp version, so I am thinking it's a checkbox component applied. Someone with iTextSharp knowledge can tell that based on how they generate the PDF. I will share a sample PDF to look into the problem closely. Thank you for your inputs. – Sariq Shaikh Sep 28 '20 at 10:13
  • @mkl I have added sample pdf link in the question itself. – Sariq Shaikh Sep 28 '20 at 10:42
  • Itext does not by default create check boxes in the content stream using vector graphics like that. Probably an original pdf was created by a different program and merely flattened by itext. In that case the constructs of the original pdf remain in a way. – mkl Sep 28 '20 at 11:34
  • As an aside: That online tool you used appears to only be some PDF editor but not a PDF redactor. There is lots of stuff in your file not visible in a PDF viewer but visible for anyone looking into the internals using tools like iText RUPS or PDFBox PDFDebugger. If you need to keep those "few legal details" secret, you had better use a different tool. – mkl Sep 28 '20 at 16:03

1 Answer


In a comment you confirm that

all check boxes and check marks are drawn identically

in your input documents.

You can therefore extract the check boxes and their check state by searching the page content for exactly those instruction sequences that draw the boxes and marks in the example document.

How Boxes And Check Marks Are Drawn

As you already found out, each box is drawn by filling one path per edge (top, right, bottom, left), like this for the "yes" box of question 1:

485.05 661.54 m
492.55 661.54 l
493.3 662.29 l
484.3 662.29 l
485.05 661.54 l
f
...
492.55 661.54 m
492.55 654.04 l
493.3 653.29 l
493.3 662.29 l
492.55 661.54 l
f
...
492.55 654.04 m
485.05 654.04 l
484.3 653.29 l
493.3 653.29 l
492.55 654.04 l
f
...
485.05 654.04 m
485.05 661.54 l
484.3 662.29 l
484.3 653.29 l
485.05 654.04 l
f

Inspecting all the boxes in the document you can see that their drawing instructions follow this pattern:

A B m
(A+7.5) B l
(A+8.25) (B+0.75) l
(A-0.75) (B+0.75) l
A B l
f
...
C B m
C (B-7.5) l
(C+0.75) (B-8.25) l
(C+0.75) (B+0.75) l
C B l
f
...
C D m
(C-7.5) D l
(C-8.25) (D-0.75) l
(C+0.75) (D-0.75) l
C D l
f
...
A D m
A (D+7.5) l
(A-0.75) (D+8.25) l
(A-0.75) (D-0.75) l
A D l
f 

Here A and C are the left and right x coordinates of the box and B and D are the top and bottom y coordinates thereof.
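The pattern above can also be read as a list of differences between consecutive path vertices, which is independent of where the box sits on the page. The following is only a minimal sketch of that idea in plain Java, without PDFBox; the class name `BoxEdgePattern`, its methods, and the tolerance value are assumptions made for illustration:

```java
final class BoxEdgePattern {
    // Differences between consecutive vertices of the top edge, read off the
    // pattern above: (+7.5, 0), (+0.75, +0.75), (-9, 0), (+0.75, -0.75).
    static final double[] BOX_TOP = {7.5, 0, 0.75, 0.75, -9, 0, 0.75, -0.75};

    // Assumed tolerance for comparing coordinates read from a content stream.
    static final double TOLERANCE = 0.001;

    /** Turns vertices x0,y0,x1,y1,... into differences between neighbouring vertices. */
    static double[] diffs(double[] vertices) {
        double[] result = new double[vertices.length - 2];
        for (int i = 0; i < result.length; i++)
            result[i] = vertices[i + 2] - vertices[i];
        return result;
    }

    /** Checks whether the vertex list follows the given difference pattern. */
    static boolean matches(double[] vertices, double[] pattern) {
        double[] d = diffs(vertices);
        if (d.length != pattern.length)
            return false;
        for (int i = 0; i < d.length; i++)
            if (Math.abs(d[i] - pattern[i]) >= TOLERANCE)
                return false;
        return true;
    }

    public static void main(String[] args) {
        // Top edge of the "yes" box for question 1, taken from the excerpt above.
        double[] top = {485.05, 661.54, 492.55, 661.54, 493.3, 662.29,
                        484.3, 662.29, 485.05, 661.54};
        System.out.println("top edge matches: " + matches(top, BOX_TOP));
    }
}
```

Comparing differences rather than absolute coordinates is what lets one pattern cover every box on the page.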

Similarly, each check mark is drawn by filling two paths (left and right half), like this for the mark in the "yes" box of question 1:

0.70711 -0.70711 0.70711 0.70711 -323.79 536.88 cm
...
489.55 661.54 m
489.55 657.79 l
490.3 657.04 l
490.3 661.54 l
489.55 661.54 l
f
...
489.55 657.79 m
488.05 657.79 l
488.05 657.04 l
490.3 657.04 l
489.55 657.79 l
f

Inspecting all the check marks in the document you can see that their drawing instructions follow this pattern:

0.70711 -0.70711 0.70711 0.70711 X Y cm 
...
A B m
A (B-3.75) l
(A+0.75) (B-4.5) l
(A+0.75) B l
A B l
f 
...
A C m
(A-1.5) C l
(A-1.5) (C-0.75) l
(A+0.75) (C-0.75) l
A C l
f 

The first line transforms the coordinate system by rotating it by 45° around some point; this allows drawing the check mark using mostly horizontal and vertical lines.

In this rotated coordinate system (A,B) are the coordinates of the top left corner of the longer check mark arm and (A,C) are those of the uppermost point of the line where the two arms of the check mark join.
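To see what the rotation does to the check mark segments, one can apply the `cm` matrix to the differences from the pattern above. This is a small sketch using only `java.awt.geom`, under the assumption that the six `cm` operands map one-to-one onto the six arguments of the `AffineTransform` constructor; the class name `CheckMarkRotation` is made up for illustration:

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

final class CheckMarkRotation {
    // "a b c d e f cm" from the example; e and f are the values of the first check mark.
    static final AffineTransform CM =
            new AffineTransform(0.70711, -0.70711, 0.70711, 0.70711, -323.79, 536.88);

    /** Maps a difference vector from the rotated system into page space. */
    static Point2D toPage(double dx, double dy) {
        // deltaTransform ignores the translation part, which suits difference vectors.
        return CM.deltaTransform(new Point2D.Double(dx, dy), null);
    }

    public static void main(String[] args) {
        // Segments of the longer arm in the rotated system, read off the pattern above.
        double[][] arm = {{0, -3.75}, {0.75, -0.75}, {0, 4.5}, {-0.75, 0}};
        for (double[] d : arm) {
            Point2D p = toPage(d[0], d[1]);
            System.out.printf(java.util.Locale.ROOT, "(%.2f, %.2f) -> (%.5f, %.5f)%n",
                    d[0], d[1], p.getX(), p.getY());
        }
    }
}
```

The vertical and horizontal segments become diagonal ones in page space, which is why any code that sees the already transformed path coordinates has to compare against rotated differences rather than the values in the pattern.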

How to Search for Those Instruction Sequences

A related task has been implemented in the PdfBoxFinder class in this answer, a class that collects lines drawn as thin, long rectangles forming a grid.

Thus, we can use the same foundation, the PDFBox PDFGraphicsStreamEngine class, in our case. We merely have to look at different kinds of paths (built by move-to and line-to instructions, not by rectangle instructions) and of course process the paths differently (instead of recognizing a grid, we must recognize our specific check boxes and check marks).

Such a check box finder class can be implemented like this:

import java.awt.geom.Point2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;

public class PdfCheckBoxFinder extends PDFGraphicsStreamEngine {
    public class CheckBox {
        public Point2D getLowerLeft()   {   return lowerLeft;   }
        public Point2D getUpperRight()  {   return upperRight;  }
        public boolean isChecked()      {   return checked;     }

        CheckBox(Point2D lowerLeft, Point2D upperRight, boolean checked) {
            this.lowerLeft = lowerLeft;
            this.upperRight = upperRight;
            this.checked = checked;
        }

        final Point2D lowerLeft;
        final Point2D upperRight;
        final boolean checked;
    }

    public PdfCheckBoxFinder(PDPage page) {
        super(page);
        for (int i = 0; i < pathAnchorsByType.length; i++)
            pathAnchorsByType[i] = new ArrayList<Point2D>();
    }

    public List<CheckBox> getBoxes() {
        if (checkBoxes.isEmpty()) {
            for (Point2D anchor : pathAnchorsByType[PathType.boxBottom.index]) {
                if (containsApproximatly(pathAnchorsByType[PathType.boxLeft.index], anchor) &&
                        containsApproximatly(pathAnchorsByType[PathType.boxRight.index], anchor) &&
                        containsApproximatly(pathAnchorsByType[PathType.boxTop.index], anchor)) {
                    Point2D upperRight = new Point2D.Float(7.5f + (float)anchor.getX(), 7.5f + (float)anchor.getY());
                    boolean checked = containsInRectangle(pathAnchorsByType[PathType.checkLeft.index], anchor, upperRight) &&
                            containsInRectangle(pathAnchorsByType[PathType.checkRight.index], anchor, upperRight);
                    checkBoxes.add(new CheckBox(anchor, upperRight, checked));
                }
            }
        }
        return Collections.unmodifiableList(checkBoxes);
    }

    boolean containsApproximatly(List<Point2D> points, Point2D anchor) {
        for (Point2D point : points) {
            if (approximatelyEquals(point.getX(), anchor.getX()) && approximatelyEquals(point.getY(), anchor.getY()))
                return true;
        }
        return false;
    }

    boolean containsInRectangle(List<Point2D> points, Point2D lowerLeft, Point2D upperRight) {
        for (Point2D point : points) {
            if (lowerLeft.getX() < point.getX() && point.getX() < upperRight.getX() &&
                    lowerLeft.getY() < point.getY() && point.getY() < upperRight.getY())
                return true;
        }
        return false;
    }

    //
    // PDFGraphicsStreamEngine overrides
    //
    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException {
        moveTo((float) p0.getX(), (float) p0.getY());
        path.add(new Rectangle(p0, p1, p2, p3));
    }

    @Override
    public void moveTo(float x, float y) throws IOException {
        currentPoint = new Point2D.Float(x, y);
        currentStartPoint = currentPoint;
    }

    @Override
    public void lineTo(float x, float y) throws IOException {
        Point2D point = new Point2D.Float(x, y);
        path.add(new Line(currentPoint, point));
        currentPoint = point;
    }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException {
        Point2D point1 = new Point2D.Float(x1, y1);
        Point2D point2 = new Point2D.Float(x2, y2);
        Point2D point3 = new Point2D.Float(x3, y3);
        path.add(new Curve(currentPoint, point1, point2, point3));
        currentPoint = point3;
    }

    @Override
    public Point2D getCurrentPoint() throws IOException {
        return currentPoint;
    }

    @Override
    public void closePath() throws IOException {
        path.add(new Line(currentPoint, currentStartPoint));
        currentPoint = currentStartPoint;
    }

    @Override
    public void endPath() throws IOException {
        clearPath();
    }

    @Override
    public void strokePath() throws IOException {
        clearPath();
    }

    @Override
    public void fillPath(int windingRule) throws IOException {
        processPath();
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException {
        clearPath();
    }

    @Override public void drawImage(PDImage pdImage) throws IOException { }
    @Override public void clip(int windingRule) throws IOException { }
    @Override public void shadingFill(COSName shadingName) throws IOException { }

    //
    // internal representation of a path
    //
    interface PathElement {
    }

    class Rectangle implements PathElement {
        final Point2D p0, p1, p2, p3;

        Rectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
            this.p0 = p0;
            this.p1 = p1;
            this.p2 = p2;
            this.p3 = p3;
        }
    }

    class Line implements PathElement {
        final Point2D p0, p1;

        Line(Point2D p0, Point2D p1) {
            this.p0 = p0;
            this.p1 = p1;
        }
    }

    class Curve implements PathElement {
        final Point2D p0, p1, p2, p3;

        Curve(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
            this.p0 = p0;
            this.p1 = p1;
            this.p2 = p2;
            this.p3 = p3;
        }
    }

    Point2D currentPoint = null;
    Point2D currentStartPoint = null;

    void clearPath() {
        path.clear();
        currentPoint = null;
        currentStartPoint = null;
    }

    void processPath() {
        for (PathType pathType : PathType.values()) {
            if (pathType.matches(path)) {
                pathAnchorsByType[pathType.index].add(pathType.getAnchor(path));
            }
        }

        clearPath();
    }

    enum PathType {
        boxTop(new float[] {7.5f, 0f, .75f, .75f, -9f, 0f, .75f, -.75f}, new float[] {0f, -7.5f}, 0),
        boxRight(new float[] {0f, -7.5f, .75f, -.75f, 0f, 9f, -.75f, -.75f}, new float[] {-7.5f, -7.5f}, 1),
        boxBottom(new float[] {-7.5f, 0f, -.75f, -.75f, 9f, 0f, -.75f, .75f}, new float[] {-7.5f, 0f}, 2),
        boxLeft(new float[] {0f, 7.5f, -.75f, .75f, 0f, -9f, .75f, .75f}, new float[] {0f, 0f}, 3),
        checkRight(new float[] {-2.65165f, -2.65165f, 0f, -1.06066f, 3.18198f, 3.18198f, -.53033f, .53033f}, new float[] {-2.65165f, -2.65165f/*-5.1072f, -4.4559f*/}, 4),
        checkLeft(new float[] {-1.06066f, 1.06066f, -.53033f, -.53033f, 1.59099f, -1.59099f, 0f, 1.06066f}, new float[] {0f, 0f/*-2.4556f, -1.8042f*/}, 5)
        ;
        PathType(float[] diffs, float[] offsetToAnchor, int index) {
            this.diffs = diffs;
            this.offsetToAnchor = offsetToAnchor;
            this.index = index;
        }

        boolean matches(List<PathElement> path) {
            if (path != null && path.size() * 2 == diffs.length) {
                for (int i = 0; i < path.size(); i++) {
                    PathElement element = path.get(i);
                    if (!(element instanceof Line))
                        return false;
                    Line line = (Line) element;
                    if (!approximatelyEquals(line.p1.getX() - line.p0.getX(), diffs[i*2]))
                        return false;
                    if (!approximatelyEquals(line.p1.getY() - line.p0.getY(), diffs[i*2+1]))
                        return false;
                }
                return true;
            }
            return false;
        }

        Point2D getAnchor(List<PathElement> path) {
            if (path != null && path.size() > 0) {
                PathElement element = path.get(0);
                if (element instanceof Line) {
                    Line line = (Line) element;
                    Point2D p = line.p0;
                    return new Point2D.Float((float)p.getX() + offsetToAnchor[0], (float)p.getY() + offsetToAnchor[1]);
                }
            }
            return null;
        }

        final float[] diffs;
        final float[] offsetToAnchor;
        final int index;
    }

    static boolean approximatelyEquals(double f, double g) {
        return Math.abs(f - g) < 0.001;
    }

    //
    // members
    //
    final List<PathElement> path = new ArrayList<>();

    final List<Point2D>[] pathAnchorsByType = new List[PathType.values().length];

    final List<CheckBox> checkBoxes = new ArrayList<>(); 
}

(PdfCheckBoxFinder)

You can use the PdfCheckBoxFinder like this to find the check boxes of a document and their checked states:

PDDocument document = ...
for (PDPage page : document.getPages())
{
    PdfCheckBoxFinder finder = new PdfCheckBoxFinder(page);
    finder.processPage(page);
    for (PdfCheckBoxFinder.CheckBox checkBox : finder.getBoxes()) {
        Point2D ll = checkBox.getLowerLeft();
        Point2D ur = checkBox.getUpperRight();
        String checked = checkBox.isChecked() ? "checked" : "not checked";
        System.out.printf(Locale.ROOT, "* (%4.3f, %4.3f) - (%4.3f, %4.3f) - %s\n", ll.getX(), ll.getY(), ur.getX(), ur.getY(), checked);
    }
}

(ExtractCheckBoxes test testExtractFromUpdatedForm)

For your example PDF one gets

* (485.050, 654.040) - (492.550, 661.540) - checked
* (508.630, 654.040) - (516.130, 661.540) - not checked
* (485.050, 641.760) - (492.550, 649.260) - checked
* (508.630, 641.760) - (516.130, 649.260) - not checked
* (485.050, 629.490) - (492.550, 636.990) - not checked
* (508.630, 629.490) - (516.130, 636.990) - checked
* (485.050, 617.220) - (492.550, 624.720) - checked
* (508.630, 617.220) - (516.130, 624.720) - not checked
* (485.050, 593.700) - (492.550, 601.200) - checked
* (508.630, 593.700) - (516.130, 601.200) - not checked
* (485.050, 581.420) - (492.550, 588.920) - checked
* (508.630, 581.420) - (516.130, 588.920) - not checked
* (485.050, 569.150) - (492.550, 576.650) - checked
* (508.630, 569.150) - (516.130, 576.650) - not checked
* (91.330, 553.500) - (98.830, 561.000) - not checked
* (125.570, 553.500) - (133.070, 561.000) - not checked
* (200.150, 553.500) - (207.650, 561.000) - not checked
* (286.220, 553.500) - (293.720, 561.000) - not checked
* (77.190, 331.430) - (84.690, 338.930) - not checked
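
In this output the boxes of one question share the same y coordinate, so a simple way to turn the list into per-question answers is to group the boxes by row. The following is a rough sketch in plain Java, assuming the "yes" box is always the leftmost box of its row; `CheckBoxRows` and its nested `Box` class are hypothetical stand-ins, not part of the finder class:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

final class CheckBoxRows {
    /** Hypothetical stand-in for a found check box: lower-left corner plus state. */
    static final class Box {
        final double x, y;
        final boolean checked;

        Box(double x, double y, boolean checked) {
            this.x = x;
            this.y = y;
            this.checked = checked;
        }
    }

    /** Groups boxes by y (rounded to 0.01), top row first, each row sorted left to right. */
    static List<List<Box>> rows(List<Box> boxes) {
        TreeMap<Long, List<Box>> byY = new TreeMap<>(Comparator.reverseOrder());
        for (Box box : boxes)
            byY.computeIfAbsent(Math.round(box.y * 100), k -> new ArrayList<>()).add(box);
        List<List<Box>> result = new ArrayList<>();
        for (List<Box> row : byY.values()) {
            row.sort(Comparator.comparingDouble((Box b) -> b.x));
            result.add(row);
        }
        return result;
    }

    /** Index of the first checked box in a row, or -1 if none is checked. */
    static int checkedIndex(List<Box> row) {
        for (int i = 0; i < row.size(); i++)
            if (row.get(i).checked)
                return i;
        return -1;
    }

    public static void main(String[] args) {
        // The first two rows of the output above.
        List<Box> boxes = new ArrayList<>();
        boxes.add(new Box(485.05, 654.04, true));
        boxes.add(new Box(508.63, 654.04, false));
        boxes.add(new Box(485.05, 641.76, true));
        boxes.add(new Box(508.63, 641.76, false));
        for (List<Box> row : rows(boxes))
            System.out.println("row at y=" + row.get(0).y + ": checked index " + checkedIndex(row));
    }
}
```

Rows with more than two boxes, like the one at y = 553.5, fall out of the same grouping; what each index means there depends on the form.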

(The coordinates are in the natural coordinate system given by the crop box of the PDF page in question. To relate to coordinates from the PDFTextStripper a transformation into the proprietary coordinate system of the text stripper may be necessary.)
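
A possible shape of such a transformation, sketched in plain Java: it assumes that the text stripper reports positions relative to the upper left corner of the crop box with y growing downwards, and that the page is not rotated. Both are assumptions worth checking against actual TextPosition values for your documents. The class name `TextStripperCoordinates` and the page height of 792 are made up for illustration:

```java
final class TextStripperCoordinates {
    /**
     * Converts a point from PDF page space (origin bottom left, y up) into a
     * system with its origin at the top left of the crop box and y growing down.
     * Assumes a page rotation of 0.
     */
    static double[] toTopLeft(double x, double y,
                              double cropLowerLeftX, double cropLowerLeftY, double cropHeight) {
        double shiftedX = x - cropLowerLeftX;
        double shiftedY = y - cropLowerLeftY;
        return new double[] {shiftedX, cropHeight - shiftedY};
    }

    public static void main(String[] args) {
        // Lower-left corner of the first box from the output above; the crop box
        // values (origin 0/0, height 792) are hypothetical.
        double[] p = toTopLeft(485.05, 654.04, 0, 0, 792);
        System.out.printf(java.util.Locale.ROOT, "(%.3f, %.3f)%n", p[0], p[1]);
    }
}
```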

Beware, though: as said at the start, the code above only works for check boxes and check marks built exactly as in your example PDF. You confirmed that this would be the case, but you may well be surprised.

If you actually encounter a (very!) few variations thereof, you can add PathType entries matching all of them and enhance getBoxes accordingly to recognize all those variations.

If you happen to come across more than only a few variations, you should go for OCR.

How to Combine the Check Boxes With Text Extraction

In a comment you proposed

is there a possibility if I can remove the graphics and replate it with some text for an example C or 'N' then I can do text extraction of the newly generated pdf

Indeed, one can simply add textual marks for checked and unchecked check boxes to the page and then apply text extraction to get the text including the marks. I would propose, though, to use Dingbats like ✔ and ✗. This can be done like this:

PDDocument document = ...;
PDType1Font font = PDType1Font.ZAPF_DINGBATS;
for (PDPage page : document.getPages())
{
    PdfCheckBoxFinder finder = new PdfCheckBoxFinder(page);
    finder.processPage(page);
    for (PdfCheckBoxFinder.CheckBox checkBox : finder.getBoxes()) {
        Point2D ll = checkBox.getLowerLeft();
        Point2D ur = checkBox.getUpperRight();
        String checkBoxString = checkBox.isChecked() ? "\u2714" : "\u2717";
        try (   PDPageContentStream canvas = new PDPageContentStream(document, page, AppendMode.APPEND, false, true)) {
            canvas.beginText();
            canvas.setNonStrokingColor(1, 0, 0);
            canvas.setFont(font, (float)(ur.getY()-ll.getY()));
            canvas.newLineAtOffset((float)ll.getX(), (float)ll.getY());
            canvas.showText(checkBoxString);
            canvas.endText();
        }
    }
}
PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition(true);
String text = stripper.getText(document);

(ExtractCheckBoxes test testExtractInlinedInTextFromUpdatedForm)

For your example PDF one gets

1. Have you met or discussed with principal life to be assured?   ✔ Yes  ✗ No
2. Is the principal life to be assured an existing bank customer?   ✔ Yes  ✗ No
3. Are you related to the proposed Life to be Assured? If yes, please state your relationship with applicant   ✗ Yes  ✔ No
4. Are you satisfied with the financial standing of the proposed Life to be Assured?   ✔ Yes  ✗ No
   What is the estimated annual income of the Life to be Assured? 600000
...
mkl
  • Thank you @mkl for the answer, I really appreciate the details provided about the graphics rendered in the PDF. A few items of `PDFGraphicsStreamEngine` will still require some reading from my side to understand its functions in detail. I will test with a few sample PDFs to make sure check boxes and check marks are built exactly with the same line strokes, and also try to relate the `PDFTextStripper` output to get the labels of the check boxes accordingly. – Sariq Shaikh Oct 01 '20 at 07:46
  • Rather than matching labels through coordinates, I wanted to know whether I can remove the graphics and replace them with some text, for example `C` or 'N', and then do text extraction on the newly generated PDF. Right now I am trying that approach, not sure whether it will be simpler or not. – Sariq Shaikh Oct 01 '20 at 07:47
  • I first would have thought of merging the results of the `PdfCheckBoxFinder` with those of the `PDFTextStripper` somehow; this is complicated, though, because of the proprietary coordinate system of the text stripper. Your idea of drawing some text at the position of the determined check box and thereafter using pure text extraction most likely is easier! – mkl Oct 01 '20 at 09:42
  • Thank you for the update, Stack Overflow doesn't notify about updates to the answer. I came back to check the answer once again because whatever I wrote was appended to the end of the stream, which was logical considering the APPEND flag was used to update the stream. I was thinking of sorting them by position in code, but `stripper.setSortByPosition(true);` did the trick. Thank you very much once again. – Sariq Shaikh Oct 07 '20 at 21:50