I have a PDF file that was created from a Word document (Save As PDF). The Word document contains a few checkboxes, some selected and some not. In the converted PDF they look like checkboxes, but they are neither real checkboxes nor images.

I need to read these checkbox values (selected/not selected), but I am unable to. I am trying with PDFBox. I thought the checkboxes were images and tried to extract all the images in the PDF, but the things that look like checkboxes are not images.

I want to know how these checkboxes are saved in the PDF. How can I read their values?

Please suggest any APIs; I will try them.

Thanks, Daya

Tilman Hausherr
Dayananda
  • Sadly, you forgot to share your PDF. Maybe the "checkboxes" are ordinary vector graphics. In that case, you'll have to use OCR or a product like ReadSoft forms (not free). – Tilman Hausherr Aug 28 '19 at 09:56
  • Thanks, I have attached the image. What I did: I created a document with a single checkbox and converted the Word document to PDF (using the Save As option). I am able to extract the text, but the checkboxes are not part of the text. – Dayananda Aug 29 '19 at 05:25
  • We need the PDF, not the image. In PDF there are many different ways to produce the same image. Also mention what PDFBox version you are using (hopefully 2.0.16). – Tilman Hausherr Aug 29 '19 at 07:00
  • Sorry, please find the link to the PDF document: [link](https://github.com/dayanandabv/test/blob/master/Yes%20%20checkbox.pdf). The PDFBox version is 2.0.9; I can download 2.0.16 for testing. – Dayananda Aug 29 '19 at 09:28
  • It's a vector graphic. See at/below `73.104 696.34 9.24 9.24 re` (which is a rectangle) – Tilman Hausherr Aug 29 '19 at 10:04
  • How to extract lines: https://stackoverflow.com/questions/38931422/ – Tilman Hausherr Aug 29 '19 at 10:09
  • Thanks for the reply, I will try it. – Dayananda Aug 29 '19 at 11:29

1 Answer

When you convert a Word document containing Word form fields into a PDF (using Save As PDF), unfortunately no PDF form fields are created from them (which would have been neat). Checkboxes can be stored as characters of the MS Gothic font. In that case, extracting them means extracting the text of the PDF. The checkbox has two states and thus two characters:

☐ - Unicode U+2610 (unchecked)

☒ - Unicode U+2612 (checked)

Some example code:

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class CheckboxTextScan {

    // Written against PDFBox 2.0.x; pass the path to your PDF as the first argument.
    public static void main(String[] args) throws IOException {
        // try-with-resources closes the document (and the underlying file) when done
        try (PDDocument pdDoc = PDDocument.load(new File(args[0]))) {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            String parsedText = pdfStripper.getText(pdDoc);
            //System.out.println("Full text" + parsedText);

            for (int i = 0; i < parsedText.length(); i++) {
                char c = parsedText.charAt(i);
                if (c == '\u2612') {        // ☒ checked
                    System.out.println("Found a checked box at index " + i);
                    System.out.println(String.format("\\u%04x", (int) c));
                } else if (c == '\u2610') { // ☐ unchecked
                    System.out.println("Found an unchecked box at index " + i);
                    System.out.println(String.format("\\u%04x", (int) c));
                }
                // any other character is skipped
            }
        }
    }
}
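
If the box characters do show up in the extracted text, you usually also want to know which option each box belongs to. The following is only a sketch under an assumption about your layout: that the label follows its box on the same line and runs until the next box or the end of the line. The class `CheckboxLabels` and the method `labelStates` are made-up names, and the input string stands in for whatever your text stripper returns.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CheckboxLabels {
    static final char UNCHECKED = '\u2610'; // ☐
    static final char CHECKED = '\u2612';   // ☒

    /** Maps the text following each box character (up to the next box or line end) to its state. */
    static Map<String, Boolean> labelStates(String text) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            int i = 0;
            while (i < line.length()) {
                char c = line.charAt(i);
                if (c != CHECKED && c != UNCHECKED) {
                    i++;
                    continue;
                }
                // The label runs from just after the box to the next box or the line end.
                int end = i + 1;
                while (end < line.length()
                        && line.charAt(end) != CHECKED
                        && line.charAt(end) != UNCHECKED) {
                    end++;
                }
                result.put(line.substring(i + 1, end).trim(), c == CHECKED);
                i = end;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a PDF.
        String text = "\u2612 Yes \u2610 No\nSome other line";
        System.out.println(labelStates(text)); // {Yes=true, No=false}
    }
}
```

Note that identical labels overwrite each other in the map, so a form that repeats "Yes"/"No" for several questions needs a different key, for example the line number plus the label.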

Update:

You supplied an example PDF. In it the checkbox is not text: it is drawn as vector graphics in the page's content stream. The page object's /Contents entry points you in the right direction: `3 0 obj << /Type /Page /Contents 4 0 R ...`. The content is in object `4 0 obj`, which starts with:

4 0 obj
<<
/Length 807
>>
stream
 /P <</MCID 0>> BDC q
0.00000912 0 612 792 re
W* n
BT
/F1 9.96 Tf
1 0 0 1 72.024 710.62 Tm
/GS7 gs
0 g
/GS8 gs
0 G
[( )] TJ
ET
Q
 EMC q
0.000018243 0 612 792 re
W* n
 /P <</MCID 1>> BDC 0.72 w
0 G
 1 j 
73.104 696.34 9.24 9.24 re
S
0.48 w

72.984 705.7 m
82.464 696.22 l
S

82.464 705.7 m
72.984 696.22 l
S
Q
 EMC  /P <</MCID 2>> BDC q
0.00000912 0 612 792 re
W* n
BT
/F1 9.96 Tf
1 0 0 1 83.544 697.3 Tm
0 g
0 G
[( )] TJ
ET

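This is basically how the checkbox is drawn: `re` appends the rectangle, each `m`/`l` pair defines one diagonal of the cross, and `S` strokes the path. You can read these operators with PDFBox, but you have to interpret and recognize the shape yourself. See the PDF specification for how the drawing instructions are interpreted.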
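
To illustrate what "recognize it yourself" could look like, here is a rough sketch in plain Java. It is not PDFBox code. It assumes you already have the decoded content stream as a string, splits it on whitespace (a real parser must handle strings, arrays and inline images properly), and treats any small, roughly square `re` rectangle as a checkbox that counts as checked when both of its diagonals are drawn with `m`/`l`. The names `VectorCheckboxSketch`, `checkboxStates`, the `maxSide` parameter and the tolerance value are all made up for this example, and the "square with an X" heuristic is only derived from the one sample file above.

```java
import java.util.ArrayList;
import java.util.List;

class VectorCheckboxSketch {
    static final double TOLERANCE = 0.5; // assumed slack between line ends and box corners

    /** One entry per small, roughly square rectangle: true if both diagonals are drawn across it. */
    static List<Boolean> checkboxStates(String content, double maxSide) {
        String[] tok = content.trim().split("\\s+");
        List<double[]> rects = new ArrayList<>();    // {x, y, width, height}
        List<double[]> segments = new ArrayList<>(); // {x1, y1, x2, y2}
        double[] current = null;                     // current point set by "m" or "l"
        for (int i = 0; i < tok.length; i++) {
            if (tok[i].equals("re")) {
                double[] r = operands(tok, i, 4);
                if (r != null) rects.add(r);
            } else if (tok[i].equals("m")) {
                current = operands(tok, i, 2);
            } else if (tok[i].equals("l")) {
                double[] p = operands(tok, i, 2);
                if (p != null && current != null) {
                    segments.add(new double[] {current[0], current[1], p[0], p[1]});
                }
                current = p;
            }
        }
        List<Boolean> states = new ArrayList<>();
        for (double[] r : rects) {
            boolean boxLike = r[2] > 0 && r[2] <= maxSide && Math.abs(r[2] - r[3]) < TOLERANCE;
            if (!boxLike) continue; // skips e.g. the page-sized clipping rectangles
            double x0 = r[0], y0 = r[1], x1 = r[0] + r[2], y1 = r[1] + r[3];
            boolean rising = false, falling = false;
            for (double[] s : segments) {
                if (joins(s, x0, y0, x1, y1)) rising = true;  // bottom-left to top-right
                if (joins(s, x0, y1, x1, y0)) falling = true; // top-left to bottom-right
            }
            states.add(rising && falling);
        }
        return states;
    }

    /** Reads the 'count' numeric operands that precede the operator at opIndex, or null. */
    static double[] operands(String[] tok, int opIndex, int count) {
        if (opIndex < count) return null;
        double[] v = new double[count];
        try {
            for (int k = 0; k < count; k++) v[k] = Double.parseDouble(tok[opIndex - count + k]);
        } catch (NumberFormatException e) {
            return null;
        }
        return v;
    }

    /** True if segment s connects points a and b (in either direction), within TOLERANCE. */
    static boolean joins(double[] s, double ax, double ay, double bx, double by) {
        return (near(s[0], s[1], ax, ay) && near(s[2], s[3], bx, by))
            || (near(s[0], s[1], bx, by) && near(s[2], s[3], ax, ay));
    }

    static boolean near(double px, double py, double qx, double qy) {
        return Math.abs(px - qx) <= TOLERANCE && Math.abs(py - qy) <= TOLERANCE;
    }

    public static void main(String[] args) {
        // Operators copied from the content stream shown above.
        String sample = "0.72 w 0 G 1 j 73.104 696.34 9.24 9.24 re S 0.48 w "
                + "72.984 705.7 m 82.464 696.22 l S 82.464 705.7 m 72.984 696.22 l S";
        System.out.println(checkboxStates(sample, 20.0)); // [true]
    }
}
```

For real documents it is probably more robust to let PDFBox do the parsing, for instance by subclassing `PDFGraphicsStreamEngine` and collecting the rectangles and line segments in its callbacks, and to apply the same geometric test to those.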

Lonzak
  • Thanks for the reply. I have tried that way as well, but the parsed text does not contain these checkbox characters. That is why I checked whether they are images, but it seems they are not images either. I have attached an image of what I did: I added a form field checkbox as shown in the attached image, converted the Word document to PDF, and used PDFBox to get the text out of it. The checkbox characters are not part of the text. – Dayananda Aug 29 '19 at 05:21
  • @Dayananda you probably should share the PDF and not count on people trying to reproduce and fix the issue by fabricating the same PDF as you have. – mkl Aug 29 '19 at 06:27
  • Sorry, please find the link to the PDF document: [link](https://github.com/dayanandabv/test/raw/master/Yes%20%20checkbox.pdf) – Dayananda Aug 29 '19 at 09:19