When you convert a Word document containing Word form fields to PDF (using Save As *.pdf), unfortunately no PDF form fields are created from them. (That would have been neat.) Instead, checkboxes are stored as characters of the MS Gothic font.
So if you want to extract them, you need to extract the text of the PDF. A checkbox can have two states and is therefore represented by one of two characters:
☐ - U+2610 (unchecked)
☒ - U+2612 (checked)
Some example code:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Uses the PDFBox 2.x API.
public class CheckboxFinder {
    public static void main(String[] args) throws IOException {
        // Load your PDF; taking the path from the first argument is just an example.
        InputStream pdfIs = new FileInputStream(args[0]);
        RandomAccessBufferedFileInputStream rbfi = new RandomAccessBufferedFileInputStream(pdfIs);
        PDFParser parser = new PDFParser(rbfi);
        parser.parse();
        // Closing the PDDocument also closes the underlying COSDocument.
        try (PDDocument pdDoc = new PDDocument(parser.getDocument())) {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            String parsedText = pdfStripper.getText(pdDoc);
            // System.out.println("Full text" + parsedText);
            for (int i = 0; i < parsedText.length(); i++) {
                char c = parsedText.charAt(i);
                if (c == '☒') {
                    System.out.println("Found a checked box at index " + i);
                } else if (c == '☐') {
                    System.out.println("Found an unchecked box at index " + i);
                } else {
                    continue; // not a checkbox character
                }
                // Print the code point as a \\uXXXX escape.
                System.out.println(String.format("\\u%04x", (int) c));
            }
        }
    }
}
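The state of a checkbox is usually only useful together with the label next to it. Here is a minimal sketch of that step in plain Java, working on the string returned by the text stripper. It assumes each checkbox character comes directly before its label on the same line, which depends on how the Word document was laid out. The class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CheckboxLabels {
    static final char UNCHECKED = '\u2610'; // ☐
    static final char CHECKED = '\u2612';   // ☒

    private static boolean isBox(char c) {
        return c == CHECKED || c == UNCHECKED;
    }

    /** Maps the text that follows each checkbox to its state (true = checked). */
    static Map<String, Boolean> parse(String extractedText) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String line : extractedText.split("\\R")) {
            int i = 0;
            while (i < line.length()) {
                char c = line.charAt(i);
                if (!isBox(c)) {
                    i++;
                    continue;
                }
                // The label runs until the next checkbox or the end of the line.
                int end = i + 1;
                while (end < line.length() && !isBox(line.charAt(end))) {
                    end++;
                }
                result.put(line.substring(i + 1, end).trim(), c == CHECKED);
                i = end;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "\u2612 Yes \u2610 No\n\u2610 Maybe";
        System.out.println(parse(text)); // prints {Yes=true, No=false, Maybe=false}
    }
}
```

If two checkboxes share the same label text, the later one overwrites the earlier one in the map, so a list of pairs may be the better return type for real forms.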
Update:
You supplied an example PDF. In that file the checkbox is not a text character at all: it is stored as a "drawing", i.e. vector graphics in the page's content stream. If you look at the page object, the /Contents entry points you in the right direction:
3 0 obj
<<
/Type /Page
/Contents 4 0 R
...
You'll find the content in object 4 0, which starts with:
4 0 obj
<<
/Length 807
>>
stream
/P <</MCID 0>> BDC q
0.00000912 0 612 792 re
W* n
BT
/F1 9.96 Tf
1 0 0 1 72.024 710.62 Tm
/GS7 gs
0 g
/GS8 gs
0 G
[( )] TJ
ET
Q
EMC q
0.000018243 0 612 792 re
W* n
/P <</MCID 1>> BDC 0.72 w
0 G
1 j
73.104 696.34 9.24 9.24 re
S
0.48 w
72.984 705.7 m
82.464 696.22 l
S
82.464 705.7 m
72.984 696.22 l
S
Q
EMC /P <</MCID 2>> BDC q
0.00000912 0 612 792 re
W* n
BT
/F1 9.96 Tf
1 0 0 1 83.544 697.3 Tm
0 g
0 G
[( )] TJ
ET
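And this is basically how the checkbox is drawn: inside the block marked MCID 1, the operators 73.104 696.34 9.24 9.24 re followed by S stroke the square, and the two m ... l ... S sequences stroke the two diagonals of the cross. You can read this content stream with PDFBox, but you have to interpret and recognize the drawing yourself. Have a look at the PDF specification to see how these drawing operators are defined.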
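To give an idea of what "recognize it yourself" could look like, here is a rough sketch in plain Java. It is not a PDFBox API, and the class name, thresholds, and heuristics are my own assumptions. It takes the already decoded content stream as a string and splits it on whitespace, which is good enough for the excerpt above (a real tokenizer such as PDFBox's PDFStreamParser would be the robust choice). It ignores transformation matrices (cm) and only treats S and n as path-ending operators. A small stroked square counts as a checkbox, and it counts as checked if two stroked diagonals lie inside it.

```java
import java.util.ArrayList;
import java.util.List;

final class DrawnCheckboxFinder {
    static final double MAX_SIZE = 30.0;  // assumed upper bound for a checkbox side, in points
    static final double TOLERANCE = 1.0;  // assumed slack for line widths and rounding

    /** A stroked square and whether a cross was stroked inside it. */
    static final class Box {
        final double x, y, size;
        final boolean checked;
        Box(double x, double y, double size, boolean checked) {
            this.x = x; this.y = y; this.size = size; this.checked = checked;
        }
    }

    static List<Box> find(String contentStream) {
        List<double[]> rects = new ArrayList<>();        // stroked rectangles: x, y, w, h
        List<double[]> lines = new ArrayList<>();        // stroked segments: x1, y1, x2, y2
        List<double[]> pendingRects = new ArrayList<>(); // current, not yet painted path
        List<double[]> pendingLines = new ArrayList<>();
        List<Double> operands = new ArrayList<>();
        double curX = 0, curY = 0;

        for (String tok : contentStream.trim().split("\\s+")) {
            if (tok.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)")) {
                operands.add(Double.parseDouble(tok));
                continue;
            }
            int n = operands.size();
            switch (tok) {
                case "re":
                    if (n >= 4) {
                        pendingRects.add(new double[] {operands.get(n - 4), operands.get(n - 3),
                                operands.get(n - 2), operands.get(n - 1)});
                    }
                    break;
                case "m":
                    if (n >= 2) { curX = operands.get(n - 2); curY = operands.get(n - 1); }
                    break;
                case "l":
                    if (n >= 2) {
                        double x = operands.get(n - 2), y = operands.get(n - 1);
                        pendingLines.add(new double[] {curX, curY, x, y});
                        curX = x; curY = y;
                    }
                    break;
                case "S": // stroke: the pending path becomes visible
                    rects.addAll(pendingRects);
                    lines.addAll(pendingLines);
                    pendingRects.clear();
                    pendingLines.clear();
                    break;
                case "n": // end path without painting (used for clipping rectangles)
                    pendingRects.clear();
                    pendingLines.clear();
                    break;
                default:  // every other token is ignored in this sketch
                    break;
            }
            operands.clear();
        }

        List<Box> boxes = new ArrayList<>();
        for (double[] r : rects) {
            double w = r[2], h = r[3];
            if (w <= 0 || h <= 0 || w > MAX_SIZE || Math.abs(w - h) > TOLERANCE) {
                continue; // not a small square
            }
            int diagonals = 0;
            for (double[] l : lines) {
                boolean inside = within(l[0], r[0], w) && within(l[2], r[0], w)
                        && within(l[1], r[1], h) && within(l[3], r[1], h);
                boolean diagonal = Math.abs(l[2] - l[0]) > w / 2 && Math.abs(l[3] - l[1]) > h / 2;
                if (inside && diagonal) {
                    diagonals++;
                }
            }
            boxes.add(new Box(r[0], r[1], w, diagonals >= 2));
        }
        return boxes;
    }

    private static boolean within(double v, double start, double length) {
        return v >= start - TOLERANCE && v <= start + length + TOLERANCE;
    }

    public static void main(String[] args) {
        String stream = "0.72 w 0 G 1 j 73.104 696.34 9.24 9.24 re S 0.48 w "
                + "72.984 705.7 m 82.464 696.22 l S 82.464 705.7 m 72.984 696.22 l S";
        for (Box b : find(stream)) {
            System.out.println("box at " + b.x + "," + b.y + " checked=" + b.checked);
        }
    }
}
```

Other generators may draw the box with four lines instead of re, fill it instead of stroking it, or use a check mark instead of a cross, so treat the heuristics as a starting point for your specific documents.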