Using C#, I want to check whether a specific check box is checked on a PDF page. The PDF is not a fillable form.

The PDF could look something like this: (image not shown)

Sample file: MDS30ResidentP2.pdf. In this sample, I want to determine that check box "E" in question A1000 is checked. Again, the PDF is not in "form" format!

PS: None of the following posts solved my problem:

Tohid
  • So something like [OCR](http://en.wikipedia.org/wiki/Optical_character_recognition)? – gunr2171 Aug 08 '14 at 19:15
  • OCR is probably the only way. From the PDF perspective, there's a rectangle and some of those rectangles have two lines drawn through them. They're not even images but actual vector drawing commands. You could possibly look for that extra drawing of an "x", but it is unrelated to the text that appears beside it, so you'd have to write some fuzzy logic to estimate which "x" goes with which "text", and I think you'd end up with a bunch of false positives. If you've got a bunch of these it might be worth writing something, otherwise OCR or manual entry. – Chris Haas Aug 08 '14 at 19:36
  • @ChrisHaas - So if I can somehow get the position of that check box and the "X" in it, I can figure out the rest. Do you know how I can do that? Any sample code? – Tohid Aug 08 '14 at 20:16
  • You can try [something like this](http://stackoverflow.com/a/8744643/231316) which is a little ugly but if you're parsing the same PDF over and over again it might work ok. If you want something more generic and reusable I would check out the creator of iText's [post here](http://stackoverflow.com/a/16961918/231316). His post is for optional content groups but it should give you some ideas to start with. – Chris Haas Aug 09 '14 at 15:35
  • Thank you @ChrisHaas. I'm working on it now and I think I'm heading in the right direction, thanks to you. Please merge your two comments and post them as an answer, and I'll mark it as the correct answer. It will help people with the same question. – Tohid Aug 11 '14 at 13:10

1 Answer

OCR is probably the only way. From the PDF perspective, there's a rectangle and some of those rectangles have two lines drawn through them. They're not even images but actual vector drawing commands. You could possibly look for that extra drawing of an "x", but it is unrelated to the text that appears beside it, so you'd have to write some fuzzy logic to estimate which "x" goes with which "text", and I think you'd end up with a bunch of false positives. If you've got a bunch of these PDFs it might be worth writing something, otherwise OCR or manual entry.

If you want to parse the PDF you can try [something like this](http://stackoverflow.com/a/8744643/231316), which is a little ugly, but if you're parsing the same PDF over and over again it might work OK. If you want something more generic and reusable I would check out the creator of iText's [post here](http://stackoverflow.com/a/16961918/231316). His post is for optional content groups but it should give you some ideas to start with.
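Whichever way you pull the drawing commands out, the matching step is plain geometry. Below is a minimal sketch of it, assuming you have already collected the rectangles and straight line segments from the page's content stream in page coordinates. `Box`, `Segment` and `CheckBoxMatcher` are hypothetical names made up for this illustration, not iTextSharp types, and the tolerance and minimum-length rules are guesses you would need to tune against your own files.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper types for illustration; they are not part of iTextSharp.
public readonly struct Box
{
    public readonly double X, Y, Width, Height; // lower-left corner and size, in PDF points

    public Box(double x, double y, double width, double height)
    { X = x; Y = y; Width = width; Height = height; }

    // True when the point lies inside the box, allowing `tol` points of slack.
    public bool Contains(double px, double py, double tol) =>
        px >= X - tol && px <= X + Width + tol &&
        py >= Y - tol && py <= Y + Height + tol;
}

public readonly struct Segment
{
    public readonly double X1, Y1, X2, Y2;

    public Segment(double x1, double y1, double x2, double y2)
    { X1 = x1; Y1 = y1; X2 = x2; Y2 = y2; }

    public double Length =>
        Math.Sqrt((X2 - X1) * (X2 - X1) + (Y2 - Y1) * (Y2 - Y1));

    // Positive for a rising diagonal, negative for a falling one,
    // zero for horizontal or vertical lines.
    public double Direction => (X2 - X1) * (Y2 - Y1);
}

public static class CheckBoxMatcher
{
    // Assumption: a box is "checked" when it contains at least one rising and
    // one falling diagonal, each at least half as long as the box's shorter side.
    public static bool IsChecked(Box box, IEnumerable<Segment> segments, double tol = 1.0)
    {
        double minLength = Math.Min(box.Width, box.Height) / 2;

        var candidates = segments
            .Where(s => s.Length >= minLength
                     && box.Contains(s.X1, s.Y1, tol)
                     && box.Contains(s.X2, s.Y2, tol))
            .ToList();

        return candidates.Any(s => s.Direction > 0)
            && candidates.Any(s => s.Direction < 0);
    }
}
```

You would call `CheckBoxMatcher.IsChecked(box, segments)` once per rectangle on the page. Working out which rectangle belongs to option "E" still means comparing the rectangle's position with the position of the nearby text, which is the fuzzy part mentioned above.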

Chris Haas