Unable to parse PDF with JPedal

Question

I'm facing a problem while parsing a PDF with JPedal.

While reading the word list from JPedal, I get garbled characters in it. The same thing happens when I use OCR, and when I copy the text from the PDF and paste it into Word or a plain text editor. As far as I understand, this PDF was generated by the Quartz PDF context on Mac OS X 10.6.4, which is used to compress the file size, but the file is easily viewable in PDF viewers. I searched for a Java API that supports decoding this kind of PDF but was unsuccessful. I'm looking for any application or Java API that I can use to decode it; it must be usable on a Linux machine.

How is the font embedded? If it won't work from Acrobat then you're probably out of luck: that probably means you've got a cut-down embedded font with no glyph <=> unicode mapping in the PDF. So you'll either need to edit one in or manually match up the characters after the fact. — Rup, Jul 02 '10 at 14:49
Hi Rup, I read your comment and tried this: I took the integer value of each character in the string and decremented it to get something like ASCII (it works like a simple Caesar cipher substitution). But I'm not sure that the offset is the same for every other Quartz-generated PDF that is to be tested. Thanks for the comment. — la89ondevg, Jul 03 '10 at 09:40
No, it probably won't be, sorry. I'd guess Quartz is stripping the font down to only contain the characters that actually get used, which may be different for every document. If you're lucky it might keep them in ASCII / Unicode order but it may well also not. If all of your documents use the same font it might be possible to extract the individual glyph data and look that up from a cache of glyph -> character code mapping, but I don't have good ideas beyond that. — Rup, Jul 03 '10 at 10:58
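The two ideas from these comments (a fixed Caesar-style offset, and a per-document lookup table from garbled code to real character) can be sketched in plain Java as below. This is only an illustration under assumptions: the class name GarbledTextDecoder and its methods are hypothetical, it does not call JPedal, and it assumes you already have the garbled string plus, for the table variant, a short sample whose correct text you know. As noted above, a table learned from one document may not apply to another.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of JPedal or any PDF library.
class GarbledTextDecoder {

    // Undo a fixed code-point offset (the Caesar-style guess).
    // Results outside the valid Unicode range become U+FFFD.
    static String shiftDecode(String garbled, int offset) {
        StringBuilder out = new StringBuilder();
        garbled.codePoints().forEach(cp -> {
            int decoded = cp - offset;
            out.appendCodePoint(Character.isValidCodePoint(decoded) ? decoded : 0xFFFD);
        });
        return out.toString();
    }

    // Learn a garbled -> real mapping from a sample whose correct text is known.
    // Assumption: both strings line up one code point to one code point.
    static Map<Integer, Integer> learnTable(String garbledSample, String knownText) {
        int[] g = garbledSample.codePoints().toArray();
        int[] k = knownText.codePoints().toArray();
        if (g.length != k.length) {
            throw new IllegalArgumentException("sample and known text differ in length");
        }
        Map<Integer, Integer> table = new HashMap<>();
        for (int i = 0; i < g.length; i++) {
            table.put(g[i], k[i]);
        }
        return table;
    }

    // Decode with an explicit table; unmapped code points become U+FFFD
    // so that gaps in the table stay visible instead of silently passing through.
    static String tableDecode(String garbled, Map<Integer, Integer> table) {
        StringBuilder out = new StringBuilder();
        garbled.codePoints().forEach(cp -> out.appendCodePoint(table.getOrDefault(cp, 0xFFFD)));
        return out.toString();
    }

    public static void main(String[] args) {
        // Fixed offset of 1: every character was stored one code point too high.
        System.out.println(shiftDecode("Ifmmp", 1));

        // Table learned from one known sample, then reused on other text
        // from the same document.
        Map<Integer, Integer> table = learnTable("Ifmmp", "Hello");
        System.out.println(tableDecode("pmf", table));
    }
}
```

The table variant is the more general of the two: if the subset font does not keep its characters in ASCII or Unicode order, no single offset will work, while a table can still be filled in character by character.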

score 1 · Answer 1 · answered Jul 16 '10 at 13:49

Hi everybody,

I'm posting a possible solution to the problem. Here is a link describing how Quartz parses the PDF. Of course, this still needs to be implemented in code, because so far I haven't found any ready-made API for it. I believe that Stack Overflow is all about taking the initiative and answering questions that have not been asked or answered before.

regards

Rituraj

la89ondevg

I'm posting this answer for those who have enough time to do it – la89ondevg Jul 16 '10 at 13:50
