1

I am having trouble with non-ASCII characters being returned. I am not sure at which level the issue resides: it could be the actual PDF encoding, the stream filter that CAM::PDF has to decode (FlateDecode), or CAM::PDF itself. The following returns a string full of the operators used to draw the page (Tm, Tj, etc.).

use strict;
use warnings;
use CAM::PDF;

my $filename = "sample.pdf";
my $cam_obj = CAM::PDF->new($filename) or die "$CAM::PDF::errstr\n";
# Parse the content stream of page 1 and serialize it back to text
my $tree = $cam_obj->getPageContentTree(1);
my $page_string = $tree->toString();
print $page_string;

You can download sample.pdf here

The text in the Tj operands often contains one non-ASCII character. In the PDF as displayed, that character is almost always a single or double quote.

While reproducing this I found that the returned character is consistent within the PDF but varies amongst PDFs. I also noticed the PDF is using a specific font file. I'm now looking into font files to see if the same character can be mapped to varying binary values.

Edit: Regarding Windows-1252. My PDF returns an "Õ" instead of apostrophes. The Õ character is 0xD5 in Windows-1252 and U+00D5 in Unicode (UTF-8 encodes it as the two bytes 0xC3 0x95). If the idea is that the character is encoded with Windows-1252, then it should be 0x91 or 0x92, which it is not. That is why the following does nothing to the character:

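use strict;
use warnings;
use Encode qw(decode encode);

my $page_string = 'Õ';
my $characters = decode 'Windows-1252', $page_string;
my $octets = encode 'UTF-8', $characters;
# Three-argument open with a lexical filehandle; fail loudly if the file cannot be created
open my $sts, '>', 'TEST.txt' or die "Cannot open TEST.txt: $!";
print {$sts} $octets . "\n";
close $sts;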
Michael
  • You can drastically improve your chances at getting a decent answer by [providing a test case that exhibits the problem](http://sscce.org/). – daxim May 11 '12 at 07:34
  • daxim I added examples of which text is not displaying properly. Thanks. – Michael May 11 '12 at 13:15
  • Did you read the advice I linked to? If you cannot show a PDF for which this occurs, and the whole program, not just two lines without context, it is barely possible to [reproduce the problem](http://www.chiark.greenend.org.uk/~sgtatham/bugs.html#showmehow). Include these important details, or you restrict the pool of answerers to those CAM-PDF experts who happen to be able to read your code through their magic crystal ball. – daxim May 11 '12 at 14:25
  • daxim I did not understand why you told me to do that. The Perl code as you see is very simple. The problem can now be reproduced easily with the supplied code and PDF. Thank you, in creating a sample PDF I made a few useful observations. – Michael May 11 '12 at 15:19

3 Answers

1

I'm the author of CAM-PDF. Your PDF is non-compliant. From the PDF 1.7 specification, section 3.2.3 "String Objects":

"Within a literal string, the backslash (\) is used as an escape character for various purposes, such as to include newline characters, nonprinting ASCII characters, unbalanced parentheses, or the backslash character itself in the string. [...] The \ddd escape sequence provides a way to represent characters outside the printable ASCII character set."

If you have large quantities of non-ASCII characters, you can represent them using hexadecimal string notation.
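
As an illustration of the two notations mentioned above, here is a minimal Perl sketch that writes the same bytes once as a literal string with `\ddd` octal escapes and once as a hexadecimal string. The helper names `pdf_literal_string` and `pdf_hex_string` are made up for this example and are not part of CAM::PDF.

```perl
use strict;
use warnings;

# Hypothetical helper: write a byte string as a PDF literal string,
# escaping \, ( and ) and turning bytes outside printable ASCII into \ddd octal.
sub pdf_literal_string {
    my ($bytes) = @_;
    $bytes =~ s/([\\()])/\\$1/g;
    $bytes =~ s/([^\x20-\x7E])/sprintf('\\%03o', ord($1))/ge;
    return "($bytes)";
}

# Hypothetical helper: the same bytes in hexadecimal string notation.
sub pdf_hex_string {
    my ($bytes) = @_;
    return '<' . uc(unpack('H*', $bytes)) . '>';
}

my $text = "It\x92s";    # 0x92 is the right single quote in Windows-1252
print pdf_literal_string($text), "\n";    # prints (It\222s)
print pdf_hex_string($text), "\n";        # prints <49749273>
```

Both outputs carry the same four bytes; only the notation differs, and both use ASCII characters only.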

EDIT: Perhaps my interpretation of the spec is incorrect, given a_note's alternative answer. I'll have to revisit this... Certainly, the spec could be clearer in this area.

Chris Dolan
  • Chris, first thank you very much for writing the module, it has been very helpful. I'm not sure I understand the implications of your answer. It leads me to believe because of the non-compliance the text cannot somehow be decoded to the original ASCII characters. I thought that may have been true, but then in a PDF editor ([CosEdit](http://www.pdftron.com/pdfcosedit/)) I saw that it correctly displayed the quotes. – Michael May 13 '12 at 04:44
  • The truth is unfortunate. Most of the good PDF viewers strive to reach parity with Adobe Reader instead of complying with the spec. That means that features like smart quotes or extraneous carriage returns get supported even though they're technically illegal. I'm just one developer, so I decided a long time ago to stay strict to the specification. That's gotten me a lot of complaints like yours. But what else can I realistically do except apologize? – Chris Dolan May 14 '12 at 04:07
  • Oh, I see. Don't get me wrong, I'm not trying to complain. I would do the same if I were able to code such an extensive module. Thank you for helping me understand. – Michael May 14 '12 at 04:34
  • Heh, "complaints" was too strong a word. I should have said "reports". All said, CAM::PDF should reproduce the string byte-for-byte. So, if the input PDF had illegal characters then the output PDF should also have the same illegal characters. Thanks for the nice words. – Chris Dolan May 14 '12 at 13:30
  • If the misbehaving PDF writer cannot be repaired, work around it in Perl: `use Encode qw(decode encode); my $characters = decode 'Windows-1252', $page_string; my $octets = encode 'UTF-8', $characters; print $octets;` – daxim May 14 '12 at 14:03
  • Thank you daxim, but that is not working. It just prints another non-ASCII character... ├ – Michael May 17 '12 at 04:47
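
The decode/encode round trip from the comments above can be checked without relying on what a terminal displays. This is a minimal sketch: the sample bytes in `$raw` are an assumption standing in for real Tj text, and `cp1252` is the Encode name for Windows-1252.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed sample: raw bytes as they might come out of a Tj operand
my $raw = "It\x92s \x93quoted\x94";

my $characters = decode('cp1252', $raw);          # bytes -> Perl characters
my $octets     = encode('UTF-8', $characters);    # characters -> UTF-8 bytes

# Show code points instead of glyphs, so the terminal's code page cannot mislead
for my $ch (grep { ord($_) > 0x7F } split //, $characters) {
    printf "U+%04X\n", ord($ch);    # prints U+2019, U+201C, U+201D in turn
}
```

If the code points come out as expected but the console still shows odd glyphs, the bytes are likely fine and the display is the part to look at.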
1

Sorry to intrude, and with all due respect, sir, but the file IS compliant. Section 3.2.3 further states:

[The \ddd] notation provides a way to specify characters outside the 7-bit ASCII character set by using ASCII characters only. However, any 8-bit value may appear in a string.

a_note
  • Probably all you want is this: `$page_string =~ s/[\x93\x94]/"/g; $page_string =~ s/[\x91\x92]/'/g; print $page_string;` (I can't add comments to the upper thread.) – a_note May 17 '12 at 14:36
  • thank you a_note. I was thinking of doing that. The problem is future PDFs may have other characters which show up incorrectly and I cannot be constantly fixing them with this approach. I am hoping to find a solution which would work for all of these "special" characters. – Michael May 17 '12 at 16:23
  • Sorry, but what solution are you looking for? To map everything to ASCII 32-127? True quotes, as above, you don't like. What about (C), (R), etc. symbols? N- & m-dashes? Degree, plusminus, etc. etc.? – a_note May 17 '12 at 16:38
  • Yes the problem is I don't understand why these characters are not decoding properly. In the future, if a (C) or (R) decode as a random character, I need them to show up as the proper (C) or (R). A simple substitution assumes that all PDFs will decode to the same hex value and that I know the hex values of all conflicting characters. These are two conditions I cannot meet. At some point in the PDF there must be information which I can use to determine the proper character for a dynamic substitution and that is what I'm trying to figure out. – Michael May 17 '12 at 19:34
  • Why "not decoding properly" and "random character"?? Look for the Encoding entry in the Font dictionary in your file -- WinAnsiEncoding. Characters \x92, \x93, \x94 are what they are according to it, or 'Windows-1252' as pointed out above. Your file is very simple, no subsets, no CID fonts, so no problem at all. – a_note May 18 '12 at 03:12
  • Ooh I see. I understand now. Thank you. It is now working. The problem before with the encode/decode was also that ms cmd prompt does not display the chr correctly but it prints in a text file fine. Thanks a_note and daxim! – Michael May 18 '12 at 14:34
  • Alright, I thought it made sense. Decoding 0x93 and the others works correctly. However, the hex code of the chars I am receiving, e.g. Õ, is 0xD5 in Windows-1252 and likewise U+00D5 in Unicode. I assumed the Õ had the hex value of 0x91 or 0x92, which it does not. I did find the WinAnsiEncoding in the PDF dict... – Michael May 18 '12 at 15:30
  • Now I'm here :). As to your edit: your PDF doesn't return an "Õ" instead of apostrophes. You are struggling with the encoding in your DOS prompt. There's lots of information elsewhere, and what I say may not be the best way. Go to the properties of the cmd.exe window, choose Font, choose Lucida Console. Type chcp at the command prompt (just for the record). Now type chcp 1252. Now run your Perl script. If you did everything correctly, you should see quoteright (not apostrophes) in the output of the script – a_note May 19 '12 at 00:01
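
The idea in the comments above, choosing the decoding from the font's Encoding entry rather than substituting fixed bytes, could be sketched as follows. Everything here is an assumption for illustration: `decode_pdf_text` and `%ENCODE_NAME_FOR` are made-up names, the two Encode tables are only close approximations of the PDF base encodings, and fonts with a Differences array, subset fonts, or CID fonts are not handled.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed mapping from PDF base-encoding names to Encode names.
# Only these two are covered; anything else is rejected, not guessed.
my %ENCODE_NAME_FOR = (
    WinAnsiEncoding  => 'cp1252',
    MacRomanEncoding => 'MacRoman',
);

# Hypothetical helper: decode string bytes according to the font's Encoding name.
sub decode_pdf_text {
    my ($pdf_encoding, $bytes) = @_;
    my $name = $ENCODE_NAME_FOR{$pdf_encoding}
        or die "No mapping for encoding '$pdf_encoding'\n";
    return decode($name, $bytes);
}

printf "U+%04X\n", ord(decode_pdf_text('WinAnsiEncoding',  "\x92"));    # prints U+2019
printf "U+%04X\n", ord(decode_pdf_text('MacRomanEncoding', "\xD5"));    # prints U+2019
```

The second call shows that the right single quote sits at 0xD5 in the MacRoman table. That is a property of the table only, not a diagnosis of sample.pdf, which declares WinAnsiEncoding.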
0

"receiving" - where? You get "Õ" instead of the expected what? And doing exactly what? You know that the Windows command prompt uses a DOS code page, not Windows-1252, right? (Oops, new thread again... probably I should register here :-) )

a_note