pdfmark for docinfo metadata in pdf is not accepting accented characters in Keywords or Subject

Question

I am inserting metadata into postscript files with a program, to be distilled to pdf with Adobe Distiller. I am using this code that I grabbed from an online chapter of Thomas Merz's "Web Publishing with Acrobat-PDF":

/pdfmark where {pop} {userdict /pdfmark /cleartomark load put} ifelse

[ /Title (mot accenté)

  /Author (mot accenté)

  /Subject (mot accenté)

  /Keywords (mot accenté) 

/DOCINFO pdfmark

When you look at the metadata in the resulting pdf, the accented characters turn into "?" in the Subject and Keywords fields, but not in the Title and Author fields. The character is the same byte in every field (233 in Latin-1).

I tried replacing them with octal encoding (\351), which came out the same (Title and Author okay, Subject and Keywords messed up).

The file encoding is Latin-1 with Unix line endings.

I found a mention on adobe forums, but the answer didn't make sense to me.

http://forums.adobe.com/message/1165593 and http://forums.adobe.com/thread/307687

I changed the encoding to utf-8, inserted the characters binarily (in VIM : <Ctrl-v>u00e9), no change. I tried inserting the BOM in a few places, it didn't work.

This is with the Distiller from Acrobat Pro 9 (9.3.3177)

I didn't notice this problem with Acrobat Pro 7.

Does anybody know of a workaround to get the accented characters into ALL the metadata fields when modifying a postscript file, or tell me if I'm doing it wrong?

It seems weird that different fields would not accept the same bytes.

Possibly related SO question: Unicode in PDF

I am embedding all fonts.

Hi, plinth, I don't understand your edit (version 5) where you added a close bracket. I tried that in my .ps file and Distiller won't accept it: Error: unmatchedmark; OffendingCommand: ] — rpilkey, Jun 17 '10 at 18:33
the edit adding the closing "]" is plain wrong. The `[` operator opens the code block, the `pdfmark` operator closes it. There is no need **and no place** for a `]`. — Kurt Pfeifle, Aug 16 '10 at 21:14

score 2 · Answer 1 · edited May 23 '17 at 12:26

Your last reference, "Unicode in PDF", contains a good hint: use hex strings (see the feedback from Mark Storer).

So instead of

[ /Title (mot accenté)

you could try

[ /Title <FEFF006D006F007400200061006300630065006E007400E9>

etc ...

It might be a little clumsy, but with a little help from shell scripts it let me add other special characters such as 'ä', 'õ' and 'ü' to PDF bookmarks.
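
The conversion to a hex string is easy to script. Here is a minimal sketch in Python rather than shell; the function name `ps_hex_string` is made up for illustration, and it assumes the target wants UTF-16BE preceded by the FE FF byte order mark, as suggested elsewhere in this thread:

```python
def ps_hex_string(text: str) -> str:
    """Return text as a PostScript hex string: UTF-16BE with a leading BOM."""
    # U+FEFF encoded as big-endian UTF-16 yields the two bytes FE FF.
    data = ("\ufeff" + text).encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


# "\u00e9" is e-acute; written as an escape so the script itself stays ASCII.
print("[ /Title " + ps_hex_string("mot accent\u00e9"))
```

A hex string also sidesteps any worry about parentheses or backslashes inside the value, since only hex digits appear between `<` and `>`.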

score 2 · Accepted Answer · answered Jun 17 '10 at 17:21


Can you try using UTF16-BE for the encoding and starting the strings with 254 and 255 (thorn and y-dieresis)?

plinth

I tried opening the .ps file in Notepad++, choosing Encoding > Convert to UCS-2 Big Endian, and saving. It added the BOM at the beginning of the file and doubled its size, so I think the conversion worked. Distiller errors out with: %%[ Error: undefined; OffendingCommand: þÿ ]%% %%[ Flushing: rest of job (to end-of-file) will be ignored ]%% %%[ Warning: PostScript error. No PDF file produced. ] %% So Distiller won't even look at a UCS-2 file here. This is on Windows XP, by the way, if that makes a difference. – rpilkey Jun 17 '10 at 18:16
You don't want to convert the whole file to utf16-be, only the strings, so your strings should be /Subject (þÿ...) etc. – plinth Jun 17 '10 at 18:35
Thanks. That works. The string that works for my example is: /Subject (þÿ^@m^@o^@t^@ ^@a^@c^@c^@e^@n^@t^@é) where "^@" is the nul byte. (that's how it's displayed in Vim) Putting this into ascii files will be a chore, but it's doable. I don't know why those two fields require this but "Title" and "Author" don't. – rpilkey Jun 17 '10 at 18:59
To type in the nul byte in Vim, Ctrl-V u 0000 – rpilkey Jun 17 '10 at 19:06

score 1 · Answer 3 · answered Jun 14 '10 at 18:54

So, you're supposed to be able to use an ANSI encoded file and any characters which are in the PDFDocEncoding set (which the French accented characters are), but that doesn't work.

Another method is to keep the Latin-1 encoded file but write each character as a 2-byte UTF-16 value in octal form (\xxx\xxx), and to start the string with the BOM: \377\376

So the above subject string "mot accenté" has to be translated to:

/Subject (\377\376\155\000\157\000\164\000\040\000\141\000\143\000\143\000\145\000\156\000\164\000\351\000)

This works, but it sucks. Anyone have anything better?
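
Writing those escapes by hand is the painful part, so it can be generated. A minimal sketch, assuming Python is available; `ps_octal_string` and its `byteorder` parameter are names invented here. With "little" it reproduces the string above (BOM \377\376); with the default "big" it emits the \376\377 form (bytes 254 and 255) suggested in the accepted answer:

```python
def ps_octal_string(text: str, byteorder: str = "big") -> str:
    """Return text as a PostScript literal string made only of octal escapes.

    The text is encoded as UTF-16 in the requested byte order, with the
    matching byte order mark in front.
    """
    codec = "utf-16-be" if byteorder == "big" else "utf-16-le"
    data = ("\ufeff" + text).encode(codec)
    # Every byte becomes a three-digit octal escape, so the file stays ASCII.
    return "(" + "".join("\\%03o" % b for b in data) + ")"


# "\u00e9" is e-acute; "little" matches the byte order used in this answer.
print("/Subject " + ps_octal_string("mot accent\u00e9", "little"))
```

Because every byte is escaped, the output contains no raw nul bytes and no stray parentheses, so it can be pasted into a plain ASCII .ps file.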

rpilkey

IIRC, it should be enough to just encode the accented character alone, keeping the rest in clear text ASCII. Like this: `/Subject (mot accent\351)`. – Kurt Pfeifle Aug 16 '10 at 21:17
Your solution works for the Title and Author fields, but not for the Subject and Keyword fields. This is with Adobe's Distiller 9.3.3177. – rpilkey Aug 19 '10 at 12:37

Kurt Pfeifle · Answer 4 · 2012-07-05T18:36:53.900

You do not need to escape/encode ALL the accented characters!

It is enough to keep the standard ASCII characters and just mix in the \NNN notation where a special character should appear.

The following Ghostscript command creates a two-page PDF with nearly empty pages, two bookmarks (outlines), and metadata containing accents. The example is for Windows; on Unix/Linux, use gs instead and change the line-continuation character from the DOS batch ^ to the Unix shell \:

gswin32c.exe ^
 -sDEVICE=pdfwrite ^
 -o 2-empty-pages-with-bookmarks-and-accents-in-metadata.pdf ^
 -c "[/Creator(brains&smarts)/Author(pipitas)/Subject(m\350t accent\351)/Title(mot accent\352)/Keywords(ganz sch\353\353 bl\353\353d!)/DOCINFO pdfmark" ^
 -c "[/Page 1 /View [/XYZ null null null] /Title (Page One) /OUT pdfmark" ^
 -c "[/Page 2 /View [/XYZ null null null] /Title (Page Two) /OUT pdfmark" ^
 -c "200 500 moveto /Helvetica findfont 100 scalefont setfont (One) show showpage 200 500 moveto (Two) show showpage quit"

I hope this finally settles your question "Does anybody know of a workaround to get the accented characters into ALL the metadata fields when modifying a postscript file?".
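
If the strings come from a program, the mixed notation described above (plain ASCII plus \NNN for the special characters) can be produced automatically. A minimal sketch in Python; `ps_latin1_string` is a made-up helper name, and it assumes every character fits in Latin-1. Note that, per the comments in this thread, this form was reported to work with Ghostscript but not for Subject and Keywords in Distiller 9.3:

```python
def ps_latin1_string(text: str) -> str:
    """Return text as a PostScript literal string, escaping only what needs it."""
    out = []
    # encode() raises UnicodeEncodeError if a character is outside Latin-1.
    for b in text.encode("latin-1"):
        ch = chr(b)
        if ch in "()\\":
            out.append("\\" + ch)       # string delimiters and the backslash
        elif 32 <= b < 127:
            out.append(ch)              # printable ASCII stays readable
        else:
            out.append("\\%03o" % b)    # everything else as a \NNN octal escape
    return "(" + "".join(out) + ")"


# "\u00e9" is e-acute; written as an escape so the script itself stays ASCII.
print("[ /Subject " + ps_latin1_string("mot accent\u00e9") + " /DOCINFO pdfmark")
```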

answered Aug 16 '10 at 21:54

Your solution works for the Title and Author fields, but not for the Subject and Keyword fields. This is with Adobe's Distiller 9.3.3177. – rpilkey Aug 19 '10 at 12:37
@rpilkey: it works for me for the Subject and Keywords fields without any obvious problem. Adobe Reader 9.3.3. – Kurt Pfeifle Aug 19 '10 at 14:06
Ah, but which distiller? You seem to be using Ghostscript, so it might be a bug in Adobe's distiller. – rpilkey Aug 19 '10 at 23:01
@rpilkey: Yes, my given commandline uses Ghostscript, and in the paragraph above I said: *"The following Ghostscript command creates a two page PDF."* – Kurt Pfeifle Aug 20 '10 at 08:40

DrBeco · Answer 5 · 2012-12-09T17:26:01.873

Although this does not directly answer your question, Google led me here when I searched for "pdf metadata accented".

So it may be useful for others to know that you can change a PDF's metadata using pdftk.

To include accented characters, use HTML numeric character references.

It took me a while to figure out why "Baçan" was shown as "BaÄ§an": PDF metadata does not accept UTF-8.

Example of metadata for Júlio Verne:

InfoKey: Author
InfoValue: J&#250;lio Verne

Also, I could use hexedit and manually insert the HEX code into the correct position.

é = HEX E9 HTML: &#233;
ç = HEX E7 HTML: &#231;
ú = HEX FA HTML: &#250;
ó = HEX F3 HTML: &#243;

and so on. Take a look at the table above.
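
For longer values, the HTML codes do not have to be looked up one by one. A minimal sketch in Python; `pdftk_info_value` is a name invented here, and it only converts non-ASCII characters (a literal `&` is left untouched, which may or may not be what pdftk expects):

```python
def pdftk_info_value(text: str) -> str:
    """Replace each non-ASCII character with a decimal reference such as &#250;."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# "\u00fa" is u-acute; written as an escape so the script itself stays ASCII.
print("InfoKey: Author")
print("InfoValue: " + pdftk_info_value("J\u00falio Verne"))
```

The two printed lines have the same shape as the Author example above.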

I hope this serves to help someone.
