
When I create a PDF form (for instance using Acrobat) that contains text fields in AcroForm format (PDF dictionaries, no XFA), and I submit the data to a server, how can I specify/retrieve the encoding that will be used?

For instance, when I submit the Chinese glyphs '测试' (test), I receive the following headers and content on the server side:

accept: application/x-ms-application, image/jpeg, application/xaml+xml, image/gif, image/pjpeg, application/x-ms-xbap, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*
content-type: application/x-www-form-urlencoded
content-length: 23
acrobat-version: 10.1.4
user-agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; MDDC; .NET4.0C; AskTbCLA/5.15.1.22229)
accept-encoding: gzip, deflate
connection: Keep-Alive
Song=%b2%e2%ca%d4&Test=

There's no reference to an encoding, except x-www-form-urlencoded. The two glyphs are represented as four bytes: B2 E2 CA D4. After some investigation, I know that B2E2 is the GBK value for the first glyph, and CAD4 the GBK value for the second glyph, but I can't derive this from the request header.
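To check that reading of the bytes, here is a minimal Python sketch that decodes the body shown above. The choice of GBK is an assumption passed in by hand; nothing in the request declares it.

```python
from urllib.parse import parse_qs

# Raw request body, copied from the submission shown above.
body = "Song=%b2%e2%ca%d4&Test="

# Assumption: the percent-encoded bytes are GBK. errors="strict" makes the
# call raise instead of silently substituting characters if that guess is wrong.
fields = parse_qs(body, keep_blank_values=True, encoding="gbk", errors="strict")

song = fields["Song"][0]
print(song)                                   # the decoded field value
print([f"U+{ord(ch):04X}" for ch in song])    # its Unicode code points
```

Decoding with the default encoding="utf-8" and errors="strict" raises a UnicodeDecodeError on this body, which is one way a server can notice that the submission was not sent as UTF-8.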

Is it always GBK? I want to change the data encoding by setting a specific key in a dictionary in the PDF, but there doesn't seem to be one. For instance, I would like to make sure the PDF always sends Unicode characters instead of GBK.

Note that I've already experimented with changing the default font (and encoding) of the text field. I've also searched ISO-32000-1 for encodings in fields, but all I found was a way to define non-Latin characters for check boxes, and some info about the encoding of an FDF file. Neither answered my question.

Bruno Lowagie
  • Are you submitting using a submit action or using javascript? I found the following phrase in the spec (implementation notes), maybe it is relevant for this question: "Because JavaScript 1.2 is not Unicode-compatible, PDFDocEncoding and the Unicode encoding are translated to a platform-specific encoding before interpretation by the JavaScript engine." – yms Sep 26 '12 at 16:09
  • I'm creating a Submit button in PDF, setting some flags to submit it as a POST (instead of a GET), etc... I don't use Javascript, but that doesn't mean that Adobe Reader isn't using Javascript under the hood. In any case, the response is encoded: the glyphs are sent to the server as %b2%e2 and %ca%d4 (GBK value of the glyphs). My question is: why not as %6d%4b and %8b%d5 (Unicode value of the glyphs)? – Bruno Lowagie Sep 26 '12 at 17:23
  • What I am saying is that maybe those GBK values are the "platform-specific encoding" that the "implementation details" section of the PDF spec is talking about. – yms Sep 26 '12 at 17:29
  • I saw in another section of the spec that a resource object specifying the encoding is supposed to be added by the reader to the DR entry of the field. Maybe you can change the flags of your submit action, send the whole file to the server instead, and take a look at this resource object. It may help diagnose the issue. – yms Sep 26 '12 at 17:43
  • Good to know, excellent question and accompanying answer. – user692942 Aug 16 '18 at 08:52

1 Answer


I've just found the answer to my main question myself. I didn't find anything in ISO-32000-1 or the ISO-32000-2 draft, but studying the Acrobat JavaScript reference, I found the cCharset parameter that is available for the submitForm() method. That parameter defines:

The encoding for the values submitted. String values are utf-8, utf-16, Shift-JIS, BigFive, GBK, and UHC. If not passed, the current Acrobat behavior applies. For XML-based formats, utf-8 is used. For other formats, Acrobat tries to find the best host encoding for the values being submitted. XFDF submission ignores this value and always uses utf-8.

In other words, GBK was used in my case because Acrobat judged it the best host encoding for the Chinese characters being submitted. You can force UTF-8 instead by calling the submitForm() JavaScript method with cCharset set to utf-8.
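To see what difference the charset makes on the wire, the following Python sketch percent-encodes the same value under several of the charsets listed for cCharset. It only models the byte encoding, not Acrobat itself. The mapping from the cCharset names to Python codec names is my own assumption, and utf-16 is left out because the quoted documentation does not say which byte order or BOM Acrobat would emit.

```python
from urllib.parse import quote

# Hypothetical mapping from the cCharset names quoted above to Python codecs.
CODECS = {
    "utf-8": "utf-8",
    "Shift-JIS": "shift_jis",
    "BigFive": "big5",
    "GBK": "gbk",
    "UHC": "cp949",
}

def encode_field(value, charset):
    """Percent-encode value in the given charset; None if it cannot be represented."""
    try:
        return quote(value.encode(CODECS[charset]), safe="")
    except UnicodeEncodeError:
        return None

results = {name: encode_field("测试", name) for name in CODECS}
for name, encoded in results.items():
    print(f"{name:10} {encoded or 'cannot represent this text'}")
```

The GBK result has the same bytes as the request body in the question (Python writes the hex digits in upper case, which is equivalent in percent-encoding). A charset that cannot represent the text yields None here, which hints at why a "best host encoding" has to be chosen per value when no charset is forced.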

Based on this question, I have asked the ISO committee to fix this problem in ISO-32000-2. As a result, an extra possible entry was added to the table entitled Additional entries specific to a submit-form action in section 12.7.6.2:

CharSet: string

(Optional; inheritable) Possible values include: utf-8, utf-16, Shift-JIS, BigFive, GBK, or UHC.

Starting with PDF 2.0, this problem will no longer exist.

Update: my suggestion made it into ISO 32000-2 (aka PDF 2.0).

The CharSet key doesn't exist in ISO 32000-1; it was introduced in ISO 32000-2.

Bruno Lowagie
  • While this surely resolves the issue for Adobe Acrobat/Reader, there still is no general resolution for PDF viewers in general. Good find nonetheless. – mkl Dec 16 '12 at 16:27
  • I've sent this to Adobe and to the ISO committee. I'll try to have this documented in ISO-32000-2. – Bruno Lowagie Dec 17 '12 at 07:17
  • @Lankymart Thanks, in the meantime, ISO 32000-2 was approved, and the **CharSet** parameter has been added to the spec. – Bruno Lowagie Aug 16 '18 at 09:31