
I understand that it is impossible to determine the character encoding of any string-form data just by looking at the data. This is not my question.

My question is: Is there a field in a PDF file where, by convention, the encoding scheme is specified (e.g. UTF-8)? This would be something roughly analogous to `<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8">` in HTML.

Thank you very much in advance, Blz

Louis Thibault

2 Answers


A quick look at the PDF specification suggests that a PDF file can contain strings in different encodings; have a look at page 86. So a PDF library with some kind of low-level access should be able to tell you the encoding used for a given string. But if you just want the text and don't care about the internal encodings used, I would suggest letting the library take care of the conversions for you.
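
As a minimal sketch of that last approach, assuming the pdfminer.six fork of PDFMiner (and a placeholder file name), the library resolves each string's internal encoding for you and hands back ordinary Unicode text:

```python
# Sketch: let the library handle all internal PDF string encodings.
# Assumes pdfminer.six; "paper.pdf" is a placeholder file name.
from pdfminer.high_level import extract_text

text = extract_text("paper.pdf")  # returns an already-decoded Python str
print(text[:200])
```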

Mattias Wadman
  • Thanks for the link. I suppose my original question still stands... is there any way to get the encoding(s) in file metadata? – Louis Thibault May 18 '12 at 17:10
  • The information is there, but it might be a large project to write a parser yourself to figure out which encodings are used. What problem are you trying to solve? What will you use the list of encodings for? – Mattias Wadman May 18 '12 at 17:14
  • Mattias, I'm using PDFMiner to extract text fields from scientific papers. I'd like to know what the encoding scheme of the PDF is in order to properly interpret the string text. If at all possible, I'd like to *not* rely on user input (most users don't know what UTF is, anyway), nor would I like to guess the encoding using heuristics. – Louis Thibault May 18 '12 at 17:16
  • In that case I don't think you even have to do the work yourself; PDFMiner will most probably handle all string encoding conversions for you. By [looking at the code](https://github.com/euske/pdfminer/blob/master/pdfminer/converter.py#L144) it seems the text output (-t text) is written in UTF-8 encoding (which can be changed with -c). – Mattias Wadman May 18 '12 at 17:24
  • Yes, but that's precisely the problem. By default, it outputs to UTF-8, but I'd like to be able to check if the PDF file is, indeed, encoded as such. – Louis Thibault May 21 '12 at 11:02
  • Why do you want to know the different encodings used inside the PDF? Isn't it good that you don't need to deal with that? If the problem is that you want to output the text from the PDF in a different encoding, I think you're better off always extracting it as UTF-8 and converting it to whatever encoding you want. – Mattias Wadman May 21 '12 at 11:23
  • Perhaps I'm misunderstanding what PDFMiner is doing... As I understand it, it needs to know what the encoding scheme is in order to output the text to that same format. By default, it assumes UTF-8, but it has no way of knowing whether or not this is correct. Perhaps I'm wrong? Perhaps it figures out the encoding scheme either by taking an educated guess or by looking at a data field that contains the information? – Louis Thibault May 28 '12 at 12:55
  • I'm quite sure PDFMiner parses and decodes all the different encodings in a PDF to Unicode internally, and then you can use `-c` to change what encoding to output (the default is UTF-8). But why debate? Can't you just find or create some PDFs with different encodings and try? – Mattias Wadman May 28 '12 at 14:45
  • That's exactly what I've been doing... testing PDFs of various encodings. Unfortunately, I'm getting a lot of garbled text, so I'm guessing that either PDFMiner is incorrectly identifying the encoding, or that it's assuming UTF-8 in all cases... Hence my original question: is there a field within the PDF format that contains information about the encoding scheme? – Louis Thibault May 28 '12 at 15:42
  • Ok good! How are you using the output from PDFMiner? Can you give a code example, and maybe input and output examples? What is the final format the text from the PDFs will be used in? An HTML page? – Mattias Wadman May 28 '12 at 17:46
  • -1 for link-only answer. http://meta.stackexchange.com/questions/8231/are-answers-that-just-contain-links-elsewhere-really-good-answers – Mark E. Haase Mar 28 '14 at 16:46
  • Ok, but it's not really _only_ a link, and at least a couple of people have found it useful and upvoted. – Mattias Wadman Mar 28 '14 at 18:36

PDF uses "named" characters, in the sense that a character is a name and not a numeric code. Character "a" has name "a", character "2" has name "two" and the euro sign has name "euro", to give a few examples. PDF defines a few "standard" "base" encodings (named "WinAnsiEncoding", "MacRomanEncoding" and a few more, can't remember exactly), an encoding being a one-to-one correspondence between character names and byte values (yes, only 0 to 255). The exact, normative values for these predefined encodings are in the PDF specification. All these encodings use the ASCII values for the US-ASCII characters, but they differ in higher byte values.

A PDF file may define new encodings by taking a "base" encoding (say, WinAnsiEncoding) and redefining a few bytes. So a PDF author may, for example, define a new encoding named "MySuperbEncoding" as WinAnsiEncoding but with byte value 65 changed to mean character "ntilde" (this definition goes inside the PDF file), and then specify that some strings in the file use encoding "MySuperbEncoding". In this case, a string containing byte values 65-66-67 would mean characters "ñBC" and not "ABC". And note that I mean characters, nothing to do with glyphs or fonts. Different strings within the PDF file may use different encodings (this provides a way of using more than 256 characters in the PDF file, even though every string is defined as a byte sequence, and one byte always corresponds to one character).
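
Continuing the toy model above, the hypothetical "MySuperbEncoding" would behave roughly like this (the comment shows how such an override is actually written inside the PDF as an encoding dictionary with a /Differences array):

```python
# Hypothetical encoding from the text above: WinAnsiEncoding with byte
# 65 redefined to mean "ntilde". Inside the PDF this would appear as:
#   << /Type /Encoding /BaseEncoding /WinAnsiEncoding /Differences [65 /ntilde] >>
MY_SUPERB = dict(WIN_ANSI)
MY_SUPERB[65] = "ntilde"

raw = bytes([65, 66, 67])
print(decode(raw, WIN_ANSI))   # -> "ABC"
print(decode(raw, MY_SUPERB))  # -> "ñBC"
```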

So, the answer to your question is: characters within a PDF file can well be encoded internally in an ad-hoc encoding made on the spot for that specific PDF file. PDF parsers should make the appropriate substitutions when necessary. I do not know PDFMiner but I'm surprised that it (being a PDF parser) gives incorrect values, as the specification is very clear on how this must be interpreted. It IS possible to get all the necessary information from the PDF file, but, as Mattias said, it might be a large project and I think a program named PDFMiner should do exactly this kind of job.
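
For what it's worth, here is a sketch of that "large project" in miniature, assuming pdfminer.six and a placeholder file name: walk each page's font resources and print whatever encoding each font declares, which at least shows where the information lives.

```python
# Sketch, assuming pdfminer.six: inspect what encoding each font declares.
# "paper.pdf" is a placeholder file name.
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdftypes import resolve1

with open("paper.pdf", "rb") as f:
    doc = PDFDocument(PDFParser(f))
    for pageno, page in enumerate(PDFPage.create_pages(doc), 1):
        fonts = resolve1((page.resources or {}).get("Font")) or {}
        for name, ref in fonts.items():
            font = resolve1(ref)
            # /Encoding may be a name (e.g. /WinAnsiEncoding), a dictionary
            # with /BaseEncoding and /Differences, or absent entirely.
            print(pageno, name, font.get("Subtype"),
                  resolve1(font.get("Encoding")))
```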

  • Your answer is actually misleading - it's perfectly possible to encode text in a PDF document in such a way that it cannot be extracted in a meaningful way. If the fonts used don't use simple encodings and there is no "ToUnicode" information present, you're left with something you can print but not extract / convert to, say, UTF-16. This is the reason that some standards (such as PDF/A-1a, for example) require ToUnicode information to be present for all text. – David van Driessche Dec 03 '15 at 14:27
  • @Jojonete *(yes, only 0 to 255)* - No. You completely ignore *Composite Fonts*, which can have multi-byte encodings, even mixed ones; e.g. the predefined encoding **GBK2K-H** is a *mixed 1-, 2-, and 4-byte encoding*. And this is by far not the only misinformation in your answer. – mkl Dec 03 '15 at 15:22
  • Knowing the font family, is there a way to follow up the text extraction, leverage that font somehow, and still end up with readable text? – jayarjo May 03 '21 at 09:12