I have a PDF file in Greek. A known problem occurs when copying and pasting text from it: the result is slight gibberish. I say slight rather than total because, while the pasted output does not make sense in Greek, it consists of valid Greek characters. Interestingly, not all characters are mapped incorrectly. For example, if you compare this original strip of text

ΕΞ. ΕΠΕΙΓΟΝ – ΑΜΕΣΗ ΕΦΑΡΜΟΓΗ
ΝΑ ΣΤΑΛΕΙ ΚΑΙ ΜΕ Ε-ΜΑIL

with the pasted one from the PDF:

ΔΞ. ΔΠΔΙΓΟΝ – ΑΜΔ΢Η ΔΦΑΡΜΟΓΗ
ΝΑ ΢ΣΑΛΔΙ ΚΑΙ ΜΔ Δ-ΜΑIL

you will notice that some of the characters are pasted correctly, while others are not. It may also be worth mentioning that the wrong mappings appear to be symmetric, e.g. Ε becomes Δ and vice versa.

When I open the PDF using e.g. Adobe, and print it using a PDF writer, in this case CutePDF, the output when copying and pasting is correct!

Given the above, my questions are the following:

  1. What is the root cause of this behavior?
  2. How would I go about integrating a solution into a Java-based workflow for arbitrary imported PDF files?

millenseed
  • What are you trying to do? I am pretty sure CutePDF is using a UTF-8 encoding whereas whatever else you are doing is probably ASCII. If you are trying to copy as an option in Windows, you need to install the language pack – Ya Wang Oct 02 '15 at 14:18
  • I am just trying to parse it eventually, but saving a correct version would be another plus. Imagine a java-based solution that is fed with PDF files and parses the output, passing it on to the next module. – millenseed Oct 02 '15 at 14:19
  • https://pdfbox.apache.org/ should handle utf-8 – Ya Wang Oct 02 '15 at 14:20
  • This is an encoding issue... I'm guessing that application you are copying from has a different encoding to the application you are pasting into – lance-java Oct 02 '15 at 14:20
  • http://stackoverflow.com/questions/5425251/using-pdfbox-to-write-utf-8-encoded-strings-to-a-pdf Here's a link to a guide about pdfbox and utf8 encoding. – Ya Wang Oct 02 '15 at 14:22
  • I am a bit confused why your question is tagged with `java`. I did have similar issues with PDFs when printing/rendering them with Java/JavaScript. There the problem was that the fonts were not embedded, or if they were, they could not be reconstructed properly. That was why it worked with Acrobat but not with some other applications. Have a look at [this](https://blogs.mtu.edu/gradschool/2010/04/27/how-to-determine-if-fonts-are-embedded/) description, which might give you a hint on which fonts are used and embedded. Also check which fonts are installed on the system. – hotzst Oct 02 '15 at 14:23
  • But what does CutePDF do under the hood? The `java` tag was there because I want to incorporate the solution in a java-based framework. – millenseed Oct 02 '15 at 14:49
  • The interesting parts are: (1) the font claims to use `/WinAnsiEncoding`, (2) the text is stored as double-byte codes (which combined with (1) makes me scratch my head), and (3) my own PDF reading tool, written with the specifications in hand, yields *exactly* the same erroneous output as yours and as Adobe Acrobat itself. I'd need a good look at my own source code just to recall how exactly this part works (well, badly, in this case). – Jongware Oct 02 '15 at 15:07
  • Unfortunately, though, nothing can be changed on the publisher's side. These are supposed to be documents issued by public authorities. What do (1) and (2) tell us, though? – millenseed Oct 02 '15 at 15:16
  • Well, how deep does your current PDF fu go? WinAnsiEncoding is supposed to be a simple one-to-one byte mapping. So I need to look into my source to see what actually happens when it receives *double* bytes. WinAnsiEncoding is not supposed to handle that. – Jongware Oct 02 '15 at 17:20

1 Answer

Some basic context:

Displaying text in PDF is done by selecting glyphs from a font. A glyph is the visual representation of one or more characters. Glyph selection is done using character codes. For text extraction, you need to know which characters correspond with a character code.

In this case, this is achieved using a ToUnicode CMap.

In this document, the first letter of the text snippet, E, is displayed like this:

[0x01FC, ...] TJ

The ToUnicode CMap contains this entry:

4 beginbfrange
<01f9> <01fc> <0391>
...
endbfrange

This means that character codes 0x01F9, 0x01FA, 0x01FB and 0x01FC are mapped to Unicode U+0391, U+0392, U+0393 and U+0394 respectively.

U+0394 is the Greek delta, Δ, that shows up when copy/pasting, instead of the expected epsilon, Ε (U+0395).

The next letter is painted using character code 0x0204. The relevant ToUnicode entry is <0200> <020b> <039a>, which maps it correctly to U+039E (Ξ): 0x0204 is the fifth code in the range, so it lands four code points past U+039A.
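To make the range arithmetic concrete, here is a minimal Java sketch of how a `bfrange` entry resolves a character code. It only models the two entries quoted above; the class and method names (`ToUnicodeDemo`, `toUnicode`) are made up for illustration and are not part of any PDF library.

```java
class ToUnicodeDemo {
    // Each row is {first code, last code, Unicode value of the first code}.
    // Only the two bfrange entries quoted in this answer are modelled.
    static final int[][] BF_RANGES = {
        {0x01F9, 0x01FC, 0x0391},
        {0x0200, 0x020B, 0x039A},
    };

    /** Returns the mapped code point, or -1 if no range covers the code. */
    static int toUnicode(int code) {
        for (int[] r : BF_RANGES) {
            if (code >= r[0] && code <= r[1]) {
                // Codes in a range map to consecutive code points.
                return r[2] + (code - r[0]);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        for (int code : new int[] {0x01FC, 0x0204}) {
            System.out.printf("code 0x%04X -> U+%04X%n", code, toUnicode(code));
        }
        // prints:
        // code 0x01FC -> U+0394
        // code 0x0204 -> U+039E
    }
}
```

The first result is the delta that appears in the pasted text; the second is the correctly mapped Ξ.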

So, you're getting slight gibberish because only some of the Unicode mappings are wrong. Sometimes this is done on purpose, e.g. to prevent data mining. I have seen it before in financial reports.
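If a known-good transcription of a short passage is available, the affected characters can be listed by comparing it with the extracted text position by position. The sketch below does this for the word ΕΠΕΙΓΟΝ from the question; `MappingDiff` and `diff` are hypothetical names, and it assumes both strings have the same length and that each wrong character always stands for the same intended one. The resulting table is specific to this document and may not carry over to other files.

```java
import java.util.Map;
import java.util.TreeMap;

class MappingDiff {
    /** Returns extracted char -> expected char for every position that differs. */
    static Map<Character, Character> diff(String expected, String extracted) {
        if (expected.length() != extracted.length()) {
            throw new IllegalArgumentException("strings must have equal length");
        }
        Map<Character, Character> fixes = new TreeMap<>();
        for (int i = 0; i < expected.length(); i++) {
            char want = expected.charAt(i);
            char got = extracted.charAt(i);
            if (want != got) {
                fixes.put(got, want);
            }
        }
        return fixes;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        // Expected: EPSILON PI EPSILON IOTA GAMMA OMICRON NU
        String expected = "\u0395\u03A0\u0395\u0399\u0393\u039F\u039D";
        // Extracted: DELTA PI DELTA IOTA GAMMA OMICRON NU
        String extracted = "\u0394\u03A0\u0394\u0399\u0393\u039F\u039D";
        diff(expected, extracted).forEach((got, want) ->
            System.out.printf("U+%04X should be U+%04X%n",
                (int) got.charValue(), (int) want.charValue()));
        // prints: U+0394 should be U+0395
    }
}
```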

rhens
  • Can a CMap be replaced? – millenseed Oct 05 '15 at 12:34
  • Yes. Using a PDF library that offers low-level access to PDF objects, it's just a matter of traversing the PDF structure, locating the **ToUnicode** stream and replacing it with a new one. The challenge will be in creating the proper **ToUnicode** map. There's no guarantee you can construct a generic one that you can reuse in different PDF files. – rhens Oct 07 '15 at 20:48
  • I've tried the following flows: (1) open with Adobe Reader and print with CutePDF (prints to file successfully and the text is OK); (2) open with Chrome and print with CutePDF (fails: the resulting PDF is basically an image); (3) print with CutePDF from Java (creates exactly the same PDF as the original). Is it the case that Adobe Reader does something that the other methods can't mimic? – millenseed Oct 09 '15 at 10:45
  • Can you provide the output of *Open with Adobe Reader and print with CutePDF*? Probably the **ToUnicode** is fixed or removed. – rhens Oct 09 '15 at 11:21