I'm trying to extract the text from this file:

https://www.dropbox.com/s/249snnj1nsve5ir/Lebenslauf.pdf?dl=0

using CGPDFScanner. From the font dictionary in the PDF I can see that the character encoding is WinAnsiEncoding, but the characters all come out garbled. As a cross-check, I tried copying and pasting text from the Preview app on Mac OS X, which works, so it must be possible to extract the text as strings somehow. On the other hand, the commercial third-party framework http://www.fastpdfkit.com cannot extract the text correctly either.

Does anyone have an idea what I'm missing?

As a side note, I was using https://github.com/KurtCode/PDFKitten to scan the PDF.

skubo
  • Just found out that on iOS 8 [NSString stringEncodingForData ... ] returns "Cyrillic" as the NSStringEncoding for the content part of the PDF. Still, even creating a string from the data stream with that encoding does not give readable output. – skubo Apr 17 '15 at 11:48
  • Well, "Cyrillic" is provably not correct :) The encoding for the main font on p.1 is given as `/WinAnsiEncoding`, with a `/Differences` array overwriting character codes for 1 to 75 (ish), which *may* clarify why assuming it's plain Win ANSI fails. But! it also contains a `/ToUnicode` table, and any PDF text extractor worth its storage space ought to be able to use that. My own, highly experimental, PDF reader can read the text just fine, with `ä`s and `ö`s and `ß`s and all. Perhaps you should post your (currently failing) code. – Jongware Apr 18 '15 at 13:12
  • Thanks for your insights. Stepping through the PDFKitten project, it does not seem to use the `/ToUnicode` table for the included fonts. As for the code: if you clone that project from the link above, the garbled text arrives in the scanner.m class, but the mapping happens in a bunch of other classes, which would be too much to post here. – skubo Apr 20 '15 at 05:45
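
To make the `/ToUnicode` point from the comments concrete, here is a minimal sketch of how such a table maps character codes to Unicode. It is written in Python for brevity rather than as a patch to PDFKitten, and several things are assumptions: the function names `parse_to_unicode` and `decode` are hypothetical, the sample CMap is made up (not taken from Lebenslauf.pdf), the font is assumed to use one-byte character codes, and only the plain `bfchar` and `bfrange` forms are handled (not the array form of `bfrange`).

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"

def parse_to_unicode(cmap_text):
    """Build a {character code: Unicode string} map from a ToUnicode CMap."""
    mapping = {}
    # bfchar sections list pairs: <source code> <destination as UTF-16BE hex>
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, section):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange sections list triples: <first code> <last code> <first destination>
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, section):
            start = int(dst, 16)  # assumes a single BMP code point
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(start + offset)
    return mapping

def decode(codes, mapping):
    """Translate one-byte character codes; unmapped codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)

# Made-up CMap excerpt in the shape a /ToUnicode stream typically has.
SAMPLE_CMAP = """
5 beginbfchar
<01> <0047>
<02> <0072>
<03> <00FC>
<04> <00DF>
<05> <0065>
endbfchar
1 beginbfrange
<06> <08> <0061>
endbfrange
"""

if __name__ == "__main__":
    table = parse_to_unicode(SAMPLE_CMAP)
    print(decode(b"\x01\x02\x03\x04\x05", table))
```

With a table like this, the low character codes that a `/Differences` array reassigns no longer need to be interpreted as WinAnsiEncoding at all; the bytes from the content stream are looked up directly. In an extractor, the CMap text would come from the decoded `/ToUnicode` stream of the font dictionary.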

0 Answers