
I finally got a PDF scanner to work. The callback functions are invoked without a problem, but when I NSLog the string obtained via CGPDFScannerPopString, I get output like this:

ˆ ˛˝     #    ˜˜˜      #˜'  ˜˜˜      "˜   '˜˜      " '   ˜˜

No string to be found here...

Any ideas of what it can be? This is my callback function:

static void op_Tj (CGPDFScannerRef s, void *info)
{
    CGPDFStringRef string;

    if (!CGPDFScannerPopString(s, &string))
        return;

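    // CGPDFStringCopyTextString returns a +1 CFStringRef; hand ownership to ARC so it is not leaked.
    NSLog(@"string: %@", (__bridge_transfer NSString *)CGPDFStringCopyTextString(string));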
}

Thanks already!

Edit: Example PDF

Ron

1 Answer


You should be aware that a CGPDFStringRef is not an ASCII string or anything similar. Cf. http://developer.apple.com/library/mac/documentation/graphicsimaging/Reference/CGPDFString/Reference/reference.html: it is a "series of bytes", that is, "unsigned integer values in the range 0 to 255", which have to be interpreted according to the PDF reference.

The PDF reference in turn will tell you that the interpretation of the bytes depends on the font used, and while ASCII-like interpretations are common in case of European languages, they are not mandatory, and in case of Asian languages where font subset embedding is very common, the interpretation may look random.

CGPDFStringCopyTextString tries to interpret those bytes accordingly, but there does not have to be a sensible interpretation as a regular string.

EDIT: Inspection of the sample PDF Ron supplied showed that the font in object 3 0 (which is dominant on most pages of the document) indeed does not use a standard encoding but instead:

<</Type/Encoding
  /Differences[0/.notdef/C/O/V/E/R/space/slash/H/L/F/underscore/W/B/five/eight/four
                /zero/two/six/D/one/period/three/Z/I/N/G/U/S/T/colon/seven/A/M/P/Y
                /plus/nine/X/hyphen/i/s/p/a/t/c/h/n/f/o/K/greater/equal/l/m/y/J/Q
                /parenleft/parenright/comma/dollar/ampersand/d/r/v/b/e/u/w/k/g/x/bar
                /quotesingle/asterisk/q/question/percent]
>>

Looking at the top of the first document page

COVER / HLF_CWEB_58408485 / 58408485 / 26DEC12 10.30.22Z


BRIEFING INCLUDES FOLLOWING FLIGHTS:

26DEC12 OR0337 EHAM0630 MUVR1710 PHOYE VSM+2/8 179

NEXT FLIGHTS OF AIRCRAFT:

26DEC12 OR0338 MUVR1830 MMUN1940 PHOYE VSM+2/8 213
26DEC12 OR0338 MMUN2105 EHAM0655 PHOYE GPT+2/7 263
27DEC12 OR0365 EHAM0900 TNCB1930 PHOYE BAH+1/8 272
27DEC12 OR0366 TNCB2030 TNCC2110 PHOYE BAH+1/8 250
27DEC12 OR0366 TNCC2250 EHAM0835 PHOYE ASD+1/8 199 

that encoding seems to have been created by dealing out the next number starting from one for the next required glyph. This obviously results in a highly individualistic encoding...
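To make the effect of such an encoding concrete, here is a minimal sketch in plain C that decodes raw string bytes through the /Differences array quoted above. The table is transcribed by hand from that array, with glyph names such as /slash or /five written as the characters '/' and '5'; the names kDifferences and decode_with_differences, and the use of '~' for unmapped codes, are assumptions made for this illustration and not part of any Apple API.

```c
#include <stddef.h>

/* Characters for codes 1..79, transcribed by hand from the /Differences
   array above (glyph names such as /slash or /five written as '/' and '5').
   Code 0 is /.notdef and has no character. */
static const char kDifferences[] =
    "COVER /HLF_WB584026D1.3ZINGUST:7AMPY+9X-ispatchnfoK>=lmyJQ(),$&drvbeuwkgx|'*q?%";

/* Decodes len bytes into out (capacity cap, including the terminator).
   Unmapped codes become '~', which is not part of this encoding.
   Returns the number of characters written, excluding the terminator. */
static size_t decode_with_differences(const unsigned char *bytes, size_t len,
                                      char *out, size_t cap)
{
    const size_t mapped = sizeof kDifferences - 1;
    size_t n = 0;

    if (cap == 0)
        return 0;

    for (size_t i = 0; i < len && n + 1 < cap; i++) {
        unsigned char code = bytes[i];
        out[n++] = (code >= 1 && code <= mapped) ? kDifferences[code - 1] : '~';
    }
    out[n] = '\0';
    return n;
}
```

Fed the bytes 1, 2, 3, 4, 5, this should yield "COVER", which matches the start of the first page. The table only applies to this one font in this one document; other fonts in the same file may use a different mapping.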

That being said, the font object does include both an /Encoding entry and a /ToUnicode entry. Thus, if CGPDFStringCopyTextString were given a reference to the font and really tried, it could easily translate those bytes into the corresponding text. That it doesn't achieve anything decent seems to indicate that it simply does not know which font the bytes are to be interpreted for; I don't assume it doesn't try...

For accurate text extraction, therefore, you have to interpret the bytes in the CGPDFStringRef yourself, using the information from the font used in the content stream. If you don't want to do that from scratch, you might be interested in PDFKitten, a framework for extracting data from PDFs in iOS. While it is not yet perfect (some font structures can baffle it), it is a good starting point.
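As a rough idea of what interpreting the bytes yourself can involve, the following plain C sketch reads the bfchar sections of a /ToUnicode CMap into a lookup table from single-byte codes to Unicode code points. It is a simplification under stated assumptions: it handles only one-byte source codes and destinations that fit in one hex number, and it ignores bfrange sections and multi-character destinations entirely. The function name parse_bfchar is hypothetical, and obtaining the decoded CMap stream from the font dictionary is not shown.

```c
#include <stdio.h>
#include <string.h>

/* Reads "<src> <dst>" hex pairs between "beginbfchar" and "endbfchar"
   and stores map[src] = dst for every src below 256.
   Entries that are not mentioned are left untouched.
   Returns the number of mappings stored. */
static int parse_bfchar(const char *cmap, unsigned map[256])
{
    int count = 0;
    const char *p = cmap;

    while ((p = strstr(p, "beginbfchar")) != NULL) {
        p += strlen("beginbfchar");
        const char *end = strstr(p, "endbfchar");
        if (end == NULL)
            break;                      /* unterminated section: stop */

        while (p < end) {
            unsigned src, dst;
            int used = 0;               /* stays 0 if the closing '>' is missing */
            if (sscanf(p, " <%x> <%x>%n", &src, &dst, &used) != 2 || used == 0)
                break;
            if (src < 256) {
                map[src] = dst;
                count++;
            }
            p += used;
        }
        p = end;
    }
    return count;
}
```

With a made-up section such as "2 beginbfchar <01> <0043> <02> <004F> endbfchar", the table would map code 1 to U+0043 (C) and code 2 to U+004F (O). Each byte of a string shown by Tj can then be looked up in the table of whichever font is active at that point in the content stream.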

mkl
  • Aha, makes a bit more sense now... I searched some and read that there should be a ToUnicode entry in the document. There is but maybe you can help me out how to use this? – Ron Dec 22 '12 at 00:32
  • Never mind, FastPDFKit isn't even able to extract the text. Don't think I would be able to do it then... – Ron Dec 22 '12 at 01:43
  • The PDF I'm trying to read is sort of private company information. I'll see if I can filter out some stuff and post this... Maybe you can take a look at it. – Ron Dec 26 '12 at 09:07
  • I added an example to my post... Maybe you can take a look at it to see if one of you guys can extract data from it. – Ron Dec 26 '12 at 11:00
  • @Ron I looked at the PDF you supplied and edited my answer accordingly. In a nutshell: The predominant encoding used in it indeed is very individualistic, but it is well described in the font object. Thus, the method CGPDFStringCopyTextString seems less than adequate for the job of text extraction. You might want to look at [PDFKitten](https://github.com/KurtCode/PDFKitten). – mkl Dec 26 '12 at 19:12
  • Nope, sorry. We won't be getting married. Tried and read everything I could. Used the frameworks FastPDFKit, PDFKitten, PSPDF and some minor smaller stuff but none of them is able to extract the text from the pages. Tried googling for a way to interpret the bytes of the CGPDFStringRef but I couldn't find an answer anywhere. So if you can make it happen, be my guest but otherwise I will stop pulling my hair out now... *sigh* – Ron Dec 26 '12 at 22:06
  • @Ron well, so no marriage... *g* I've got to admit that I didn't test whether PDFKitten can handle the file. All I had at hand was my Windows PC with Java on it, and a Java text extraction program using iText could correctly extract the text without any further ado. The problem for PDFKitten might be the same as observed in [this SO question](http://stackoverflow.com/questions/12914479/pdfkitten-is-highlighting-on-wrong-position/12932653). Maybe the one-line fix applied to PDFKitten in the course of that question wasn't sufficient after all... – mkl Dec 27 '12 at 01:20
  • @mkl the example in PDFKitten only displays the PDF pages with the CGContextDrawPDFPage() function without touching any of the stringWithPDFString() methods in the framework. If I want to test these methods, how do I initialize those font objects and use their implementations of this function? – CodeBrew May 21 '18 at 03:54
  • @CodePlumber please make this a question in its own right. I have not dealt with pdf processing under iOS for a number of years, so I'm not in a position to help here anymore. – mkl May 21 '18 at 06:54