3

How do I convert a CGPDFStringRef to Unicode characters? I have used CGPDFStringCopyTextString to get the string and then [string characterAtIndex:i] to read each unichar. Is this the right way? Or is there a way to get the bytes of the string and convert them to Unicode directly? I need some guidance here.

Lunayo

3 Answers

3

NSString can handle Unicode characters itself. You just need to convert the CGPDFString to an NSString, which you can then use as follows:

NSString *tempStr = (NSString *)CGPDFStringCopyTextString(objectString);
UPT
  • Can CGPDFStringCopyTextString(objectString) produce an accurate string (no missing bytes)? And do you know which encoding CGPDFStringCopyTextString() uses? – Lunayo Oct 18 '11 at 05:53
  • It should not lose any bytes. I implemented a reader for Spanish-language magazines the same way and it is still working fine, so I believe it won't drop any bytes. – UPT Oct 18 '11 at 06:01
  • Have you ever parsed (extracted) text from a PDF? I keep getting character code points that aren't provided in the ToUnicode table. – Lunayo Oct 18 '11 at 06:10
  • NSString* str = [NSString stringWithUTF8String:[tempStr cStringUsingEncoding:NSUTF8StringEncoding]]; – UPT Oct 18 '11 at 06:23
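
On the encoding question in the comments above: the PDF specification defines a "text string" as either UTF-16BE with a leading FE FF byte order mark or PDFDocEncoding, and CGPDFStringCopyTextString appears to interpret the bytes that way. Below is a minimal sketch of that decoding in plain C (it also compiles inside an Objective-C file). The function name pdf_text_string_decode is made up for illustration, and the branch without a byte order mark is a simplification that is only exact for the ASCII range of PDFDocEncoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode the raw bytes of a PDF "text string" into
 * Unicode code points. Returns the number of code points written to out
 * (never more than out_cap).
 * Simplification: without a UTF-16BE byte order mark, each byte is passed
 * through unchanged, which is only exact for the ASCII range. */
size_t pdf_text_string_decode(const unsigned char *bytes, size_t len,
                              uint32_t *out, size_t out_cap)
{
    assert(bytes != NULL && out != NULL);
    size_t n = 0;

    if (len >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF) {
        /* UTF-16BE: two bytes per code unit, starting after the BOM. */
        size_t i = 2;
        while (i + 1 < len && n < out_cap) {
            uint32_t unit = ((uint32_t)bytes[i] << 8) | bytes[i + 1];
            i += 2;
            /* Combine a high surrogate with the low surrogate that follows. */
            if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < len) {
                uint32_t low = ((uint32_t)bytes[i] << 8) | bytes[i + 1];
                if (low >= 0xDC00 && low <= 0xDFFF) {
                    unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
                    i += 2;
                }
            }
            out[n++] = unit;
        }
        return n;
    }

    /* No BOM: one byte per character (simplified, see the note above). */
    for (size_t i = 0; i < len && n < out_cap; i++) {
        out[n++] = bytes[i];
    }
    return n;
}
```

On iOS the input would come from CGPDFStringGetBytePtr and CGPDFStringGetLength. Note that strings shown by content-stream operators such as Tj are character codes in the font's encoding rather than text strings, which may explain code points that don't match the ToUnicode table.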
1

Although UPT's answer is correct, it will produce a memory leak.

From the documentation: CGPDFStringCopyTextString "...You are responsible for releasing this object."

The correct way to do this would be:

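// The function name contains "Copy", so the caller owns the returned CFStringRef.
CFStringRef _res = CGPDFStringCopyTextString(pdfString);
// Copy into an ARC-managed NSString; the __bridge cast transfers no ownership.
NSString *result = _res ? [NSString stringWithString:(__bridge NSString *)_res] : nil;
if (_res) CFRelease(_res); // CFRelease must not be called with NULL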
Ismael
0

That's not a bad approach, although you can also access the bytes of the CGPDFString directly using CGPDFStringGetBytePtr. You will also need CGPDFStringGetLength to get the string length, as the data may not be null-terminated.

See the documentation for more info

Jaffa
  • I think some bytes go missing when using CGPDFStringCopyTextString. When I use it to get a PDF string, some of the resulting characters are different (I cannot find the matching glyph in the ToUnicode table). If I use CGPDFStringGetBytePtr, how do I convert the bytes to Unicode? – Lunayo Oct 18 '11 at 05:49
  • CGPDFStringGetBytePtr returns the raw internal data – Jaffa Oct 18 '11 at 06:04
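
As for turning those raw bytes into Unicode: for text drawn with a font that carries a ToUnicode CMap, the bytes are character codes that have to be looked up in that map. Here is a rough sketch in plain C (usable from Objective-C), assuming the CMap has already been parsed into a flat table. CodeMapEntry, lookup_code and map_codes_to_unicode are hypothetical names, and a fixed code width is a simplification: real CMaps declare codespace ranges, use ranges of codes, and can map one code to several code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, flattened form of a ToUnicode CMap: one entry per code. */
typedef struct {
    uint16_t code;     /* character code as it appears in the string bytes */
    uint32_t unicode;  /* Unicode code point that the code stands for */
} CodeMapEntry;

#define UNMAPPED_CODE_POINT 0xFFFDu  /* U+FFFD REPLACEMENT CHARACTER */

/* Linear search; fine for a sketch, a real table would be sorted or hashed. */
static uint32_t lookup_code(const CodeMapEntry *table, size_t table_len,
                            uint16_t code)
{
    for (size_t i = 0; i < table_len; i++) {
        if (table[i].code == code) {
            return table[i].unicode;
        }
    }
    return UNMAPPED_CODE_POINT;
}

/* Split bytes into codes of code_width bytes (1 or 2, big-endian) and map
 * each code through the table. Returns the number of code points written. */
size_t map_codes_to_unicode(const unsigned char *bytes, size_t len,
                            size_t code_width,
                            const CodeMapEntry *table, size_t table_len,
                            uint32_t *out, size_t out_cap)
{
    assert(code_width == 1 || code_width == 2);
    size_t n = 0;
    for (size_t i = 0; i + code_width <= len && n < out_cap; i += code_width) {
        uint16_t code = bytes[i];
        if (code_width == 2) {
            code = (uint16_t)((code << 8) | bytes[i + 1]);
        }
        out[n++] = lookup_code(table, table_len, code);
    }
    return n;
}
```

The bytes and len arguments would be the results of CGPDFStringGetBytePtr and CGPDFStringGetLength. Codes missing from the table come back as U+FFFD, which makes gaps in the map easy to spot.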