
I've run into a problem searching for Cyrillic (as well as any other non-ASCII) characters in a PDF using PDDScanner. The code I am using is similar to the code from the Random Ideas blog that has been mentioned on SO. The problem is that for Cyrillic PDFs the scanner output is complete garbage, which can't be decoded into anything meaningful. English characters in Cyrillic PDFs are found perfectly well. So the Cyrillic text appears to be encoded in some way, and we can't figure out how to decode it properly.

What are we missing here?

Thanks in advance to anyone who can shed any light on the subject.

Adviser2010

2 Answers


Have you tried pushing that string through a different encoding? When I look at NSString.h, I see something suspiciously labelled "cyrillic" which also has "Adobe" on the same line :) (i.e., try NSWindowsCP1251StringEncoding)

enum {
    NSASCIIStringEncoding = 1,      /* 0..127 only */
    NSNEXTSTEPStringEncoding = 2,
    NSJapaneseEUCStringEncoding = 3,
    NSUTF8StringEncoding = 4,
    NSISOLatin1StringEncoding = 5,
    NSSymbolStringEncoding = 6,
    NSNonLossyASCIIStringEncoding = 7,
    NSShiftJISStringEncoding = 8,          /* kCFStringEncodingDOSJapanese */
    NSISOLatin2StringEncoding = 9,
    NSUnicodeStringEncoding = 10,
    NSWindowsCP1251StringEncoding = 11,    /* Cyrillic; same as AdobeStandardCyrillic */
    NSWindowsCP1252StringEncoding = 12,    /* WinLatin1 */
    NSWindowsCP1253StringEncoding = 13,    /* Greek */
    NSWindowsCP1254StringEncoding = 14,    /* Turkish */
    NSWindowsCP1250StringEncoding = 15,    /* WinLatin2 */
    NSISO2022JPStringEncoding = 21,        /* ISO 2022 Japanese encoding for e-mail */
    NSMacOSRomanStringEncoding = 30,

    NSUTF16StringEncoding = NSUnicodeStringEncoding,      /* An alias for NSUnicodeStringEncoding */

    NSUTF16BigEndianStringEncoding = 0x90000100,          /* NSUTF16StringEncoding encoding with explicit endianness specified */
    NSUTF16LittleEndianStringEncoding = 0x94000100,       /* NSUTF16StringEncoding encoding with explicit endianness specified */

    NSUTF32StringEncoding = 0x8c000100,                   
    NSUTF32BigEndianStringEncoding = 0x98000100,          /* NSUTF32StringEncoding encoding with explicit endianness specified */
    NSUTF32LittleEndianStringEncoding = 0x9c000100        /* NSUTF32StringEncoding encoding with explicit endianness specified */
};
Scott Corscadden
  • Unfortunately, it doesn't help. I tried to convert the NSString using the encodings above, but it does not work... – Adviser2010 Apr 13 '12 at 08:07
  • NSData *data=[currentData dataUsingEncoding:NSUTF8StringEncoding]; NSString *ddd = [[NSString alloc] initWithData:data encoding:NSWindowsCP1251StringEncoding]; The problem is that when I try to decode the whole stream returned by the scanner, it returns null. Other encodings just gave me complete garbage again. – Adviser2010 Apr 13 '12 at 10:52
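One possible explanation for the null result in the comment above, offered as a sketch rather than a confirmed diagnosis: the string has already been decoded once, so re-encoding it as UTF-8 and reading those bytes back as CP1251 cannot restore the original text. It can also produce the byte 0x98, which is the one byte Windows-1251 leaves unassigned, and a strict decoder may reject the whole buffer because of it. The logged output contains "˘" (U+02D8), whose UTF-8 form is CB 98. The helpers below are hypothetical names written in plain C (which also compiles as Objective-C) and do not use any Foundation API; they only illustrate the byte arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point below U+10000 as UTF-8.
   Returns the number of bytes written to out (1 to 3). */
size_t utf8_encode(uint32_t cp, unsigned char out[3])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Windows-1251 assigns a character to every byte except 0x98.
   Returns 1 if a strict CP1251 decoder could accept the buffer,
   0 if it contains the unassigned byte. */
int cp1251_decodable(const unsigned char *bytes, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (bytes[i] == 0x98)
            return 0;
    }
    return 1;
}
```

Encoding U+02D8 with `utf8_encode` yields the two bytes CB 98, and `cp1251_decodable` returns 0 for them. Even when the conversion does succeed, the result is a second layer of mis-decoding rather than the original Cyrillic, so working from the raw bytes the scanner returns is likely a better starting point than converting the decoded string.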

You might have to dig deeper into the Apple spec and headers on this. Add NSLog lines (and post the output here) for what the scanner finds in the normal PDF and in the Cyrillic ones. There are lots of possibilities (perhaps a different encoding, i.e. you need to translate the string you have into a different one using that encoding). I'm sure there is a way to list all the operators in the table, to see if there are extra ones in your Cyrillic PDF. Also, this might help, since it deals with a problem very similar to the one you're trying to solve; it points to a library that is better suited to scanning too.
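When logging what the scanner finds, it may be more informative to print the raw bytes of each string than the decoded text, because the decoded form hides which byte values the PDF actually uses. A minimal sketch of a hex formatter in plain C follows; `hex_dump` is a hypothetical helper name. It assumes the scanner in question is Core Graphics' CGPDFScanner, in which case the bytes and length of a string operand would come from `CGPDFStringGetBytePtr` and `CGPDFStringGetLength`.

```c
#include <stddef.h>
#include <stdio.h>

/* Format bytes as space-separated uppercase hex pairs, e.g. "18 CB 98".
   Returns the number of characters written (excluding the NUL),
   or 0 if len is 0 or out is too small. */
size_t hex_dump(const unsigned char *bytes, size_t len,
                char *out, size_t out_size)
{
    /* Each byte takes "XX " (3 chars); the final space is replaced by NUL. */
    size_t needed = (len == 0) ? 1 : len * 3;
    if (out == NULL || out_size < needed)
        return 0;

    size_t pos = 0;
    for (size_t i = 0; i < len; i++) {
        const char *fmt = (i + 1 < len) ? "%02X " : "%02X";
        pos += (size_t)snprintf(out + pos, out_size - pos, fmt, bytes[i]);
    }
    out[pos] = '\0';
    return pos;
}
```

The resulting string can be passed to NSLog with a `%s` format. Comparing the dumps for an English run and a Cyrillic run from the same PDF should show whether the Cyrillic text uses single bytes, byte pairs, or small font-specific codes, which narrows down which decoding step is missing.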

Scott Corscadden
  • I added NSlog and got the following (an excerpt) ˜ ˜˚˚ ˛˚˚-˜˝-˙˝www.mediayug.ru˜˚˛˝˙ˆ ˇ˘ ˆ ˙ˆ ˛˝˛ˆ: ˆˇ˘ ˘! ˘ ˘, ˘ ˘ ˘ˇ˘ iPad ˜ ˚˛˝˙ˆˇ˘˙ ˚ ˘ˇ ˝ ˙˝ ˚˛˝˘ ˇ Android 2012-04-09 14:24:32.238 PublishLike[8939:16d03] ı ¾ à ¶  ¾ À ¶ Å Æ Ä ¾ ½ ¸ Ä º Ç È ¸ ¶ ¾ Å Æ Ä º ¶ ¼ Ž € à ¶ Æ Ä Ç Ç ¾ ¿ Ç À Ä Â Æ Ñ Ã À » Ç Ä Ç È ¶ ¸ Á Õ » È ˇ ¸ Ç Æ » º à »  ¸ È » Í » à ¾ » Å Ä Ç Á » º à ¾ Ë – Adviser2010 Apr 09 '12 at 10:29
  • Notice that the English characters are not encoded, but the rest of the text looks cryptic. – Adviser2010 Apr 09 '12 at 10:30
  • Wish I could help more - who created/authored said pdf in the first place? Can you contact them? What PDF edit tools can you try/buy that might help you analyze the internal table codes? – Scott Corscadden Apr 20 '12 at 11:45