Your search text RAVI's contains a vertical (typewriter) apostrophe. Have you checked whether the PDF contains a forward- or backward-slanted version of that character instead? Those variants have different character codes, after all.
In the context of the question PDFKitten is highlighting on wrong position, it became apparent that the library returns ligatures as single ligature characters instead of 'de-ligaturized' character groups. If your text contains ligatures, that might be the reason.
In the context of the same question, PDFKitten's font data parsing turned out to be deficient in some respects. In reaction to that question, a workaround for one such deficiency was added to the code, but in my eyes it fixed merely some special cases, not the general case; cf. the proposal in my answer there.
Furthermore, some fonts simply do not contain the information for mapping their glyphs back to Unicode characters. You say the Special Characters are not able to search; maybe those special characters are taken from a different font that does not support such parsing.
Theoretically the apostrophe might even have been drawn using graphic, non-text operators. In that case text parsing won't find it.
If none of these ideas explain your case (or you cannot check whether they do), please supply the sample PDF for inspection.
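To check the first two ideas quickly, you could normalize both the search term and the extracted text before comparing them. The following is only a minimal sketch in Python (not PDFKitten code); the function name normalize_for_search and the selection of apostrophe variants are my own assumptions for illustration.

```python
import unicodedata

# Hypothetical helper, not part of PDFKitten: map slanted apostrophe
# variants to the vertical ASCII apostrophe (U+0027).
APOSTROPHE_VARIANTS = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u02bc": "'",  # modifier letter apostrophe
})


def normalize_for_search(text):
    """Unify apostrophes and split ligatures into their base letters."""
    text = text.translate(APOSTROPHE_VARIANTS)
    # NFKC applies compatibility decompositions, so a ligature code point
    # such as U+FB01 becomes the two letters "fi".
    return unicodedata.normalize("NFKC", text)


needle = normalize_for_search("RAVI's")
haystack = normalize_for_search("RAVI\u2019s pro\ufb01le")
found = needle in haystack
```

If the search only succeeds after this kind of normalization, the mismatch is in the character codes and not in the text positions.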
EDIT (taking into account your Brivo MR355 copy.pdf sample file)
I assume that the apostrophe is again the troublemaker, this time in MR355’s. There are two occurrences in the raw page content,
/TT1 1 Tf
0.559 0 Td
(Brivo MR355\222s Ready Bar technology replaces 30 complex inputs with a single control, simplifying scan optimization )Tj
and
/TT1 1 Tf
0.559 0 Td
(Brivo MR355\222s Ready Interface)Tj
Both times the font resource /TT1 is used, and both times the apostrophe is encoded as \222, which is octal for 146 decimal: quoteright in WinAnsiEncoding, trademark in PDFDocEncoding.
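The octal arithmetic and the WinAnsiEncoding lookup can be double-checked with a few lines of Python. This sketch assumes that Python's cp1252 codec is an adequate stand-in for WinAnsiEncoding at this code point; the two agree for 146, although they are not identical everywhere.

```python
import unicodedata

# The escape \222 in a PDF string literal is an octal character code.
code = int("222", 8)
raw = bytes([code])

# Assumption: cp1252 stands in for WinAnsiEncoding here.
char = raw.decode("cp1252")
name = unicodedata.name(char)

print(code, hex(code), name)
```

So the byte in question is 0x92, and under WinAnsiEncoding it denotes the slanted apostrophe U+2019, not the vertical U+0027.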
/TT1 is
/LastChar 146
/BaseFont /REEDOQ+GEInspira
/Type /Font
/Subtype /TrueType
/Encoding /WinAnsiEncoding
/Widths [232, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 198, 0, 0, 0, 530, 0, 0, 530, 0, 530, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 570, 0, 0, 0, 0, 0, 0, 243, 0, 0, 0, 764, 0, 0, 0, 0, 556, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 545, 545, 482, 545, 509, 297, 545, 544, 210, 0, 0, 210, 836, 544, 537, 545, 545, 341, 437, 317, 544, 474, 736, 471, 474, 427, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 190]
/FontDescriptor 32 0 R
/FirstChar 32
With /LastChar being 146 and /Encoding being /WinAnsiEncoding, it should be easy for PDFKitten to recognize the \222 as the quoteright character.
As one of your comments indicated that you are not using the most current PDFKitten version, I'll do the code analysis based on an older copy, too.
While parsing that font dictionary (setEncodingNamed in Font.m), PDFKitten recognizes the string "WinAnsiEncoding" and stores WinAnsiEncoding (3) from the enum CharacterEncoding (Font.h) in self.encoding. Later on, when converting the raw PDF data to Unicode (stringWithPDFString in SimpleFont.m), it calls and returns
NSString *string = [[NSString alloc] initWithData:rawBytes encoding:self.encoding];
But the encoding constants in NSString.h include
NSJapaneseEUCStringEncoding = 3,
Thus, PDFKitten here tries to decode the raw data as EUC-JP, which should fail miserably for byte values > 127, while byte values <= 127 are interpreted as ASCII characters.
NSString's initWithData:encoding: returns nil if the initialization fails for some reason (for example, if data does not represent valid data for encoding). Thus, PDFKitten drops the whole fragment while processing the PDF data.
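The same effect can be reproduced outside iOS. The sketch below uses Python's euc_jp and cp1252 codecs as analogues of NSJapaneseEUCStringEncoding and WinAnsiEncoding; the helper decode_or_none is hypothetical and merely mimics the nil-on-failure behaviour of initWithData:encoding:, so treat it as an illustration, not as a statement about Foundation's internals.

```python
def decode_or_none(raw, codec):
    """Hypothetical analogue of initWithData:encoding: (None instead of nil)."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None


# Second text fragment from the page content, with \222 as a raw byte.
fragment = b"Brivo MR355\x92s Ready Interface"

as_euc_jp = decode_or_none(fragment, "euc_jp")   # wrong codec: whole string lost
as_winansi = decode_or_none(fragment, "cp1252")  # intended codec: decodes fine
ascii_only = decode_or_none(b"Ready Interface", "euc_jp")  # pure ASCII survives

print(as_euc_jp, as_winansi, ascii_only, sep="\n")
```

This matches the observed behaviour: pure ASCII fragments remain searchable, while any fragment containing a single byte above 127 vanishes as a whole.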
At first glance, the relevant code parts are still the same in the current code base. Thus, you might want to report an issue at the PDFKitten site concerning character codes > 127 for fonts with /Encoding /WinAnsiEncoding and most likely with /Mac*Encoding, too.
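A fix would presumably translate the PDF encoding name into the matching platform encoding explicitly instead of reusing the enum value as an encoding identifier. The following is a sketch of that idea in Python, not PDFKitten's actual code; the table and the function decode_pdf_string are assumptions of mine, and the Python codecs only approximate the PDF encodings.

```python
# Hypothetical lookup table: PDF base encoding name -> Python codec.
# cp1252 and mac_roman only approximate WinAnsiEncoding and MacRomanEncoding.
PDF_ENCODING_TO_CODEC = {
    "WinAnsiEncoding": "cp1252",
    "MacRomanEncoding": "mac_roman",
}


def decode_pdf_string(raw, pdf_encoding_name):
    """Decode raw string bytes of a simple font using its named encoding."""
    codec = PDF_ENCODING_TO_CODEC.get(pdf_encoding_name)
    if codec is None:
        raise ValueError("unsupported encoding: " + pdf_encoding_name)
    # Replace undecodable bytes rather than dropping the whole fragment.
    return raw.decode(codec, errors="replace")


win = decode_pdf_string(b"MR355\x92s", "WinAnsiEncoding")
mac = decode_pdf_string(b"MR355\xd5s", "MacRomanEncoding")
```

Note that the quoteright sits at a different code in each encoding (0x92 versus 0xD5), which is why the name-to-encoding mapping has to be explicit.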