In PDFKitten Special Character Searching not possible

Question

I am using PDFKitten's search functionality and found that special characters cannot be searched. For example, if the PDF contains the word RAVI's and I search for that word, the search returns NULL. Please suggest how I can resolve this issue.

Thanks

In Scanner.m there is a function didScanString:

void didScanString(CGPDFStringRef pdfString, Scanner *scanner)
{
    // First log: the string as Core Graphics decodes it, without font information.
    NSString *tempStr = (NSString *)CGPDFStringCopyTextString(pdfString);
    NSLog(@"ScanString==%@", tempStr);
    [tempStr release]; // Copy rule: the caller owns the returned string (manual reference counting)

    // Second log: the string as PDFKitten decodes it using the current font.
    NSString *string = [[scanner stringDetector] appendPDFString:pdfString withFont:[scanner currentFont]];
    NSLog(@"didScanString====>>>%@", string);
    [ss appendString:string]; // ss is not declared in this excerpt
    [[scanner content] appendString:string];
    //NSLog(@"TOTAL: %@",[scanner content]);
}

For example, the PDF text being searched is MGR KL445's. In the function above, the first NSLog outputs ScanString==MGR KL445™s and the second one outputs nothing.

Your edit seems to suggest that while the string encoding in the PDF is nearly a standard encoding (the first log outputs the raw data from the page content), PDFKitten does not seem to have access to complete encoding information (the second log outputs the decoded data). The reasons may still be anything mentioned in my answer. The PDF needs to be inspected. – mkl Jan 22 '13 at 05:18
... (continued) OK, the reason may not be just anything from my answer: using different types of apostrophes in the search will not help, and the special chars don't seem to be from a different font. The other options are still open, though, and so the PDF itself needs to be inspected. – mkl Jan 22 '13 at 05:33
Here is the PDF file link https://www.filesanywhere.com/fs/v.aspx?v=8a726b865d6573b5a3ac – Subodh S Jan 22 '13 at 06:31
I'm currently inspecting the PDF. It seems, though, that you are not using the most current PDFKitten version. The current Scanner.m does not contain a `didScanString` method anymore; it was refactored out some 25 days ago. This makes me wonder which PDFKitten version you have. Please specify your version, or check out the most current version and try to reproduce your issue. – mkl Jan 22 '13 at 08:14
Yes, this one is almost 3 months old. I have just downloaded the new one and face the same problem with it. In the PDF I changed the font of the apostrophe (') to Arial, and now the text can be found without the apostrophe, but after the apostrophe the search highlight is shifted slightly to the left of the matched text. – Subodh S Jan 22 '13 at 09:28
If you only take the apostrophe from Arial and the other letters around it are still using GEInspira, and if Arial is embedded in the same way as GEInspira, that behaviour is to be expected: in that case the apostrophe is a text element all by itself for which the same decoding issue exists. Thus, it (and its width!) is ignored by PDFKitten, and on its line all markings on its right side are shifted left by that missing width. – mkl Jan 22 '13 at 09:42

score 1 · Answer 1 · edited May 23 '17 at 12:07

Your search text RAVI's contains a vertical apostrophe; have you checked whether the PDF contains a forward or backward slanted version of that character instead? After all, those versions have different character codes.

In the context of the question "PDFKitten is highlighting on wrong position", it became apparent that the library returns ligatures as single ligature characters instead of 'de-ligaturized' character groups. If your text contains ligatures, that might be the reason.

In the context of the same question, PDFKitten font data parsing turned out to be deficient in some respects. In reaction to that question, a workaround for one such deficiency has been added to the code which in my eyes didn't fix the general case, merely some special cases, cf. the proposal in my answer there.

Furthermore, some fonts simply do not contain the information for mapping their glyphs back to Unicode characters. You say "the Special Characters are not able to search"; maybe those special characters are taken from a different font that does not support parsing.

Theoretically the apostrophe might even have been drawn using graphic, non-text operators. In that case text parsing won't find it.

If none of these ideas explain your case (or you cannot check whether they do), please supply the sample PDF for inspection.

EDIT (taking into account your Brivo MR355 copy.pdf sample file)

I assume that the apostrophe is troublesome again, this time in MR355’s. There are two occurrences in the raw page content,

/TT1 1 Tf
0.559 0 Td
(Brivo MR355\222s Ready Bar technology replaces 30 complex inputs with a single control, simplifying scan optimization )Tj

and

/TT1 1 Tf
0.559 0 Td
(Brivo MR355\222s Ready Interface)Tj

Both times the font resource /TT1 is used, and both times the apostrophe is encoded as \222, which is octal for 146 decimal: quoteright in WinAnsiEncoding, trademark in PDFDocEncoding.
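The octal arithmetic can be checked with a few lines of plain C (which also compiles unchanged inside an Objective-C .m file). This is only an illustrative sketch: the function name pdf_octal_escape_value is made up here and is not part of PDFKitten or Core Graphics.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the byte value of an octal escape such as "\222" in a PDF
   literal string. `s` must point at the backslash; up to three octal
   digits after it are consumed. Hypothetical helper for illustration. */
int pdf_octal_escape_value(const char *s)
{
    assert(s != NULL && s[0] == '\\');
    int value = 0;
    for (int i = 1; i <= 3 && s[i] >= '0' && s[i] <= '7'; i++) {
        value = value * 8 + (s[i] - '0');   /* shift one octal digit in */
    }
    return value;
}
```

For the escape in the page content above, 2*64 + 2*8 + 2 gives 146, i.e. 0x92.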

/TT1 is

/LastChar   146
/BaseFont   /REEDOQ+GEInspira
/Type   /Font
/Subtype    /TrueType
/Encoding   /WinAnsiEncoding
/Widths [232, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 198, 0, 0, 0, 530, 0, 0, 530, 0, 530, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 570, 0, 0, 0, 0, 0, 0, 243, 0, 0, 0, 764, 0, 0, 0, 0, 556, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 545, 545, 482, 545, 509, 297, 545, 544, 210, 0, 0, 210, 836, 544, 537, 545, 545, 341, 437, 317, 544, 474, 736, 471, 474, 427, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 190]
/FontDescriptor 32 0 R
/FirstChar  32

/LastChar being 146 and /Encoding being /WinAnsiEncoding should make it easy for PDFKitten to recognize the \222 as quoteright character.

As one of your comments indicated that you are not using the most current PDFKitten version, I'll do the code analysis based on an older copy, too.

While parsing that font dictionary (setEncodingNamed in Font.m), PDFKitten recognizes the string "WinAnsiEncoding" and stores WinAnsiEncoding (3) from the enum CharacterEncoding (Font.h) in self.encoding. Later on, when converting the raw PDF data to Unicode (stringWithPDFString in SimpleFont.m), it calls and returns

NSString *string = [[NSString alloc] initWithData:rawBytes encoding:self.encoding];

But the encoding constants in NSString.h define

NSJapaneseEUCStringEncoding     = 3,

Thus, PDFKitten here tries to decode the raw data as EUC-JP, which should fail miserably for byte values > 127, while byte values <= 127 are interpreted as ASCII characters.

NSString initWithData returns nil if the initialization fails for some reason (for example if data does not represent valid data for encoding). Thus, PDFKitten drops the whole fragment while processing the PDF data.
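One way to avoid the collision is to translate the PDF-side enum value into a Foundation encoding before it reaches initWithData. The following is a minimal sketch in plain C under several assumptions: the PK-prefixed enum is a hypothetical stand-in for PDFKitten's CharacterEncoding (only WinAnsi = 3 is known from the analysis above), string_encoding_for is an invented name, and the numeric constants are the raw values of the corresponding NSStringEncoding constants, copied so that the sketch builds without Foundation.

```c
#include <assert.h>

/* Hypothetical stand-in for PDFKitten's CharacterEncoding enum (Font.h).
   Only WinAnsi == 3 is confirmed above; the rest is assumed. */
typedef enum {
    PKUnknownEncoding  = 0,
    PKStandardEncoding = 1,
    PKMacRomanEncoding = 2,
    PKWinAnsiEncoding  = 3
} PKCharacterEncoding;

/* Raw values of Foundation's NSStringEncoding constants. */
enum {
    kJapaneseEUC   = 3,   /* NSJapaneseEUCStringEncoding   */
    kISOLatin1     = 5,   /* NSISOLatin1StringEncoding     */
    kWindowsCP1252 = 12,  /* NSWindowsCP1252StringEncoding */
    kMacOSRoman    = 30   /* NSMacOSRomanStringEncoding    */
};

/* Maps the PDF-side encoding to a Foundation encoding value instead of
   passing the enum's raw number through unchanged. */
unsigned long string_encoding_for(PKCharacterEncoding encoding)
{
    assert(encoding <= PKWinAnsiEncoding);
    switch (encoding) {
        case PKWinAnsiEncoding:  return kWindowsCP1252;
        case PKMacRomanEncoding: return kMacOSRoman;
        default:                 return kISOLatin1; /* assumed fallback: accepts every byte value */
    }
}
```

In the Objective-C code the returned value would take the place of self.encoding in the initWithData call. The fallback is a design choice of this sketch, not something PDFKitten prescribes.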

At first glance, the relevant code parts are still the same in the current code base. Thus, you might want to report an issue at the PDFKitten site concerning character codes > 127 for fonts with /Encoding /WinAnsiEncoding and, most likely, with `/Mac*Encoding`, too.
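If relying on a Foundation encoding constant is not desired, the codes above 127 can also be decoded with a dedicated table. The sketch below uses the Windows code page 1252 assignments for 0x80 to 0x9F and passes all other bytes through as Latin-1. Whether this table agrees with PDF WinAnsiEncoding for every code is an assumption that should be checked against the PDF specification; the function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Code page 1252 assignments for bytes 0x80..0x9F.
   0xFFFD marks codes that CP1252 leaves undefined. */
static const uint16_t kCP1252High[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
};

/* Converts one CP1252 byte to a UTF-16 code unit. */
uint16_t cp1252_to_unicode(uint8_t code)
{
    if (code >= 0x80 && code <= 0x9F) {
        return kCP1252High[code - 0x80];
    }
    return code; /* ASCII and 0xA0..0xFF coincide with Latin-1 */
}

/* Decodes `count` raw bytes into `out`, which must hold `count` units. */
void decode_cp1252(const uint8_t *in, size_t count, uint16_t *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < count; i++) {
        out[i] = cp1252_to_unicode(in[i]);
    }
}
```

With this table the byte 0x92 (octal 222) becomes U+2019, RIGHT SINGLE QUOTATION MARK, and no fragment is dropped because of a failed string initialization.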

answered Jan 21 '13 at 12:09

mkl

I have added an example to the question. Please have a look and let me know. – Subodh S Jan 22 '13 at 05:12
Here is the PDF link https://www.filesanywhere.com/fs/v.aspx?v=8a726b865d6573b5a3ac – Subodh S Jan 22 '13 at 05:55
@ReeganSs I've added an analysis of what most likely happens when PDFKitten analyzes your sample file. As I'm currently at a Windows machine, I cannot verify the analysis, though. – mkl Jan 22 '13 at 09:34
Thank you. So I will report this problem to PDFKitten. – Subodh S Jan 22 '13 at 11:42
Can you provide me with a solution for this problem? – Subodh S Jan 22 '13 at 12:58
You might want to change the `NSString *string = [[NSString alloc] initWithData:rawBytes encoding:self.encoding];` line to use the appropriate encoding constant from NSString.h instead of `self.encoding` directly, i.e. map the CharacterEncoding (Font.h) self.encoding to a matching `NS*StringEncoding` (NSString.h) and use that value as encoding in initWithData. – mkl Jan 22 '13 at 14:17
`NSString *string = [[NSString alloc] initWithData:rawBytes encoding:NSWindowsCP1252StringEncoding]; NSLog(@"string:%@",string);` With this, the log shows the 355's properly, but the search still does not find it. – Subodh S Jan 23 '13 at 05:42
Could you please tell me how you inspect the PDF file, and where I can get those details? – Subodh S Jan 23 '13 at 05:45
@SubodhS I use RUPS, a Java PDF internals browser based on iText; there are also other such tools around. – mkl Jan 23 '13 at 06:23
@SubodhS I don't know whether code page 1252 corresponds 100% with WinAnsi in PDF. You might even have to add a dedicated conversion method of your own. – mkl Jan 23 '13 at 06:31
If the logging now shows the *355's* properly but the search doesn't match it, the problem might be different Unicode codes for the apostrophe (there are different Unicode characters looking like an apostrophe, some vertical, some slanted left or right, ...). Thus, debug or add appropriate logging of characters and their Unicode values to `appendPDFString` or `append` in StringDetector.m. – mkl Jan 23 '13 at 08:06
In both functions the PDF text contains the apostrophe. If I search for a text such as "Brivo", it is found, but the width of the apostrophe is not taken into account, so the highlighted area is not exact. Could you please tell me in which function the text position is calculated? – Subodh S Jan 23 '13 at 12:31
Please do log both the PDF text and the search text, and please do also log the Unicode value of the apostrophe in each of them. – mkl Jan 23 '13 at 19:20
I have checked the Unicode values: the apostrophe in the PDF text is 8217 and the one in the search text is 39. – Subodh S Jan 24 '13 at 06:32
So the PDF contains a [RIGHT SINGLE QUOTATION MARK](http://www.fileformat.info/info/unicode/char/2019/index.htm) and you are searching for an [APOSTROPHE](http://www.fileformat.info/info/unicode/char/0027/index.htm). Thus, PDFKitten is completely right to **not** call it a match. If you want a less precise search, you need to change StringDetector.m, at least the comparison *([self nextCharacter:isLast] != [string characterAtIndex:i])* in append but maybe more. This, BTW, was what I hinted at in the first paragraph of my answer... – mkl Jan 24 '13 at 08:16
*Concerning the incorrect search markings right of the quote mark...* That most likely is due to the deficiency in PDFKitten for which my answer refers to another question on Stack Overflow: PDFKitten does not properly find the glyph widths for glyphs whose identifier in the PDF is not their Unicode character code (in your case: identifier 146 but Unicode code 8217). In my answer to that question you'll find my idea on how that deficiency can be fixed. – mkl Jan 24 '13 at 08:38
I have solved the issue by replacing the apostrophe in the PDF: I picked the character from the Windows Character Map and edited the PDF text in Adobe Illustrator. Now it is working properly. – Subodh S Jan 25 '13 at 09:16
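
For readers who cannot edit the PDF, the less precise search mentioned in the comments could fold apostrophe look-alikes to one character before the comparison in StringDetector.m. The following plain C sketch shows the idea on UTF-16 code units (the same width as unichar). The function names and the choice of which characters to fold are assumptions of this example, not PDFKitten code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Folds common apostrophe look-alikes to U+0027 APOSTROPHE (39).
   The set of folded characters is a choice made for this sketch. */
uint16_t fold_apostrophe(uint16_t c)
{
    switch (c) {
        case 0x2018: /* LEFT SINGLE QUOTATION MARK */
        case 0x2019: /* RIGHT SINGLE QUOTATION MARK (8217) */
        case 0x02BC: /* MODIFIER LETTER APOSTROPHE */
            return 0x0027;
        default:
            return c;
    }
}

/* Returns 1 if the first `count` units of both buffers are equal after
   folding, 0 otherwise. */
int matches_loosely(const uint16_t *text, const uint16_t *needle, size_t count)
{
    assert(text != NULL && needle != NULL);
    for (size_t i = 0; i < count; i++) {
        if (fold_apostrophe(text[i]) != fold_apostrophe(needle[i])) {
            return 0;
        }
    }
    return 1;
}
```

Applied to the values logged above, 8217 from the PDF and 39 from the search text compare as equal, while all other characters are still matched exactly.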
