Getting text from a PDF document with iTextSharp

Question

I tried using iTextSharp to get the text from a PDF document. It works great if the PDF contains English text (Latin characters), but if I try to get the text from a PDF with Cyrillic characters, the output is just question marks. Are there some settings to be made, or is Cyrillic not supported? This is the code for creating the PDF:

    string testText = "зззi";
    string tmpFile = @"C:\items\test.pdf";
    string myFont = @"C:\windows\fonts\verdana.ttf";

    // Page size in points, with 10pt margins on every side.
    iTextSharp.text.Rectangle pgeSize = new iTextSharp.text.Rectangle(595, 792);
    iTextSharp.text.Document doc = new iTextSharp.text.Document(pgeSize, 10, 10, 10, 10);
    iTextSharp.text.pdf.PdfWriter wrtr;
    wrtr = iTextSharp.text.pdf.PdfWriter.GetInstance(doc,
        new System.IO.FileStream(tmpFile, System.IO.FileMode.Create));
    doc.Open();
    doc.NewPage();

    // Embed Verdana with Identity-H encoding so the Cyrillic glyphs are available.
    iTextSharp.text.pdf.BaseFont bfR;
    bfR = BaseFont.CreateFont(myFont, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);

    iTextSharp.text.BaseColor clrBlack =
        new iTextSharp.text.BaseColor(0, 0, 0);
    iTextSharp.text.Font fntHead =
        new iTextSharp.text.Font(bfR, 34, iTextSharp.text.Font.NORMAL, clrBlack);

    // Write the test string as a single paragraph and finish the document.
    iTextSharp.text.Paragraph pgr =
        new iTextSharp.text.Paragraph(testText, fntHead);
    doc.Add(pgr);
    doc.Close();

This is the code for retrieving the text:

    PdfReader reader1 = new PdfReader("c:/items/test.pdf");
    Console.WriteLine(PdfTextExtractor.GetTextFromPage(reader1, 1, new SimpleTextExtractionStrategy()));
    Console.ReadLine();

The output is: ???i

EDIT 2: I managed to read the text from the PDF I created, but I still can't get the text from a random PDF. How can I check whether that PDF provides the information required for text extraction?

Possible duplicate of [iTextSharp international text](http://stackoverflow.com/questions/1727765/itextsharp-international-text) — xZ6a33YaYEfmv, Oct 16 '15 at 20:32
@ieaglle i saw that thread earlier but that case is for creating pdf documents and setting the font we want to use. I dont think that will work when reading from the pdf? — student, Oct 16 '15 at 20:41
You neither show your code nor sample pdfs. Thus, all one can say is that you either are doing something completely wrong or your PDF does not provide the information required for text extraction. — mkl, Oct 16 '15 at 20:44
@mkl i apologize for the badly formed question :) i added the code i am using, Thanks for your time. — student, Oct 16 '15 at 21:27
*How can i check if that pdf provides the required info for text extraction?* - a good first test is trying to copy & paste from Adobe Reader. If that does not work, generic text extraction usually won't work. — mkl, Oct 17 '15 at 09:36
@mkl thanks for the help, i think the problem is that the appropriate fonts are not embedded in the pdf document. Is there a way to embed a font in an existing pdf with itextsharp? thanks in advance — student, Oct 18 '15 at 13:38
Usually text extraction does not use information from the font program itself but instead information stored in PDF objects. Thus, I doubt embedding will help. If you want to try nonetheless, look at the example [EmbedFontPostFacto.cs](http://sourceforge.net/p/itextsharp/code/HEAD/tree/book/iTextExamplesWeb/iTextExamplesWeb/iTextInAction2Ed/Chapter16/EmbedFontPostFacto.cs), analogous to `EmbedFontPostFacto.java` from section 16.1 in *iText in Action, 2nd edition.* — mkl, Oct 19 '15 at 14:00
@mkl Thank you for the help, the problem was with the pdf document, i succeeded with another file :) Thanks! :) — student, Oct 19 '15 at 18:21
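
As a complement to the copy & paste test mentioned in the comments, the PDF objects that text extraction relies on can also be inspected programmatically. The following is only a rough sketch, assuming iTextSharp 5.x: it lists the fonts in each page's resources and reports whether each one has a `/ToUnicode` map. The class name `FontMappingReport` and the use of `args[0]` as the file path are illustrative assumptions, not part of the library or of the original question. A missing `/ToUnicode` entry does not by itself prove extraction will fail (fonts with a standard encoding can often be extracted without one), but a font with `Identity-H` encoding and no `/ToUnicode` map is a likely source of unreadable output.

```csharp
using System;
using iTextSharp.text.pdf;

// Hypothetical helper, not part of iTextSharp: for every font referenced in a
// page's resources, print its name, encoding and whether /ToUnicode is present.
static class FontMappingReport
{
    static void Main(string[] args)
    {
        // Assumption: args[0] is the path of the PDF to inspect.
        PdfReader reader = new PdfReader(args[0]);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                PdfDictionary resources = reader.GetPageN(page).GetAsDict(PdfName.RESOURCES);
                PdfDictionary fonts = resources == null ? null : resources.GetAsDict(PdfName.FONT);
                if (fonts == null)
                {
                    Console.WriteLine("page {0}: no font resources found", page);
                    continue;
                }

                foreach (PdfName key in fonts.Keys)
                {
                    // GetAsDict resolves indirect references to the font dictionary.
                    PdfDictionary font = fonts.GetAsDict(key);
                    if (font == null) continue;

                    Console.WriteLine(
                        "page {0}, font {1}: BaseFont={2}, Encoding={3}, ToUnicode={4}",
                        page,
                        key,
                        font.GetAsName(PdfName.BASEFONT),
                        font.GetDirectObject(PdfName.ENCODING),
                        font.Contains(PdfName.TOUNICODE) ? "yes" : "no");
                }
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Note that this sketch only looks at fonts declared directly in the page resources; fonts used inside form XObjects or annotations are not visited, so a clean report here is a hint rather than a guarantee.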
