i tried using iTextSharp to get the text from a pdf document, it works great if the pdf file is with english text(latin chars). If i try to get the text from a pdf doc with cyrillic characters the output is just question marks. Are there some settings to be made, or cyrillic isnt supported? this is the code for creating the pdf:
string testText = "зззi";
string tmpFile = @"C:\items\test.pdf";
string myFont = @"C:\windows\fonts\verdana.ttf";
iTextSharp.text.Rectangle pgeSize = new iTextSharp.text.Rectangle(595, 792);
iTextSharp.text.Document doc = new iTextSharp.text.Document(pgeSize, 10, 10, 10, 10);
iTextSharp.text.pdf.PdfWriter wrtr;
wrtr = iTextSharp.text.pdf.PdfWriter.GetInstance(doc,
new System.IO.FileStream(tmpFile, System.IO.FileMode.Create));
doc.Open();
doc.NewPage();
iTextSharp.text.pdf.BaseFont bfR;
bfR = BaseFont.CreateFont(myFont, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
iTextSharp.text.BaseColor clrBlack =
new iTextSharp.text.BaseColor(0, 0, 0);
iTextSharp.text.Font fntHead =
new iTextSharp.text.Font(bfR, 34, iTextSharp.text.Font.NORMAL, clrBlack);
iTextSharp.text.Paragraph pgr =
new iTextSharp.text.Paragraph(testText, fntHead);
doc.Add(pgr);
doc.Close();
this is the code for retrieving the text:
PdfReader reader1
= new PdfReader("c:/items/test.pdf");
Console.WriteLine(PdfTextExtractor.GetTextFromPage(reader1, 1, new SimpleTextExtractionStrategy()));
Console.ReadLine();
the output is: ???i
EDIT 2 i managed to read text from the pdf i created, but still cant get the text from a random pdf. How can i check if that pdf provides the required info for text extraction?