iTextSharp returning ????? when extracting Text from PDF

Question

I'm using iTextSharp with the following code to extract text from PDFs, and it was working well. However, today I received a different PDF, and extracting it produced a lot of ? ? ? ? characters.

Does anybody know why that's happening? Is there any way to at least check whether a PDF's text can't be extracted?

StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(arquivo);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
    text.Append(currentText);
}
pdfReader.Close();
return text.ToString();

What is your text encoding conversion trying to do? Convert something to UTF-8? Before the conversion, is the string valid? — Philippe Paré, Aug 26 '15 at 21:03
This is likely a font/encoding issue. Is the font installed on your development machine or server? Are you using the correct encoding? — mjw, Aug 26 '15 at 21:05
I'm not the one who's generating the PDF; I'm only extracting the text from a PDF that I received. — Felipe Santiago, Aug 26 '15 at 21:10
I don't know if I'm using the correct encoding to extract, but it worked with all the other files (and there were a lot of them). I will try changing it. — Felipe Santiago, Aug 26 '15 at 21:14
It's not an encoding problem. The text that GetTextFromPage returns is already nothing but ?????: `string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);` — Felipe Santiago, Aug 26 '15 at 21:28
Can you share the PDF in question? Most likely it does not contain the information required for text extraction but one has to check. — mkl, Aug 27 '15 at 04:09
I can absolutely tell you that the `currentText = Encoding.UTF8.GetString...` at best does nothing but more than likely is destroying your strings. It does not do what you think it does and should be removed. See [this for further discussion](http://stackoverflow.com/a/10191879/231316). Removing it might not fix your problem but there is no reason to keep it. — Chris Haas, Aug 27 '15 at 13:04
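
To illustrate the point in the last comment, here is a minimal sketch of what a conversion round trip of that shape does to a string. It is not taken from the thread: the class and method names (`EncodingRoundTripDemo`, `RoundTrip`) are hypothetical, and ISO-8859-1 with an explicit "?" fallback is an assumption standing in for `Encoding.Default`, whose actual value depends on the runtime and machine settings. Text that the legacy encoding can represent comes back unchanged, while anything it cannot represent is replaced during the first `GetBytes` call and cannot be recovered afterwards.

```csharp
using System;
using System.Text;

public static class EncodingRoundTripDemo
{
    // Same shape as the conversion in the question, but with an explicit
    // legacy encoding (assumption: ISO-8859-1) in place of Encoding.Default.
    public static string RoundTrip(string input)
    {
        Encoding legacy = Encoding.GetEncoding(
            "iso-8859-1",
            new EncoderReplacementFallback("?"),   // unrepresentable chars become '?'
            new DecoderReplacementFallback("?"));

        // Step 1: string -> legacy bytes. This is where information is lost.
        byte[] legacyBytes = legacy.GetBytes(input);

        // Step 2: legacy bytes -> UTF-8 bytes -> string. This only undoes step 1
        // for characters that survived it.
        byte[] utf8Bytes = Encoding.Convert(legacy, Encoding.UTF8, legacyBytes);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        // Representable in ISO-8859-1: comes back unchanged.
        Console.WriteLine(RoundTrip("plain text"));

        // Two Cyrillic letters, not representable in ISO-8859-1: each becomes '?'.
        Console.WriteLine(RoundTrip("\u0416\u0438"));
    }
}
```

In other words, the round trip can never repair text that `GetTextFromPage` returned incorrectly; it can only leave the string as it was or make it worse, so dropping that line is safe regardless of what is wrong with the PDF itself.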

0 Answers