
I'm using iTextSharp with the following code to extract text from PDFs, and it has been working well. However, today I received a different PDF, and extracting its text produced a lot of ? ? ? ?.

Does anybody know why this is happening? Is there any way to at least check whether the text of a PDF can't be extracted?

StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(arquivo);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
    text.Append(currentText);
}
pdfReader.Close();
return text.ToString();


Felipe Santiago
  • What is your text encoding conversion trying to do? Convert something to UTF-8? Is the string valid before the conversion? – Philippe Paré Aug 26 '15 at 21:03
  • This is likely a font/encoding issue. Is the font installed on your development machine or server? Are you using the correct encoding? – mjw Aug 26 '15 at 21:05
  • I'm not the one who's generating the PDF; I'm only extracting the text from a PDF that I received. – Felipe Santiago Aug 26 '15 at 21:10
  • I don't know whether I'm using the correct encoding to extract, but it worked with all the other files (and there were a lot of them). I will try to change it. – Felipe Santiago Aug 26 '15 at 21:14
  • It's not an encoding problem. The text that GetTextFromPage returns is already nothing but ?????: string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy); – Felipe Santiago Aug 26 '15 at 21:28
  • Can you share the PDF in question? Most likely it does not contain the information required for text extraction but one has to check. – mkl Aug 27 '15 at 04:09
  • I can absolutely tell you that the `currentText = Encoding.UTF8.GetString...` at best does nothing but more than likely is destroying your strings. It does not do what you think it does and should be removed. See [this for further discussion](http://stackoverflow.com/a/10191879/231316). Removing it might not fix your problem but there is no reason to keep it. – Chris Haas Aug 27 '15 at 13:04
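
On the "can I at least check" part of the question: as far as I know, iTextSharp offers no built-in flag that says a page's text cannot be extracted, so one rough option is to inspect the returned string itself. The sketch below is only a heuristic under that assumption. `ExtractionCheck`, `LooksUnextractable` and the 0.5 threshold are hypothetical names and values chosen for illustration, not part of iTextSharp, and the code uses only the .NET base class library.

```csharp
using System;

public static class ExtractionCheck
{
    // Hypothetical helper (not an iTextSharp API): returns true when the text
    // extracted from a page looks like garbage, meaning that at least
    // `threshold` of its non-whitespace characters are '?' or U+FFFD.
    public static bool LooksUnextractable(string pageText, double threshold = 0.5)
    {
        // Empty or whitespace-only output is treated as "nothing extractable".
        // Assumption: genuinely blank pages are flagged as well.
        if (string.IsNullOrWhiteSpace(pageText))
            return true;

        int visible = 0;
        int suspicious = 0;
        foreach (char c in pageText)
        {
            if (char.IsWhiteSpace(c))
                continue;
            visible++;
            if (c == '?' || c == '\uFFFD')
                suspicious++;
        }

        // visible > 0 here, because the string is not whitespace-only.
        return (double)suspicious / visible >= threshold;
    }
}
```

It would be called on the string that `PdfTextExtractor.GetTextFromPage` returns, before any encoding conversion touches it. Text that legitimately consists mostly of question marks would be a false positive, so the threshold may need tuning for your documents.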
