I am trying to read text from a PDF into a string using the iTextSharp library:
// Open the PDF and extract the text of page 1.
iTextSharp.text.pdf.PdfReader pdfReader = new iTextSharp.text.pdf.PdfReader(@"C:\mypdf.pdf");
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
// Round-trip the extracted text through the system default encoding into UTF-8.
string text = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
pdfReader.Close();
Console.WriteLine(text);
This usually works, but every few lines the whitespace is omitted, leaving me with output like "thisismyoutputwithoutwhitespace". The text that parses correctly looks no different from the text that doesn't, and the same passages are consistently parsed incorrectly on every run, which makes me think the cause is something within the PDFs themselves.