I'm using iTextSharp to read data from a PDF. The resulting string looks correct when I inspect it, but string.Replace fails to replace the text.
Therefore, I'm guessing this is some sort of encoding issue, but I'm failing to pin it down.
My code to import the text from the PDF should convert it to UTF-8:
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader("file.pdf");
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    // Round-trip through the system default code page, intended to convert the text to UTF-8
    currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
    text.AppendLine(currentText);
}
pdfReader.Close();
Then I try to replace two hyphens, a space, and a hyphen (-- -) with three hyphens (---):
input = input.Replace("-- -", "---");
Removing the UTF-8 conversion from the PDF import makes no difference (the screenshot shows a breakpoint after the Replace call, with the text still unchanged).
EDIT:
Here is a link to a sample file. When opened in Notepad or Notepad++, it displays a series of spaces and hyphens (see the Notepad++ screenshot with whitespace rendering). However, when the file is read in C#, its characters are not interpreted as a Unicode hyphen and a Unicode space.
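One way I could check what the string really contains is to dump each character as a code point and compare it with the literal passed to Replace. This is only a minimal sketch: the class and method names are made up for illustration, and the sample string with a no-break space (U+00A0) is a hypothetical stand-in, since I don't yet know which characters the PDF text actually holds.

```csharp
using System;
using System.Linq;

static class CodePointDump
{
    // Formats every UTF-16 code unit of s as "U+XXXX", separated by spaces.
    public static string Describe(string s) =>
        string.Join(" ", s.Select(c => $"U+{(int)c:X4}"));

    static void Main()
    {
        // Hypothetical extracted text: looks like "-- -" but uses a no-break space.
        string extracted = "--\u00A0-";
        // The literal used in the Replace call (plain ASCII hyphens and space).
        string pattern = "-- -";

        Console.WriteLine(Describe(extracted)); // U+002D U+002D U+00A0 U+002D
        Console.WriteLine(Describe(pattern));   // U+002D U+002D U+0020 U+002D

        // Replace compares characters ordinally, so the look-alike does not match.
        Console.WriteLine(extracted.Replace(pattern, "---") == extracted); // True
    }
}
```

If the two dumps differ for the real text, Replace is not matching because the characters only look the same, which would rule out the UTF-8 conversion as the cause.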