In my C# code, I am extracting text from a PDF, but the text it gives back contains some odd characters. If I search for "CLE action", which I know is in the PDF document, the search returns false. I found out that, after extracting the text, the space between the two words has an ASCII byte value of 63 instead of 32, the value of a regular space.
Is there a quick way to fix the encoding on the text?
Currently I am using this method, but I think it's slow and only works for that one character. Is there some fast method that works for all characters?
    // Requires: using System.IO; using System.Text;
    public static string fix_encoding(string src)
    {
        StringWriter return_str = new StringWriter();
        // Convert the string to ASCII bytes so each character can be checked.
        byte[] byte_array = Encoding.ASCII.GetBytes(src);
        int len = byte_array.Length;
        for (var i = 0; i < len; i++)
        {
            byte byt = byte_array[i];
            if (byt == 63)
            {
                // Byte 63 is what the odd space shows up as; write a normal space instead.
                return_str.Write(" ");
            }
            else
            {
                return_str.Write(Encoding.ASCII.GetString(byte_array, i, 1));
            }
        }
        return return_str.ToString();
    }
This is how I call this method:
    StringWriter output = new StringWriter();
    output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, page, new SimpleTextExtractionStrategy()));
    currentText = fix_encoding(output.ToString());
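A note on the value 63: Encoding.ASCII.GetBytes substitutes '?' (byte 63) for every character that has no ASCII representation, so the character in the extracted string is probably not a literal '?' but some non-ASCII space, for example a no-break space (U+00A0). That is an assumption about this particular PDF, not something confirmed here. Under that assumption, a minimal sketch of a more general fix is to skip the byte conversion and map every Unicode space separator to a plain space. The class and method names below (TextCleanup, NormalizeSpaces) are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class TextCleanup
{
    // Hypothetical helper: replaces every character in the Unicode
    // "space separator" category (U+00A0, U+2003, ...) with a plain
    // ASCII space. Line breaks and a literal '?' are left untouched.
    public static string NormalizeSpaces(string src)
    {
        var sb = new StringBuilder(src.Length);
        foreach (char c in src)
        {
            bool isSpaceSeparator =
                char.GetUnicodeCategory(c) == UnicodeCategory.SpaceSeparator;
            sb.Append(isSpaceSeparator ? ' ' : c);
        }
        return sb.ToString();
    }
}
```

It could be called in place of fix_encoding, e.g. currentText = TextCleanup.NormalizeSpaces(output.ToString()); after that, currentText.Contains("CLE action") should succeed if the gap really is a Unicode space. Printing (int)c for the character between the two words would show which code point the extractor actually produced.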