My problem:

I converted an HTML file to plain text using the method below. It takes in an .html file (an Outlook .msg converted to .html), and I then removed all the tags using regular expressions.

    public string ReadEmailTemplate(string EmailTemplateFilePath)
    {
        return File.ReadAllText(EmailTemplateFilePath);
    }

But after removing all the HTML tags, I see black diamonds with white question marks inside them. I know this symbol appears in place of an unknown character. What I need is to remove those characters from the string. Is that possible in C#? I tried the method below, but it did not remove the black question-mark diamonds.

    public string replaceBlackQuestionMark(string output)
    {
        while(output.Contains('�'))
        {
            output = output.Replace("�", "");
        }
        return output;
    }

This is the output of the string in a messageBox. It contains black diamond with white question marks.

Image

keinz
  • _I converted a HTML to plain text_ Can you show us how you did that? Your problem is likely an encoding issue. – StepTNT May 06 '20 at 07:56
  • It is probably an unknown character produced by a different encoding. Check which encoding the source file uses and which one your code uses. – Leszek Mazur May 06 '20 at 07:56
  • @StepTNT I edited the question and added how I got to the final output. – keinz May 06 '20 at 08:01
  • Thank you, @StepTNT! I think this solves the question. =) – keinz May 06 '20 at 08:05
  • The unknown characters are not within the character set you're dealing with, and there are far more of them than there are characters you want to keep. In this case, whitelisting (keeping only ASCII, for example) is probably a better approach than blacklisting. – Zephyr May 06 '20 at 08:05
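
A minimal sketch of the two suggestions in the comments above, not a confirmed fix for this particular file. It assumes the diamonds are U+FFFD replacement characters, which a UTF-8 decoder emits for bytes it cannot decode, and that the file was saved in some non-UTF-8 encoding. The class name `EmailTextCleaner` and the helpers `RemoveReplacementChars` and `KeepAsciiOnly` are hypothetical names introduced for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class EmailTextCleaner
{
    // Read the file with an explicit encoding instead of relying on the
    // UTF-8 default of File.ReadAllText(path). Which encoding the file
    // really uses is an assumption the caller has to supply.
    public static string ReadEmailTemplate(string path, Encoding encoding)
    {
        return File.ReadAllText(path, encoding);
    }

    // Blacklist approach: remove U+FFFD, written as an escape so the
    // result does not depend on how the .cs source file itself is saved.
    public static string RemoveReplacementChars(string text)
    {
        return text.Replace("\uFFFD", string.Empty);
    }

    // Whitelist approach: keep printable ASCII plus tab, CR and LF,
    // and drop every other character.
    public static string KeepAsciiOnly(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            bool printableAscii = c >= ' ' && c <= '~';
            if (printableAscii || c == '\t' || c == '\r' || c == '\n')
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

If the file turns out to be Windows-1252, which is only a guess for HTML exported from Outlook, the call would look like `EmailTextCleaner.ReadEmailTemplate(path, Encoding.GetEncoding(1252))`. On .NET Core and .NET 5+ that code page is available only after `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)` has been called. Reading with the right encoding is preferable to stripping afterwards, because once a byte has been decoded to U+FFFD the original character is lost.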

0 Answers