
I'm getting the HTML source of this URL: "http://duhoc.dantri.com.vn/du-hoc/30-hoc-sinh-trung-tuyen-dai-hoc-my-nam-2018-chia-se-bi-kip-thanh-cong-20180418093640358.htm" with the following method:

    private static string getPageSource(string url)
    {
        try
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.UserAgent = "SO/1.0";
            // Dispose the response even when the status is not 200 OK.
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                if (response.StatusCode == HttpStatusCode.OK)
                {
                    // Read the whole response body as UTF-8 text.
                    using (Stream receiveStream = response.GetResponseStream())
                    using (StreamReader readStream = new StreamReader(receiveStream, Encoding.UTF8))
                    {
                        return readStream.ReadToEnd();
                    }
                }
            }
        }
        catch (Exception ex)
        {
            WriteLog("Exception get Page Source, Ex = " + ex.ToString());
        }
        return null;
    }

The browser displays the page title like this: "30 học sinh trúng tuyển đại học Mỹ năm 2018 chia sẻ “bí kíp” thành công". But when I fetch the HTML source of that page with the method above, the title becomes `30 học sinh trúng tuyển đại học Mỹ năm 2018 chia sẻ “b&#237;k&#237;p” th&#224;nh c&#244;ng`. To fix this, I tried changing UTF8 to:

    Encoding encode = System.Text.Encoding.GetEncoding(1255);

I also tried UTF7 and UTF32, but nothing works. What am I doing wrong?

user2905416
  • 2
    I see no differences in the strings you pasted? The website does at least state that it is UTF-8: content="text/html; charset=utf-8" – dsdel Apr 18 '18 at 04:40
  • That is because Stack Overflow rendered the entities in the paragraph; it should be `30 học sinh trúng tuyển đại học Mỹ năm 2018 chia sẻ “b&#237;k&#237;p” th&#224;nh c&#244;ng` – user2905416 Apr 18 '18 at 04:43
  • From the HTML in the "View source" window: `30 học sinh trúng tuyển đại học Mỹ năm 2018 chia sẻ “b&#237;k&#237;p” th&#224;nh c&#244;ng`. The HTML actually contains these values.
    – ProgrammingLlama Apr 18 '18 at 04:53
  • @john: yes, but how does the browser display it correctly? How can I convert that title into what I see in the browser? – user2905416 Apr 18 '18 at 04:56
  • You need to apply System.Web.HttpUtility.HtmlDecode on your string to resolve the HTML character entities. – ckuri Apr 18 '18 at 04:59
  • 1
    Possible duplicate of [How can I decode HTML characters in C#?](https://stackoverflow.com/questions/122641/how-can-i-decode-html-characters-in-c) – ProgrammingLlama Apr 18 '18 at 04:59
  • @john: yes, HttpUtility.HtmlDecode resolves this – user2905416 Apr 18 '18 at 05:28
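
The fix from the comments can be sketched as follows. Sequences such as `&#237;` are HTML character references written in plain ASCII, so no choice of `Encoding` on the `StreamReader` changes them; they have to be decoded after the text is read. This sketch uses `System.Net.WebUtility.HtmlDecode`, which I am swapping in for the `System.Web.HttpUtility.HtmlDecode` named in the comments because it needs no reference to System.Web. The class `TitleDecoder` and the method `DecodeEntities` are hypothetical names for illustration, and the sample string is only the tail of the title.

```csharp
using System;
using System.Net;
using System.Text;

public static class TitleDecoder
{
    // Hypothetical helper: turns HTML character references such as &#237;
    // into the characters they stand for (here, "í").
    public static string DecodeEntities(string html)
    {
        return WebUtility.HtmlDecode(html);
    }

    public static void Main()
    {
        // Sample input: the end of the title as it appears in the page source.
        string raw = "“b&#237;k&#237;p” th&#224;nh c&#244;ng";

        // Make sure the console can show the Vietnamese characters.
        Console.OutputEncoding = Encoding.UTF8;

        // Prints: “bíkíp” thành công
        Console.WriteLine(DecodeEntities(raw));
    }
}
```

In the question's code, this would probably mean decoding the extracted title (or whatever text is pulled out of the page) rather than the whole document, since decoding the full HTML would also turn escaped markup like `&lt;` back into `<`.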
