Encoding ASCII as HTML

Question

I'm trying out the DownloadData method of WebClient. My current problem is that I haven't been able to figure out how to convert the entity-escaped ASCII result (&lt; to <, \n, &gt; to >), which is produced by Encoding.ASCII.GetString(myDataBuffer);, from this page.

    /// <summary>
    /// Curl data from the PMID
    /// </summary>
    private void ClientPMID(int pmid)
    {
        //generate the URL for the client
        StringBuilder pmid_url_string = new StringBuilder();
        pmid_url_string.Append("http://www.ncbi.nlm.nih.gov/pubmed/").Append(pmid.ToString()).Append("?report=xml");
        Uri PMIDUri = new Uri(pmid_url_string.ToString());
        //declare and initialize the client
        WebClient client = new WebClient();
        // Download the Web resource and save it into a data buffer. 
        byte[] myDataBuffer = client.DownloadData(PMIDUri);
        this.DownloadCompleted(myDataBuffer);
    }
    /// <summary>
    /// Crawl over the binary from myDataBuffer
    /// </summary>
    /// <param name="myDataBuffer">Binary Buffer</param>
    private void DownloadCompleted(byte[] myDataBuffer)
    {
        string download = Encoding.ASCII.GetString(myDataBuffer);
        PMIDCrawler pmc = new PMIDCrawler(download, "/pre/PubmedArticle/MedlineCitation/Article");
        //iterate over each node in the file
        foreach (XmlNode xmlNode in pmc.crawl)
        {
            string AbstractTitle = xmlNode["ArticleTitle"].InnerText;
            string AbstractText = xmlNode["Abstract"]["AbstractText"].InnerText;
        }
    }

The code for PMIDCrawler is available in my other SO question about the DownloadStringCompletedEventHandler. However, the output of string html = HttpUtility.HtmlDecode(nHtml); is not valid HTML (or XML) after the content has been read with Encoding.ASCII.GetString, because the server does not respond to XML HTTP headers.

Here is how to do it with JavaScript, for example: http://stackoverflow.com/questions/5796718/html-entity-decode – Hogan, Mar 13 '13 at 02:48

score 2 · Accepted Answer · edited May 23 '17 at 12:09

Unfortunately, this server does not respond correctly to Accept: text/xml or Accept: application/xml, so you'll have to do this the hard way (HttpUtility):

    string download = HttpUtility.HtmlDecode(Encoding.ASCII.GetString(myDataBuffer));

(or WebUtility.HtmlDecode on .NET Fx 4.5+)

or

    string download = Encoding.ASCII.GetString(myDataBuffer);
    if (download != null) { // this won't get all HTML escaped characters...
        download = download.Replace("&lt;", "<").Replace("&gt;", ">");
    }

Also see this question for more information.
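
As a rough sketch of the first option, here is the decode step followed by loading the result into an XmlDocument. The helper name DecodeEscapedXml, the class name, and the sample payload are made up for illustration, and the sample only assumes the escaped XML sits inside a pre element because the XPath in the question starts at /pre. It has not been run against the live server.

```csharp
using System;
using System.Net;
using System.Text;
using System.Xml;

static class PubMedDecodeSketch
{
    // Hypothetical helper: converts the downloaded bytes to a string and
    // replaces HTML entities (&lt;, &gt;, &amp;, &quot;, ...) with the
    // characters they stand for.
    public static string DecodeEscapedXml(byte[] buffer)
    {
        string escaped = Encoding.ASCII.GetString(buffer);
        return WebUtility.HtmlDecode(escaped);
    }

    static void Main()
    {
        // Made-up sample standing in for the bytes returned by DownloadData.
        byte[] sample = Encoding.ASCII.GetBytes(
            "<pre>&lt;PubmedArticle&gt;&lt;MedlineCitation&gt;&lt;Article&gt;" +
            "&lt;ArticleTitle&gt;Example title&lt;/ArticleTitle&gt;" +
            "&lt;/Article&gt;&lt;/MedlineCitation&gt;&lt;/PubmedArticle&gt;</pre>");

        string xml = DecodeEscapedXml(sample);

        // Once the entities are decoded, the text parses as ordinary XML.
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        XmlNode title = doc.SelectSingleNode(
            "/pre/PubmedArticle/MedlineCitation/Article/ArticleTitle");
        Console.WriteLine(title.InnerText); // prints "Example title"
    }
}
```

Decoding exactly once matters: text that was already escaped in the XML (for example &amp;amp; in the page source) comes out as &amp; and stays valid for the XML parser.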

answered Mar 13 '13 at 03:13

cfeduke


+1 for a good suggestion so far, but is there any way to get around the fact that each `attribute` is being escaped? For instance: http://pastebin.com/hjCwhEhL – classicjonesynz Mar 13 '13 at 03:16
Make sure the `\"` and `\n` you are seeing are not just artifacts of the Visual Studio debugger if you're inspecting a string while at a breakpoint (that used to get me all the time). You can verify with a `Console.WriteLine` if I'm remembering my C#/.NET correctly. – cfeduke Mar 13 '13 at 03:19
Are you certain? `curl --header "Accept:text/html" http://www.ncbi.nlm.nih.gov/pubmed/22918716\?report\=xml` is showing me the HTML entity escaped "XML" but no `\n` nor `\"` tokens. – cfeduke Mar 13 '13 at 03:22
I do have a feeling that `Encoding.ASCII` is causing the characters to become escaped, but I'm not 100% sure. – classicjonesynz Mar 13 '13 at 03:22
You're correct, it must be an artifact. I was checking at the wrong time during debugging. – classicjonesynz Mar 13 '13 at 03:25
`Encoding.ASCII` shouldn't be to blame here, since that's an honest-to-goodness conversion from a byte array to the associated ASCII characters. If anything, I'd suspect `HttpUtility.HtmlDecode` first, so try the `download.Replace...` if you're 100% certain it's not just VS "helpfully" representing a string literal. – cfeduke Mar 13 '13 at 03:25
