
I'm using the DownloadData method of WebClient. My problem is that I haven't been able to figure out how to decode the escaped result (&lt; to <, &gt; to >, and the \n sequences) that Encoding.ASCII.GetString(myDataBuffer); produces for this page.

[Image: page source (source: iforce.co.nz)]

    /// <summary>
    /// Curl data from the PMID
    /// </summary>
    private void ClientPMID(int pmid)
    {
        //generate the URL for the client
        StringBuilder pmid_url_string = new StringBuilder();
        pmid_url_string.Append("http://www.ncbi.nlm.nih.gov/pubmed/").Append(pmid.ToString()).Append("?report=xml");
        Uri PMIDUri = new Uri(pmid_url_string.ToString());
        // Declare and initialize the client; dispose it once the download finishes.
        using (WebClient client = new WebClient())
        {
            // Download the web resource into a data buffer.
            byte[] myDataBuffer = client.DownloadData(PMIDUri);
            this.DownloadCompleted(myDataBuffer);
        }
    }
    /// <summary>
    /// Crawl over the binary from myDataBuffer
    /// </summary>
    /// <param name="myDataBuffer">Binary Buffer</param>
    private void DownloadCompleted(byte[] myDataBuffer)
    {
        string download = Encoding.ASCII.GetString(myDataBuffer);
        PMIDCrawler pmc = new PMIDCrawler(download, "/pre/PubmedArticle/MedlineCitation/Article");
        //iterate over each node in the file
        foreach (XmlNode xmlNode in pmc.crawl)
        {
            string AbstractTitle = xmlNode["ArticleTitle"].InnerText;
            string AbstractText = xmlNode["Abstract"]["AbstractText"].InnerText;
        }
    }

The code for PMIDCrawler is available in my other SO question about the DownloadStringCompletedEventHandler. However, when I take the content from Encoding.ASCII.GetString and run string html = HttpUtility.HtmlDecode(nHtml);, the output is not valid HTML (or XML), because the server does not respond to XML HTTP headers.

Glorfindel
classicjonesynz
  • Here is how to do it with JavaScript, for example: http://stackoverflow.com/questions/5796718/html-entity-decode – Hogan Mar 13 '13 at 02:48

1 Answer

Unfortunately, this server does not respond correctly to Accept: text/xml or Accept: application/xml, so you'll have to do this the hard way, with HttpUtility:

    string download = HttpUtility.HtmlDecode(Encoding.ASCII.GetString(myDataBuffer));

(or WebUtility.HtmlDecode on .NET Framework 4.5+)

or

    string download = Encoding.ASCII.GetString(myDataBuffer);
    if (download != null) { // this won't get all HTML escaped characters...
        download = download.Replace("&lt;", "<").Replace("&gt;", ">");
    }
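If you go the manual route, the order of the replacements matters once you also handle &amp;. The sketch below is only an illustration: ManualEntityDecoder is a hypothetical name, and it covers just the five predefined XML entities, so numeric references such as &#39; are still left untouched.

```csharp
using System;

public static class ManualEntityDecoder
{
    // Hypothetical helper: decodes only the five predefined XML entities.
    public static string Decode(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        return text
            .Replace("&lt;", "<")
            .Replace("&gt;", ">")
            .Replace("&quot;", "\"")
            .Replace("&apos;", "'")
            // "&amp;" goes last; decoding it first would turn "&amp;lt;"
            // into "&lt;" and then wrongly into "<" in the same pass.
            .Replace("&amp;", "&");
    }
}
```

For example, ManualEntityDecoder.Decode("&lt;Article id=&quot;1&quot;&gt;") gives back the tag with real angle brackets and quotes, which is the attribute case that the &lt;/&gt; pair alone does not cover.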

Also see this question for more information.
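Putting the pieces together, here is a rough sketch of going from the downloaded bytes to something you can query with the XPath from the question. EscapedXmlReader is a hypothetical name, the UTF-8 choice is my assumption about the response (Encoding.ASCII would turn any non-ASCII byte into '?'), and it assumes the decoded payload is a single well-formed element; LoadXml will throw an XmlException if it is not.

```csharp
using System;
using System.Net;
using System.Text;
using System.Xml;

public static class EscapedXmlReader
{
    // Hypothetical helper: turns an entity-escaped payload into an XmlDocument.
    public static XmlDocument Load(byte[] buffer)
    {
        if (buffer == null) throw new ArgumentNullException(nameof(buffer));

        // Assumption: the response is UTF-8 encoded.
        string escaped = Encoding.UTF8.GetString(buffer);

        // One decoding pass turns "&lt;PubmedArticle&gt;" into "<PubmedArticle>".
        string decoded = WebUtility.HtmlDecode(escaped);

        var document = new XmlDocument();
        document.LoadXml(decoded); // throws XmlException on malformed input
        return document;
    }
}
```

With a document loaded this way, document.SelectNodes("/pre/PubmedArticle/MedlineCitation/Article") should return the same nodes the question iterates over, provided the payload really is wrapped in a single pre element.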

Community
cfeduke
  • +1 for a good suggestion so far, but is there any way to get around the fact that each `attribute` is being escaped? For instance: http://pastebin.com/hjCwhEhL – classicjonesynz Mar 13 '13 at 03:16
  • Make sure the `\"` and `\n` you are seeing are not just artifacts of the Visual Studio debugger if you're inspecting a string while at a breakpoint (that used to get me all the time). You can verify with a `Console.WriteLine` if I'm remembering my C#/.NET correctly. – cfeduke Mar 13 '13 at 03:19
  • Are you certain? `curl --header "Accept:text/html" http://www.ncbi.nlm.nih.gov/pubmed/22918716\?report\=xml` is showing me the HTML entity escaped "XML" but no `\n` nor `\"` tokens. – cfeduke Mar 13 '13 at 03:22
  • I do have a feeling that `Encoding.ASCII` is causing the characters to become escaped, but I'm not 100% sure. – classicjonesynz Mar 13 '13 at 03:22
  • You're correct, it must be an artifact. I was checking at the wrong time during debugging. – classicjonesynz Mar 13 '13 at 03:25
  • `Encoding.ASCII` shouldn't be to blame here, since that's an honest-to-goodness conversion from a byte array to the associated ASCII characters. If anything, I'd suspect `HttpUtility.HtmlDecode` first, so try the `download.Replace...` if you're 100% certain it's not just VS "helpfully" representing a string literal – cfeduke Mar 13 '13 at 03:25
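
One way to settle the debugger question from the comments above without relying on how a watch window renders the value is to test the string itself. This is only a sketch, and DebuggerArtifactCheck is a hypothetical name: it reports whether the string contains an actual backslash character, which is what a genuinely escaped `\"` or `\n` in the data would require.

```csharp
using System;

public static class DebuggerArtifactCheck
{
    // Hypothetical helper: true only if the text holds a literal backslash character.
    public static bool HasLiteralBackslash(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        return text.IndexOf('\\') >= 0;
    }
}
```

If this returns false for the downloaded string, the `\"` and `\n` seen at the breakpoint are just the debugger showing the value in C# literal form, and `Console.WriteLine(download)` will print real quotes and line breaks.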