I am trying to get specific information from a website. Right now I have the HTML as a string: as you can see in my code, the HTML source of the page ends up in "responseText". I know I could pick the values out with if statements, but that would be really tedious. I'm a newbie, so I'm sure there must be an easier way to retrieve information from a website. This is C# for a Windows Store app, so I can't use WebClient.

This code gets the string, but isn't there a way I can remove the HTML code and only leave the variables or something? I only need this for one web page, and I know which variables I want because I looked at the page's HTML source. Isn't there a way to request a list of variables and their values from the website?

    StringBuilder sb = new StringBuilder();
    // used on each read operation
    byte[] buf = new byte[8192];
    // prepare the web page we will be asking for
    HttpClient searchClient;
    searchClient = new HttpClient();
    searchClient.MaxResponseContentBufferSize = 256000;
    HttpResponseMessage response = await searchClient.GetAsync(url);
    response.EnsureSuccessStatusCode();
    responseText = await response.Content.ReadAsStringAsync();
chue x
David
  • Most developers would probably use regular expressions to parse the HTML response from the website and extract the values of interest. Have a look at using regular expressions. – Mike Panter Jun 04 '13 at 13:19
  • @MikePanter: Developers using regular expressions to parse HTML should be very aware of how brittle that approach is. I'd much rather use something like HTML Tidy. – Jon Skeet Jun 04 '13 at 13:20
  • Note that you should have `using` statements to ensure that you dispose of your `HttpClient` and `HttpResponseMessage` properly, and you don't use `buf` at all. Also, consider just using `HttpClient.GetStringAsync` instead of using the response message directly. – Jon Skeet Jun 04 '13 at 13:20
  • @JonSkeet: That depends on how you write your regular expressions! I would argue any attempt to parse a third party site was brittle, irrespective of the parsing technology. Regular expressions are easy to maintain. – Mike Panter Jun 04 '13 at 13:21
  • @MikePanter: You must have used different regexes to me then... I would much rather work with a document model of some description. See http://stackoverflow.com/questions/1732348 – Jon Skeet Jun 04 '13 at 13:23
  • @JonSkeet: We all would rather that. It doesn't make it more practical. What if the HTML is malformed? How does a document model work then? With regex, you have a choice over which elements of the page are critical to be considered successful parsing. You're then the one defining what is semantically acceptable, as opposed to relying on the correctness of a third-party's html. – Mike Panter Jun 04 '13 at 13:32
  • @MikePanter: That's what HTML Tidy is for - handling various kinds of brokenness and converting it into a nicer format. Obviously there are limits, but if it's too broken for HTML Tidy to handle, I really don't want to parse it... – Jon Skeet Jun 04 '13 at 13:36
  • @jonSkeet: let's chat instead. – Mike Panter Jun 04 '13 at 13:37
  • I doubt there's any point - I don't think either of us is going to convince the other. – Jon Skeet Jun 04 '13 at 13:37
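
The code-level suggestions in the comments above (dispose of the `HttpClient` with `using`, drop the unused `buf`, and call `HttpClient.GetStringAsync`) could look roughly like the following. This is a minimal sketch, not a tested Windows Store implementation; the `PageFetcher` class and `FetchPageAsync` method names are hypothetical.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class PageFetcher
{
    // Hypothetical helper: downloads the body of the page at 'url' as a string.
    public static async Task<string> FetchPageAsync(string url)
    {
        // The using block disposes the client even if the request throws.
        using (HttpClient searchClient = new HttpClient())
        {
            searchClient.MaxResponseContentBufferSize = 256000;

            // GetStringAsync throws HttpRequestException on a non-success
            // status code, so no separate EnsureSuccessStatusCode call is needed.
            return await searchClient.GetStringAsync(url);
        }
    }
}
```

From an async method it would be called as `responseText = await PageFetcher.FetchPageAsync(url);`. The result is still raw HTML, so the extraction problem discussed in the comments remains.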

1 Answer

This code gets the string, but isn't there a way I can remove the HTML code and only leave the variables or something?

What "variables"? You get the HTML - that's the response from the web server. If you want to strip that HTML, that's up to you. You might want to use HTML Tidy to make it more pleasant to work with, but the business of extracting relevant information from HTML is up to you. HTML isn't designed to be machine-readable as a raw information source - it's meant to be mark-up to present to humans.

You should investigate whether the information is available in a more machine-friendly source, with no presentation information etc. For example, there may be some way of getting the data as JSON or XML.
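
To illustrate the JSON route, here is a minimal sketch that assumes the site returns a JSON document. The `ProductInfo` type and its `name` and `price` fields are hypothetical, so the real shape depends entirely on what the site publishes. It uses `DataContractJsonSerializer` from `System.Runtime.Serialization.Json` to turn the response string into an object.

```csharp
using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Json;
using System.Text;

// Hypothetical shape of the data; real field names depend on the site.
[DataContract]
public class ProductInfo
{
    [DataMember(Name = "name")]
    public string Name { get; set; }

    [DataMember(Name = "price")]
    public double Price { get; set; }
}

public static class JsonExample
{
    // Deserializes a JSON string into a ProductInfo instance.
    public static ProductInfo Parse(string json)
    {
        var serializer = new DataContractJsonSerializer(typeof(ProductInfo));

        // The serializer reads from a stream, so wrap the string's UTF-8 bytes.
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(json)))
        {
            return (ProductInfo)serializer.ReadObject(stream);
        }
    }
}
```

With a response such as `{"name":"Widget","price":9.5}` in `responseText`, `JsonExample.Parse(responseText)` gives an object whose properties can be read directly, with no HTML to strip.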

Jon Skeet
  • So you are saying I need to make the if statements and format it, there is no other easier way? – David Jun 04 '13 at 13:22
  • @user1713352: I have no idea what you mean by "make the if statements and format it" - partly because you've given us very little indication of what you're trying to do. But no, extracting information from HTML (particularly HTML you don't control) isn't particularly simple - which is why I suggested you look for the same information being published in a more friendly format. – Jon Skeet Jun 04 '13 at 13:25