
I am using the code from this post: Get HTML code from website in C#

to save the HTML in a string:

// Requires: using System.Net; using System.IO; using System.Text;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
if (response.StatusCode == HttpStatusCode.OK)
{
    Stream receiveStream = response.GetResponseStream();
    StreamReader readStream;
    if (response.CharacterSet == null)
        readStream = new StreamReader(receiveStream);
    else
        readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));
    string data = readStream.ReadToEnd();
    response.Close();
    readStream.Close();

    msgBox.Text = data;
}

However, the page I am trying to read first shows a temporary loader page. How can I get around this, so that it saves the HTML again once the actual page has loaded?

Best regards

WtFudgE

2 Answers


the page I am trying to read has a temporary loader page

It all depends on what that means and how that "temporary loader page" works. For example, if that page (whether via JavaScript code or an HTML META redirect) is making a request to the destination page, then that request is what you need to capture. Currently you're reading from a given URL:

(HttpWebRequest)WebRequest.Create(url)

This is essentially making a GET request to that URL and reading the response. But based on your description it sounds like that's the wrong URL. It sounds like there's a second URL which contains the actual information you're looking for.

Given that, you essentially have two options:

  1. Determine what that other URL is manually, by visiting the page and inspecting the requests in your browser, and use that as the value of url in your code.
  2. Determine how that other URL is itself determined by the page code of the first URL (is it something embedded in the page source somewhere?), parse it out of the response you get from the first url value, and make a second request to the new URL.

Clearly the first option is a lot easier. The second is only necessary if that second URL changes with each visit or is expected to change frequently over time. If that's the case then you'd have to basically reverse-engineer how the website is performing the second request so you can perform it as well.
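As a rough sketch of the second option, suppose the loader page points at the real page through a META refresh tag (one possibility among several; the site might use JavaScript instead). The URL, the regex, and the DownloadHtml helper below are illustrative only, not taken from your page:

// Sketch only: assumes the loader page contains something like
// <meta http-equiv="refresh" content="0; url=http://example.com/real-page">
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

class LoaderScraper
{
    static string DownloadHtml(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = response.CharacterSet == null
            ? new StreamReader(stream)
            : new StreamReader(stream, Encoding.GetEncoding(response.CharacterSet)))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // First request: the loader page (hypothetical URL).
        string loaderHtml = DownloadHtml("http://example.com/loader");

        // Pull the redirect target out of the META refresh tag.
        Match match = Regex.Match(loaderHtml,
            @"http-equiv\s*=\s*[""']refresh[""'][^>]*url\s*=\s*([^""'>\s]+)",
            RegexOptions.IgnoreCase);

        if (match.Success)
        {
            // Second request: the page that actually holds the data.
            string realHtml = DownloadHtml(match.Groups[1].Value);
            Console.WriteLine(realHtml);
        }
    }
}

If the second URL is discovered by JavaScript rather than embedded in the markup, the network tab of your browser's debugging tools will show which request delivers the data, and you can point the same download code at that URL instead.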

Web scraping can get complicated pretty quickly, and often turns into a game of cat and mouse (even unintentionally and mutually unaware) between the person scraping the content and the person hosting the content (who might not want it to be scraped).

David
  • Thanks for your reply, however I already checked if the URL changes and it doesn't. It's exactly the same. My guess is that it uses some kind of JavaScript loading, but I can't seem to get around it. Is there maybe a way to open that page and wait a few seconds before reading the HTML code? – WtFudgE Jun 25 '14 at 19:15
  • @WtFudgE: I'm not sure what "waiting a few seconds" is meant to accomplish. What you need to determine is if the initial page actually *contains* the data you're looking for (maybe it's just styled not to be visible and then later is made visible via JavaScript), or whether it loads that data from a *separate* call to the server. The URL in the browser's address bar might not change, but if the data you're looking for is coming from a separate request then *that's* the request you want to make. Check your browser's debugging tools to examine network requests. – David Jun 25 '14 at 19:17

Why don't you use a WebBrowser control and add a delay with

await Task.Delay(n)
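A minimal sketch of that idea, assuming a WinForms form that already has a WebBrowser control (webBrowser1) and a TextBox (msgBox); the control names, the URL, and the five-second delay are all placeholders:

// Sketch only: control names, the URL and the delay length are placeholders.
using System;
using System.Threading.Tasks;
using System.Windows.Forms;

public partial class MainForm : Form
{
    private async void loadButton_Click(object sender, EventArgs e)
    {
        // Wait until the browser reports the (loader) document has finished loading.
        var tcs = new TaskCompletionSource<bool>();
        WebBrowserDocumentCompletedEventHandler handler = null;
        handler = (s, args) =>
        {
            webBrowser1.DocumentCompleted -= handler;
            tcs.TrySetResult(true);
        };
        webBrowser1.DocumentCompleted += handler;
        webBrowser1.Navigate("http://example.com/loader");

        await tcs.Task;
        await Task.Delay(5000); // give the loader page's script time to swap in the real content

        // Read the DOM as the browser sees it now, including changes made by script.
        msgBox.Text = webBrowser1.Document.Body.OuterHtml;
    }
}

A fixed delay is fragile, though; checking for a specific element in webBrowser1.Document (or handling DocumentCompleted for the final navigation) is usually more reliable than guessing a number of seconds.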
FelixSFD