2

After learning about a few different technologies, I wanted to build a small project using UWP + NoSQL: a small UWP app that grabs the daily horoscope and displays it on my Raspberry Pi every morning.

So I took a WebClient and did the following:

WebClient client = new WebClient();
client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";
string downloadString = client.DownloadString("http://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-today.aspx?sign=2");

But it seems the site detects that this request isn't coming from a browser, because the interesting part is missing from the content (when I check with a browser, it is present in the initial HTML, according to Fiddler).

I also tried ScrapySharp, but I got the same result. Any idea why?

(I've already done the UWP part, so I don't want to change the topic of my personal project just because it gets detected as a "bot".)
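In case it matters: I also tried sending a fuller set of browser-like headers. Here is a sketch of what I mean, using HttpClient; the header values are just typical desktop-browser defaults, nothing the site documents:

```csharp
using System.Net.Http;

static class BrowserLikeClient
{
    // Builds an HttpClient that sends browser-style headers.
    // The values below are ordinary desktop-browser defaults.
    public static HttpClient Create()
    {
        var client = new HttpClient();
        client.DefaultRequestHeaders.UserAgent.ParseAdd(
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2");
        client.DefaultRequestHeaders.Accept.ParseAdd("text/html");
        client.DefaultRequestHeaders.AcceptLanguage.ParseAdd("en-US");
        return client;
    }
}

// Usage:
// string html = await BrowserLikeClient.Create().GetStringAsync(
//     "http://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-today.aspx?sign=2");
```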

EDIT

It seems I wasn't clear enough. The issue is **not** that I'm unable to parse the HTML; the issue is that I don't receive the expected HTML when using ScrapySharp/WebClient.

EDIT2

Here is what I retrieve: http://pastebin.com/sXi4JJRG

For example, I don't get the "Star ratings by domain" section or the related images for each star.
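For context, my plan was to derive each rating from the star image's URL, roughly like this (the `rating-N` filename pattern is just an invented example, not the site's actual naming):

```csharp
using System.Text.RegularExpressions;

static class StarRating
{
    // Pulls a numeric rating out of an image URL such as
    // "/images/stars/rating-4.gif" (hypothetical naming scheme).
    public static int? FromImageUrl(string url)
    {
        Match m = Regex.Match(url, @"rating-(\d)");
        return m.Success ? (int?)int.Parse(m.Groups[1].Value) : null;
    }
}
```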

J4N
  • I tried to find that XML, but I couldn't. Can you explain where it is? – ganchito55 Mar 26 '16 at 10:17
  • @ganchito55 sorry, I meant the initial HTML, not XML. I was just saying that it isn't loaded by an AJAX call – J4N Mar 26 '16 at 13:53
  • Have you tried HTML Agility Pack? – Rajshekar Reddy Mar 26 '16 at 13:57
  • What part specifically do you want? Can you share a pastebin or something similar with the part of the page that you want? – ganchito55 Mar 26 '16 at 14:00
  • @Reddy As far as I'm aware, HTML Agility Pack is only for parsing, no? – J4N Mar 27 '16 at 07:34
  • @ganchito55 I'm trying to retrieve the part with the images. It is contained in the HTML I receive when I'm using Chrome, but I don't receive it when I retrieve the web page with `ScrapySharp`/`WebClient` – J4N Mar 27 '16 at 07:36
  • Which image exactly are you missing? It seems most images are there when querying with .NET. – Evk Mar 31 '16 at 17:30
  • @Evk I added what I retrieved with the previous code, and a concrete example of a text that is not present in the retrieved page. – J4N Apr 01 '16 at 05:08
  • Cannot reproduce that with exact code you provided. I receive full content every time, including "Star ratings by domain" and other stuff. – Evk Apr 01 '16 at 07:23
  • Hi @J4N, I've made the request with both Fiddler and WebClient. In both cases I receive the full HTML result, like a normal browser; only the images (naturally) are missing. What exactly do you expect that you don't get? – Glauco Cucchiar Apr 01 '16 at 14:36

5 Answers

1

You can read the entire content of the web page using the code snippet shown below:

internal static string ReadText(string Url, int TimeOutSec)
{
    using (HttpClient _client = new HttpClient() { Timeout = TimeSpan.FromSeconds(TimeOutSec) })
    {
        _client.DefaultRequestHeaders.Accept.Add(new System.Net.Http.Headers.MediaTypeWithQualityHeaderValue("text/html"));
        // GetAsync returns a Task, so block on .Result in this synchronous helper
        using (HttpResponseMessage _responseMsg = _client.GetAsync(Url).Result)
        using (HttpContent content = _responseMsg.Content)
        {
            return content.ReadAsStringAsync().Result;
        }
    }
}

Or, more simply:

public static void DownloadString (string address)
{
    WebClient client = new WebClient ();
    string reply = client.DownloadString (address);

    Console.WriteLine (reply);
}

(Reference: https://msdn.microsoft.com/en-us/library/fhd1f0sw(v=vs.110).aspx)
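For completeness, here is an awaitable variant of the same idea, with the HttpClient passed in so the caller can reuse it (a sketch, not required by the snippets above):

```csharp
using System.Net.Http;
using System.Threading.Tasks;

static class PageReader
{
    // Async counterpart of ReadText above; the caller owns and reuses the client.
    public static async Task<string> ReadTextAsync(HttpClient client, string url)
    {
        using (HttpResponseMessage response = await client.GetAsync(url))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```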

Alexander Bell
  • I already did this; the issue is that I don't get the same HTML content as when asking with a browser. The server seems to detect that I'm asking from an application and returns only the structure of the website, without the interesting content – J4N Mar 27 '16 at 07:37
1

Yes, WebClient won't give you the expected result: many sites use scripts to load content, so to emulate a browser you would also have to run the page's scripts. I have never done anything similar, so my answer is purely theoretical.

To solve the problem you need a "headless browser". I know of two projects for this (I have never tried either of them):

http://webkitdotnet.sourceforge.net/ - it seems to be outdated

http://www.awesomium.com/

Anton Semenov
  • But as I mentioned, the content was present in the initial HTML file that I got with Fiddler, so it hasn't been loaded by an AJAX call – J4N Mar 29 '16 at 15:40
  • That's strange. I just tried your code and got the HTML with the horoscope. I can only assume you may have been banned for making many requests during testing – Anton Semenov Mar 29 '16 at 16:49
  • I just tested from another network and another computer, and I got the same result. I posted the result in a pastebin link (in my main question). Are you sure you got the interesting part and not just some HTML? – J4N Mar 30 '16 at 05:18
  • Sorry, I was away from my computer for the last few days. Check out the code I received with your snippet: http://pastebin.com/8cWe15hk – Anton Semenov Apr 04 '16 at 06:49
0

Some time ago I used http://www.nrecosite.com/phantomjs_wrapper_net.aspx and it worked well; as Anton mentioned, it is a headless browser. Maybe it will be of some help.

  • Do you have an example of how to open a page? I guess it's in the PhantomJS part, but I cannot find an example – J4N Apr 02 '16 at 06:24
0

I'm wondering if all the 'interesting parts' you expect to see 'in the content' are images? Are you aware that you have to retrieve any images separately? The fact that an HTML page contains <img .../> tags does not magically display them as well. As you can see with Fiddler, after retrieving a page the browser then retrieves all images, style sheets, JavaScript and every other item that is referenced, but not included, in the page. (You might need to clear the browser cache to see this happen.)
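If the missing parts were indeed images, the first step would be to collect the src attributes from the page you already downloaded and fetch each one separately. A rough sketch of that first step (a simple regex; a real HTML parser such as HTML Agility Pack would be more robust):

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ImageSources
{
    // Collects the src attribute of every <img> tag in an HTML string.
    public static List<string> Extract(string html)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(html, @"<img[^>]*\bsrc\s*=\s*[""']([^""']+)[""']",
                                          RegexOptions.IgnoreCase))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```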

  • I'm quite aware of how the HTML works. I'm expecting to get the image elements and check their URLs to determine the rating. As I showed in the pastebin in the question, it seems I only get the template of the website (header, menu, footer) but not much content – J4N Apr 02 '16 at 20:29
0

OK, I think I know what's going on: I compared the real output (no fancy user-agent strings) to the output in your pastebin and found something interesting. On line 213, your pastebin has:

<li class="dropdown"><a href="/us/profiles/zodiac/index-profile-zodiac-sign.aspx" class="dropdown-toggle" data-hov...ck">Forecast Tarot Readings</div>

Note the data-hov...ck near the end. In the real output, this was:

<li class="dropdown"><a href="/us/profiles/zodiac/index-profile-zodiac-sign.aspx" class="dropdown-toggle" data-hover="dropdown" data-toggle="link">Astrology</a>

followed by about 600 lines of code, including the aforementioned 'interesting part'. On line 814, it says:

<div class="bot-explore-col-subtitle f14 blocksubtitle black">Forecast Tarot Readings</div>

which, starting with the ck in black, matches up with the rest of the pastebin output. So either pastebin condensed the output, or the original output already was condensed.

I created a new console application, inserted your code, and got the result I expected, including the 600 lines of HTML you seem to be missing:

static void Main(string[] args)
{
    WebClient client = new WebClient();
    client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";
    string downloadString = client.DownloadString("http://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-today.aspx?sign=2");

    File.WriteAllText(@"D:\Temp\source-mywebclient.html", downloadString);
}

My WebClient is from System.Net. Changing the UserAgent hardly has any effect; only a couple of links come out a bit different.

So, to sum it up: your problem has nothing to do with content being inserted dynamically after the initial GET, but possibly with WebClient combined with UWP. There's another question regarding WebClient and UWP on the site, (UWP) WebClient and downloading data from URL, which states you should use HttpClient. Maybe that's a solution?

Community