I need to get HTML source that is as close as possible to what Chrome or another browser shows under View Page Source. But the following code returns different HTML for the same URL.

String url = @"https://m.facebook.com";
try
{
   HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
   HttpWebResponse response = (HttpWebResponse)request.GetResponse();
   if (response.StatusCode == HttpStatusCode.OK)
   {
      Stream receiveStream = response.GetResponseStream();
      StreamReader readStream = null;
      if (response.CharacterSet == null)
         readStream = new StreamReader(receiveStream);
      else
         readStream = new StreamReader(receiveStream,
                     Encoding.GetEncoding(response.CharacterSet.Replace("\"", string.Empty)));
      //readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));
      string data = readStream.ReadToEnd();
      response.Close();
      readStream.Close();
      //string[] sps = data.Split(new string[] { @"videoId"":""" }, StringSplitOptions.RemoveEmptyEntries);
   }
}
catch (Exception ex)
{}

It returns the following:

<!DOCTYPE html>
<html lang="en" id="facebook" class="no_js">
<head><meta charset="utf-8" /><meta name="referrer" content="default" id="meta_referrer" /><script nonce="EW2LyNr7">window._cstart=+new Date();</script><script nonce="EW2LyNr7">function envFlush(a){function b(b){for(var c in a)b[c]=a[c]}window.requireLazy?window.requireLazy(["Env"],b):(window.Env=window.Env||{},b(window.Env))}envFlush...

But the browser's page source is:

<!DOCTYPE html><html><head><script id="u_0_2" nonce="db1veTby">"use strict";window.MPageLoadClientMetrics=function(){var a=+new Date(),b={prelude_onload:["jewels_visible","first_paint","visibility_change","tti"],nav_started:["first_paint","visibility_change","prelude_onload"],first_paint:["jewels_visible","visibility_change","prelude_onload"],jewels_visible:["tti","visibility_change","navigation","prelude_onload"],tti:["e2e","visibility_change","navigation"]},c=3,d=3,e="nav_started",f=!0,g="",h="",i=1,j="",k="",l="",m=function(){},n=!0,o=!1,p=!1,q=[],r=window.performance||window.msPerformance||window.webkitPerformance||{},s=(window.requestAnimationFrame||window.webkitRequestAnimationFrame||window.mozRequestAnimationFrame||window.oRequestAnimationFrame||window.msRequestAnimationFrame||window.setTimeout).bind(window),t=window.location.origin||window.location.protocol+"//"+window.location.hostname+(window.location.port&&":"+window.location.port);function u(b,c,d,e,f,i){r.timing&&r.timing.navigationStart&&(a=r.timing.navigationStart),j=b,k=c,l=d,g=e,h=f,n=i,x()}function v(a){var c=b[e];return c&&c.indexOf(a)!==-1}function w(a){return!b[a]}function x(){var a,b;do

How can I get HTML similar to what Chrome's View Page Source shows?

Sh.Imran
Ruwan Liyanage
  • What you are seeing in the first blob is the "your browser isn't compatible with Facebook because you don't have JavaScript capabilities" message. That makes sense, because you are pulling directly from the website programmatically over a socket and nothing more (see the class name `no_js`?). Once the site detects JavaScript, it will most likely send down a bunch of code to render Facebook's real page (the second blob). This is why web scrapers are now built on top of browsers, which know how to run JavaScript and render a page. Just a guess though; I could be wrong. – Andy Aug 09 '20 at 02:08
  • @Andy thanks for the reply. Is there a way to make a request similar to a browser's without actually using a browser? – Ruwan Liyanage Aug 09 '20 at 02:36
  • No idea. I know of certain frameworks that you can use with C# to work with a webpage that has been rendered by a browser. One of those frameworks is named `selenium`, but I think that is more for automated testing. – Andy Aug 09 '20 at 02:39
  • This question just popped up on SO: https://stackoverflow.com/questions/63322012/is-there-a-way-to-manually-harvest-the-captcha-for-manual-solving This person is using selenium to scrape websites (albeit having problems with captchas), so I guess people *do* use it for that :) – Andy Aug 09 '20 at 02:46
  • I also tried selenium a few months ago for something, but as I remember it was not reliable enough for a solution that runs on many different environments and setups. – Ruwan Liyanage Aug 09 '20 at 02:48

1 Answer

Facebook requires a User-Agent header to serve the page properly; otherwise it redirects to the /unsupportedbrowser page.

Here's an example using HttpClient:

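using System;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    // A single shared HttpClient instance is reused for all requests.
    private static readonly HttpClient client = new HttpClient();

    static async Task Main(string[] args)
    {
        // Identify as desktop Chrome so the server returns the regular page.
        client.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36");

        string result = await client.GetStringAsync("https://m.facebook.com");
        Console.WriteLine(result);
        Console.ReadKey();
    }
}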

The output is exactly the same as in Google Chrome.
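
The same idea can probably be applied to the `HttpWebRequest` code from the question, which may help where `HttpClient` is not available. The sketch below is only an illustration: the `BrowserLikeRequest` class and its `Create` and `Download` methods are hypothetical names, and reading the body as UTF-8 is an assumption based on the `<meta charset="utf-8" />` tag in the returned HTML. On .NET Framework, `User-Agent` is a restricted header for `HttpWebRequest`, so it is set through the `UserAgent` property rather than `Headers.Add`. The single-argument `Headers.Add(string)` overload expects a full `name:value` string, which would explain the separator error reported in the comments.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class BrowserLikeRequest
{
    // Same Chrome 84 User-Agent string as in the HttpClient example above.
    public const string ChromeUserAgent =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
        "(KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36";

    // Builds the request without sending it. The User-Agent is set via the
    // dedicated property, not via request.Headers.Add(...).
    public static HttpWebRequest Create(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = ChromeUserAgent;
        return request;
    }

    // Sends the request and returns the response body as text.
    // Assumption: the body is UTF-8 encoded.
    public static string Download(string url)
    {
        HttpWebRequest request = Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Calling `BrowserLikeRequest.Download("https://m.facebook.com")` would then take the place of the manual stream handling in the question. Whether the result matches the browser's source still depends on what the server decides to send for that User-Agent.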

aepot
  • The reference for HttpClient is not found, and even the Microsoft.Net.Http reference is not listed for me. I use .NET 4.0. How can I fix this? – Ruwan Liyanage Aug 09 '20 at 10:21
  • @nawala you need at least .NET Framework 4.5, but 4.6 (or newer) is recommended. Why are you using 4.0? Windows XP support? – aepot Aug 09 '20 at 10:27
  • Yes, I have to keep it compatible with more user machines. I tried to set headers on HttpWebRequest as in your example but could not figure it out. I added the line request.Headers.Add("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"); to my code, but it gave me an error saying the value does not have a : separator. Any idea how to correct it? – Ruwan Liyanage Aug 09 '20 at 11:50
  • @nawala consider [this answer](https://stackoverflow.com/a/33659756/12888024). Keep in mind that Windows XP is the only reason to stay stuck on the ancient 4.0. Windows 7, even x86, readily supports 4.6-4.8. – aepot Aug 09 '20 at 11:56
  • Many thanks @aepot. I marked yours as the answer; you shed some light on a way to get my code to work. I'll try to figure it out with HttpWebRequest. I'm using VS2010 for my projects, so that is another reason for the limit. – Ruwan Liyanage Aug 09 '20 at 12:29
  • @nawala installing VS 2019 should not be a problem. Just try it, maybe with a copy of the project to be sure. You can also comfortably stay on 4.0 with VS 2019. You're welcome. :) – aepot Aug 09 '20 at 12:36
  • Yes, I am also thinking of moving to a newer version. Thanks again – Ruwan Liyanage Aug 09 '20 at 12:48