I am using HtmlAgilityPack for scraping in C# ASP.NET. So far I have had no problems scraping several websites; however, when I run the following code I get an error:

    var getHtmlWeb = new HtmlWeb();
    var homePage = getHtmlWeb.Load("https://www.corfo.cl/sites/cpp/home");

The error that appears is:

"The underlying connection was closed: An unexpected error occurred on a send."

The only website giving me problems is Corfo, and I don't know how to solve this. I appreciate your help.

Andrés
  • I can confirm that this website resets the connection for C# with HtmlAgilityPack. A request from Firefox works fine; only one image is missing, which is acceptable. This URL shows the same connection reset: "https://www.corfo.cl/sites/cpp/home". – Herbert Yu May 24 '17 at 19:50
  • Is this your website? How does the site work? It seems to me that the site sets a cookie from /sites and, I guess, reads that cookie back; if it is not there, it resets the connection. But I didn't check the JavaScript in detail. – Herbert Yu May 24 '17 at 19:53
  • @HerbertYu Ideally I would use HtmlAgilityPack for the data extraction, but can you think of another way to scrape "https://www.corfo.cl/sites/cpp/home"? – Andrés May 24 '17 at 19:59

1 Answer

This site relies on cookies to work. For example, one of the URLs it requests is https://www.corfo.cl/sites/Satellite;jsessionid=T8w78ZolfWgr3ZoEBBvE81nBiXbXIdjfF1In3bgpZiYvL_w8TF4p!1081543155!-596930586?c=Page&cid=1456408322328&pagename=CorfoPortalPublico/Page/corfoListadoOfertaInteligenteWebLayout

So, when you request www.corfo.cl, it first forwards to www.corfo.cl/sites/cpp/home; then, under the /sites/ folder, it sets the cookie jsessionid=OHS_1~T8w78ZolfWgr3ZoEBBvE81nBiXbXIdjfF1In3bgpZiYvL_w8TF4p!1081543155!-596930586 etc.

With this cookie, the page builds itself from all or some of the components associated with this jsessionid.

If the client code doesn't handle this logic (the two steps above), the server resets the connection, as expected, because it doesn't know how to build the page without the jsessionid.
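To make the cookie round trip concrete, here is a minimal sketch using the classic `System.Net` API. The class and method names (`CookieAwareFetch`, `CreateRequest`, `DownloadHtml`) are hypothetical, chosen only for illustration. It assumes that letting one `CookieContainer` collect the jsessionid during the forward is what the server expects; this thread does not confirm that this alone is enough for corfo.cl.

```csharp
using System;
using System.IO;
using System.Net;

public static class CookieAwareFetch
{
    // Creates a request that stores and replays cookies through the given container.
    public static HttpWebRequest CreateRequest(Uri target, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(target);
        // Without a container, cookies sent by the server are not stored or resent.
        request.CookieContainer = cookies;
        // Follow the forward (e.g. to /sites/cpp/home) within the same request.
        request.AllowAutoRedirect = true;
        return request;
    }

    // Downloads the response body as text; throws WebException if the request fails.
    public static string DownloadHtml(Uri target, CookieContainer cookies)
    {
        HttpWebRequest request = CreateRequest(target, cookies);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

A call could look like `string html = CookieAwareFetch.DownloadHtml(new Uri("https://www.corfo.cl/"), new CookieContainer());`, and the returned string can then be passed to HtmlAgilityPack's `HtmlDocument.LoadHtml` for parsing. Reusing the same container for later requests keeps the session cookie attached.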

The inner exception from System.Net.WebException is {"Authentication failed because the remote party has closed the transport stream."}
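To see that inner exception yourself, you can walk the `InnerException` chain when the request fails. This is a small sketch; the `ExceptionChain` helper is a hypothetical name, not part of .NET or HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;

public static class ExceptionChain
{
    // Returns "TypeName: message" for the exception and each nested
    // InnerException, outermost first.
    public static List<string> Messages(Exception ex)
    {
        var messages = new List<string>();
        for (Exception current = ex; current != null; current = current.InnerException)
        {
            messages.Add(current.GetType().Name + ": " + current.Message);
        }
        return messages;
    }
}
```

Wrapping the `Load` call in `try { ... } catch (System.Net.WebException ex) { foreach (var m in ExceptionChain.Messages(ex)) Console.WriteLine(m); }` prints the outer message followed by each inner one.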

Hope this helps!

Herbert Yu
  • I understand what you are saying, but how is the cookie set? Could you help me with the code? – Andrés May 24 '17 at 21:58
  • I added **cookies** but it still does not work. My experience with cookies is almost nil; thanks for your help =) The code I have is the following: `var a = "corfo.cl"; var getHtmlWeb = new HtmlWeb(); getHtmlWeb.UseCookies = true; var paginaInicio = getHtmlWeb.Load(a);` – Andrés May 25 '17 at 20:38
  • What other differences are there between this HtmlAgilityPack client and Firefox? Can you set a proper user agent, the same as the popular browsers use? I know for sure that Firefox works, and that it makes multiple round trips back and forth. Use the Firefox developer tools to step through this interactive process, and then use HtmlAgilityPack to simulate it. – Herbert Yu May 25 '17 at 22:17
  • I've written the code without HtmlAgilityPack, but it still does not work =( `Uri target = new Uri("https://www.corfo.cl/"); HttpWebRequest request = (HttpWebRequest)WebRequest.Create(target); CookieContainer cookies = new CookieContainer(); cookies.Add(new Cookie("JSESSIONID", "OHS_1~LPJB4yOTbZFPxPBwWcJjJ-fPmlfhnEv_XL1MVnKSrN7hVaB-LWi7!-596930586!-316486629") { Domain = target.Host }); request.CookieContainer = cookies; HttpWebResponse response = (HttpWebResponse)request.GetResponse();` – Andrés May 25 '17 at 23:21
  • Another issue is a limitation of Html Agility Pack: it is an HTML parser only, with no way to interpret JavaScript or bind it to its internal representation of the document. See https://stackoverflow.com/questions/11393075/running-scripts-in-htmlagilitypack for a possible answer to your question. I have not checked whether the target website is dynamic. If it is, you will have to use another tool. – Herbert Yu May 25 '17 at 23:26
  • First, I want to thank you for your help and patience. What tool do you recommend? I also tried AngleSharp; with it the application does not throw any errors, but I cannot access the HTML. I also tried enabling and adding cookies, but that did not give me any positive result. – Andrés May 26 '17 at 06:25
  • Have you used [htmlunit](http://htmlunit.sourceforge.net/)? I hope it does not take you too long to get it working. – Herbert Yu Jun 07 '17 at 23:18