Dynamic web scraping with C#

Question

I am scraping a web page with C# using HtmlAgilityPack. It works well, but I ran into a problem with this website when I need to scrape data from the other pages of a product list. The link does not contain a page number, so I cannot reach the other pages by changing the URL. I found that the page is changed by the JavaScript "__doPostBack" function, which does not change the link; it just reloads the page and loads the new data. Below is my code for scraping the code and price of each product on this website. However, there are more products on the other pages (2, 3, 4, 5, ...) and I need to scrape data from all of them. On other websites I can simply pass the link to web.Load("Link"); and it works, because the link changes when you switch pages in the product list. In this case the link does not change when another page of the list is selected.

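using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using System.Threading.Tasks;
using System.Windows.Forms;
using HtmlAgilityPack;

public class CodeAndPrice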
{
    public string Code { get; set; }
    public string Price { get; set; }
}
public partial class Form1 : Form
{
    DataTable table;
    HtmlWeb web = new HtmlWeb();
    public Form1()
    {
        InitializeComponent();
        InitTable();
    }
    private void InitTable()
    {
        table = new DataTable("DataTableTest");
        table.Columns.Add("Code", typeof(string));
        table.Columns.Add("Price", typeof(string));
        dataGridView.DataSource = table;
    }
    private async Task<List<CodeAndPrice>> DataScraping()
    {
        // HtmlWeb.Load is synchronous, so run it on a thread-pool thread to keep the UI responsive.
        var page = await Task.Run(() => web.Load("https://www.kilobaitas.lt/Kompiuteriai/Plansetiniai_(Tablet)/CatalogStore.aspx?CatID=PL_626"));

        var codesNodes = page.DocumentNode.SelectNodes("//td[@class='mainContent']//div[@class='itemNormal']//div[@class='itemCode']");
        var pricesNodes = page.DocumentNode.SelectNodes("//td[@class='mainContent']//div[@class='itemNormal']//div[@class='itemCode']//parent::div//div[@class='itemBoxPrice']");
        if (codesNodes == null || pricesNodes == null)
            return new List<CodeAndPrice>();

        var codes = codesNodes.Select(node => node.InnerText.Replace("kodas", "").Replace(" ", "").Replace(":&nbsp;", ""));
        var prices = pricesNodes.Select(node => node.InnerText.Replace(" ", "").Replace("&nbsp;€", ""));

        return codes.Zip(prices, (code,price)=> new CodeAndPrice() { Code = code, Price = price }).ToList();
    }
    private async void Form1_Load(object sender, EventArgs e)
    {
        var results = await DataScraping();
        foreach (var rez in results)
        {
            table.Rows.Add(rez.Code, rez.Price);
        }
    }
}

Running __doPostBack('designer1$ctl11$ctl00$MainCatalogSquare1$XDataPaging1','paging.1'); in the browser's console loads page 2. Changing the number in "paging.*" makes the browser load page *+1.

What is the simplest way to trigger this JavaScript from my C# code, so that I can switch pages while scraping and collect the data from the other pages of this website?
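One browser-free way to approach this, sketched under assumptions: on ASP.NET WebForms pages, __doPostBack(target, argument) writes its two arguments into the hidden fields __EVENTTARGET and __EVENTARGUMENT and submits the page's form. The paging call can therefore, in principle, be replayed as an ordinary form POST that sends the page's hidden fields (such as __VIEWSTATE) back to the server. The names PostBackPager, PagingArgument, BuildPostBackFields and LoadPageAsync are hypothetical, only hidden inputs are collected (a simplification), and whether this particular site accepts such a request has not been verified. A tool such as Fiddler can show what the browser actually sends.

```csharp
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public static class PostBackPager
{
    // Postback target copied from the question's __doPostBack call.
    private const string PagingTarget =
        "designer1$ctl11$ctl00$MainCatalogSquare1$XDataPaging1";

    // Per the question, "paging.N" loads page N+1, so page 2 maps to "paging.1".
    public static string PagingArgument(int pageNumber)
    {
        return "paging." + (pageNumber - 1);
    }

    // Copies the named hidden inputs of the current page (on WebForms pages these
    // include __VIEWSTATE) and sets the two fields that __doPostBack would fill in.
    public static Dictionary<string, string> BuildPostBackFields(
        HtmlDocument current, string eventTarget, string eventArgument)
    {
        var fields = new Dictionary<string, string>();
        var inputs = current.DocumentNode.SelectNodes("//input[@type='hidden' and @name]");
        if (inputs != null)
        {
            foreach (var input in inputs)
            {
                // Attribute values are HTML-encoded in the page source.
                fields[input.GetAttributeValue("name", "")] =
                    HtmlEntity.DeEntitize(input.GetAttributeValue("value", ""));
            }
        }
        fields["__EVENTTARGET"] = eventTarget;
        fields["__EVENTARGUMENT"] = eventArgument;
        return fields;
    }

    // Posts the form back to the same URL and parses the response as the requested page.
    public static async Task<HtmlDocument> LoadPageAsync(
        HttpClient client, string url, HtmlDocument current, int pageNumber)
    {
        var fields = BuildPostBackFields(current, PagingTarget, PagingArgument(pageNumber));
        using (var content = new FormUrlEncodedContent(fields))
        using (var response = await client.PostAsync(url, content))
        {
            response.EnsureSuccessStatusCode();
            var next = new HtmlDocument();
            next.LoadHtml(await response.Content.ReadAsStringAsync());
            return next;
        }
    }
}
```

A caller would fetch the first page with client.GetStringAsync(url), load it with HtmlDocument.LoadHtml, and then call LoadPageAsync for pages 2, 3, and so on. It should reuse the same HttpClient so that cookies carry over, and pass the most recently loaded document each time, because the hidden fields change with every response. On .NET Framework, FormUrlEncodedContent can reject very long values, which may matter if __VIEWSTATE is large.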

It is really unclear what you are asking here. Can you post an example of your code, i.e. what you have tried and why it doesn't work? — Fraser, Mar 12 '18 at 18:44
Looking at your image, I think what you are saying is that when you change pages, the URL doesn't change. This is because it's probably using some kind of web service call to get more products, then updates the client side appropriately. You'll probably have to use something such as Fiddler to see the web request. — Icemanind, Mar 12 '18 at 19:04
When you execute the __doPostBack, does the document node change and load the new information for the next "page"? Like Icemanind says, it is probably making a call to a web server that returns data used to recreate the HTML in the section with the products. I am not sure whether page.DocumentNode changes when that occurs, or whether you have to do some kind of refresh and then run your SelectNodes calls again. — Keith Aymar, Mar 12 '18 at 21:34
When __doPostBack is executed in the browser's console, it refreshes the page and loads new HTML. However, maybe it is somehow possible to execute JavaScript from my C# code so that it affects the website I am scraping? — Dov95, Mar 13 '18 at 07:04
I have a product that scrapes e-commerce sites such as yours. I use CefSharp and scrape by injecting JS that returns the values I need. If you want to get DOM that is generated on the fly, first you have to find which tag the new DOM is appended to. After that you can get the new DOM by querying the innerHTML of that tag. I hope this works. — Bromo Programmer, Aug 09 '18 at 04:35
I figured this out by using Selenium, which I used for automated web browsing. I wrote a method that switches pages and gets the information from each new page. It's kind of a slow way, but it was OK for me, since I only scrape this website once a day, during the night. — Dov95, Aug 10 '18 at 06:57
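The Selenium approach from the last comment could look roughly like the sketch below. It is not the code that was actually used. It assumes the Selenium.WebDriver and Selenium.Support NuGet packages, a local Chrome with a matching driver, and that each postback triggers a full page reload, as described in the question. The names SeleniumPager, PagingArgument, IsStale and LoadPages are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

public static class SeleniumPager
{
    // Postback target copied from the question's __doPostBack call.
    private const string PagingTarget =
        "designer1$ctl11$ctl00$MainCatalogSquare1$XDataPaging1";

    // Per the question, "paging.N" loads page N+1, so page 2 maps to "paging.1".
    public static string PagingArgument(int pageNumber)
    {
        return "paging." + (pageNumber - 1);
    }

    // True once the element belongs to a document the browser has replaced.
    private static bool IsStale(IWebElement element)
    {
        try
        {
            var unused = element.Enabled;
            return false;
        }
        catch (StaleElementReferenceException)
        {
            return true;
        }
    }

    // Returns the HTML source of pages 1..pageCount, in order.
    public static List<string> LoadPages(string url, int pageCount)
    {
        var html = new List<string>();
        using (IWebDriver driver = new ChromeDriver())
        {
            var js = (IJavaScriptExecutor)driver;
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(15));

            driver.Navigate().GoToUrl(url);
            html.Add(driver.PageSource);

            for (int pageNumber = 2; pageNumber <= pageCount; pageNumber++)
            {
                IWebElement oldBody = driver.FindElement(By.TagName("body"));
                js.ExecuteScript("__doPostBack(arguments[0], arguments[1]);",
                    PagingTarget, PagingArgument(pageNumber));

                // Wait for the old document to be replaced, then for the new one to finish loading.
                wait.Until(d => IsStale(oldBody));
                wait.Until(d => "complete".Equals(
                    ((IJavaScriptExecutor)d).ExecuteScript("return document.readyState")));

                html.Add(driver.PageSource);
            }
        }
        return html;
    }
}
```

Each returned string can be passed to HtmlDocument.LoadHtml and queried with the same XPath expressions as in the question. If the site updated the list in place instead of reloading, the staleness wait would time out and would need to be replaced by a wait on the product elements themselves.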


0 Answers
