Scraping Multiple Sites using C#

Question

I'm new to C# and data scraping, and I'm not sure how to proceed. My plan is to search for some keywords on Google, get the title, description, and URL of each search result, then pass each URL to seocheki.net and extract the data from there too. How should I do it?

I don't yet know how to extract the Google search results, so I started by trying to get the data from seocheki.

I tried to use HtmlAgilityPack to get the data from seocheki:

    // Assumes a class-level field: private readonly HtmlWeb web = new HtmlWeb();
    private async Task<List<Seocheki>> ResultFromSeocheki(int pageNum)
    {
        string url = "http://seocheki.net/site-check.php?u=http%3A%2F%2Fwww.gamerankings.com%2Fbrowse.html";

        // HtmlWeb.Load is synchronous, so run it off the calling thread.
        var doc = await Task.Run(() => web.Load(url));
        var titleNodes = doc.DocumentNode.SelectNodes("//*[@id=\"td-title\"]");
        var descNodes = doc.DocumentNode.SelectNodes("//*[@id=\"td-desc\"]");
        var keywordNodes = doc.DocumentNode.SelectNodes("//*[@id=\"td-kw\"]");
        var h1Nodes = doc.DocumentNode.SelectNodes("//*[@id=\"td-h1\"]");

        var title = titleNodes.Select(node => node.InnerText).ToList();
        var desc = descNodes.Select(node => node.InnerText).ToList();
        var keyword = keywordNodes.Select(node => node.InnerText).ToList();
        var h1 = h1Nodes.Select(node => node.LastChild.InnerText).ToList();

        // Assumption: Seocheki has settable string properties Title, Description, Keywords and H1.
        return title.Select((t, i) => new Seocheki
        {
            Title = t, Description = desc[i], Keywords = keyword[i], H1 = h1[i]
        }).ToList();
    }

But this is the result:

[image: screenshot of the result]

How can I scrape the data? HtmlAgilityPack doesn't seem to work for me.

score 0 · Accepted Answer · edited May 23 '17 at 12:08

Take a look at this answer: Running Scripts in HtmlAgilityPack

Basically, HtmlAgilityPack is an HTML parser, not a browser, so it does not run scripts. If you view the source of http://seocheki.net/site-check.php?u=http%3A%2F%2Fwww.gamerankings.com%2Fbrowse.html you will see the following section:

    ...
    <th class="pelem">title</th><td id="td-title" colspan="3">&nbsp;</td>
    </tr>
    <tr>
    <th class="pelem">description</th><td id="td-desc" colspan="3">&nbsp;</td>
    </tr>
    <tr>
    <th class="pelem">keywords</th><td id="td-kw" colspan="3">&nbsp;</td>
    </tr>
    ...

You can see that the HTML content of those cells is indeed just &nbsp; (a non-breaking space). The actual text you see on the web page is injected via JavaScript after the page loads, so HtmlAgilityPack is not going to get it for you; see the link at the top of this answer for more details.
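To make this concrete, here is a minimal sketch that parses the same static markup with HtmlAgilityPack and checks whether a cell holds any real text. The class and method names (StaticHtmlDemo, IsCellEmpty) are made up for illustration, and the HTML string is a trimmed-down copy of the snippet above rather than the live page.

```csharp
using System;
using HtmlAgilityPack;

public static class StaticHtmlDemo
{
    // Hypothetical helper: returns true when the element with the given id
    // is missing or contains nothing but whitespace in the static HTML.
    public static bool IsCellEmpty(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var cell = doc.DocumentNode.SelectSingleNode("//*[@id=\"" + id + "\"]");
        if (cell == null)
        {
            return true;
        }

        // InnerText keeps entities as written; DeEntitize decodes &nbsp;
        // into a non-breaking space, which counts as whitespace.
        string text = HtmlEntity.DeEntitize(cell.InnerText);
        return string.IsNullOrWhiteSpace(text);
    }

    public static void Main()
    {
        // Markup as sent by the server, before any script has run.
        const string html =
            "<table><tr><th class=\"pelem\">title</th>" +
            "<td id=\"td-title\" colspan=\"3\">&nbsp;</td></tr></table>";

        Console.WriteLine(IsCellEmpty(html, "td-title"));
    }
}
```

No amount of XPath tuning changes this, because the text simply is not in the document that HtmlAgilityPack receives.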

However, a quick look at the site's JavaScript shows that these fields are populated from a call to http://seocheki.net/get/get-siteinfo.php?url=http%3A%2F%2Fwww.gamerankings.com%2Fbrowse.html

Which gives back the following JSON:

    {"title":" Reviews and News Articles - GameRankings","desc":"GameRankings browse/search engine shows games review scores from around the net.","kw":"","h1":"Browse and Search Games","inlink":"68","outlink":"6","lastm":"-","fsize":"54.6KB","ttime":"0.996秒"}

So you can query and use that JSON directly*

*I am making no claim either way that it is legal/appropriate to do so.
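As a rough sketch of what querying that JSON could look like in C#, the code below uses HttpClient and System.Text.Json from the base class library. The SiteInfo and SeochekiClient types and their member names are hypothetical; only the endpoint URL and the JSON keys (title, desc, kw, h1) come from the response shown above, and the endpoint may change or refuse automated requests.

```csharp
using System;
using System.Net.Http;
using System.Text.Json;
using System.Text.Json.Serialization;
using System.Threading.Tasks;

// Hypothetical DTO: the property names are illustrative, the JSON keys
// are the ones seen in the response above. Unmapped keys are ignored.
public sealed class SiteInfo
{
    [JsonPropertyName("title")] public string Title { get; set; } = "";
    [JsonPropertyName("desc")] public string Description { get; set; } = "";
    [JsonPropertyName("kw")] public string Keywords { get; set; } = "";
    [JsonPropertyName("h1")] public string H1 { get; set; } = "";
}

public static class SeochekiClient
{
    private static readonly HttpClient Http = new HttpClient();

    // Builds the endpoint URL, percent-encoding the page address.
    public static string BuildUrl(string pageUrl) =>
        "http://seocheki.net/get/get-siteinfo.php?url=" + Uri.EscapeDataString(pageUrl);

    // Parses a JSON payload of the shape shown above.
    public static SiteInfo Parse(string json) =>
        JsonSerializer.Deserialize<SiteInfo>(json) ?? new SiteInfo();

    // Downloads and parses the site info for one page.
    public static async Task<SiteInfo> GetSiteInfoAsync(string pageUrl)
    {
        string json = await Http.GetStringAsync(BuildUrl(pageUrl));
        return Parse(json);
    }
}

public static class Program
{
    public static async Task Main()
    {
        SiteInfo info = await SeochekiClient.GetSiteInfoAsync(
            "http://www.gamerankings.com/browse.html");

        // The title in the sample response has a leading space, hence Trim().
        Console.WriteLine(info.Title.Trim());
        Console.WriteLine(info.Description);
        Console.WriteLine(info.H1);
    }
}
```

Keeping Parse separate from the network call makes it possible to check the mapping against a saved copy of the JSON without hitting the site.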

I see. Thank you for your help! – Blake, Nov 30 '16 at 04:02
