
I am looking to create a simple web service that crawls pages on specific websites and looks for a person's name. Does anybody know of any examples of this, or can anyone help me get started?

Edit: I should mention I want to do this with Visual Studio C#. I will only be looking at English news sites that I specify.

hippietrail
Andy Xufuris
  • http://www.google.by/search?q=crawl+web+page+C%23&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a – Andrei Schneider Feb 09 '12 at 21:45
  • I haven't tried anything yet; I haven't found a good example for this. – Andy Xufuris Feb 09 '12 at 21:46
  • It would vary based on a number of things. The language you're using (and what tools are available for it) and the specific kind of content you are trying to grab are two that come to mind immediately. I would recommend searching for "screen scraper". – Gent Feb 09 '12 at 21:43

2 Answers


Here is a simple function that returns true if a Web page contains a person's name:

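// Returns true if the page at the given URL contains the name.
static bool PageContainsName(string url)
{
    string response;
    // WebClient is disposed as soon as the download finishes.
    using (System.Net.WebClient wc = new System.Net.WebClient())
    {
        response = wc.DownloadString(url);
    }
    return response.Contains("John Doe");
}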

For finding the links within the page, check out this question: "Parse HTML links using C#".
You can collect the distinct URLs throughout the site and run the code above for each URL you find.
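
As a rough illustration of that collect-and-check loop, here is a minimal sketch. The names NameCrawler, ExtractSameHostLinks, FindPagesWithName, and the maxPages limit are all made up for this example, and the regex is only a crude stand-in for a real HTML parser. The download step is passed in as a delegate, so this is one possible shape under those assumptions, not a finished crawler.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NameCrawler
{
    // Pulls href values out of raw HTML and keeps only http(s) links on the
    // same host as pageUrl. Regex matching is an approximation of HTML parsing.
    public static HashSet<string> ExtractSameHostLinks(string html, Uri pageUrl)
    {
        var links = new HashSet<string>();
        foreach (Match m in Regex.Matches(html, "href\\s*=\\s*[\"']([^\"']+)[\"']", RegexOptions.IgnoreCase))
        {
            Uri link;
            // Resolves relative links such as "/news/1" against the page URL.
            if (Uri.TryCreate(pageUrl, m.Groups[1].Value, out link)
                && (link.Scheme == Uri.UriSchemeHttp || link.Scheme == Uri.UriSchemeHttps)
                && link.Host == pageUrl.Host)
            {
                // GetLeftPart(Query) drops any #fragment so duplicates collapse.
                links.Add(link.GetLeftPart(UriPartial.Query));
            }
        }
        return links;
    }

    // Breadth-first crawl from startUrl (must be an absolute URL). Returns the
    // URLs of visited pages whose HTML contains name, ignoring case.
    // fetch is a hypothetical hook: any function that maps a URL to its HTML.
    public static List<string> FindPagesWithName(string startUrl, string name, int maxPages, Func<string, string> fetch)
    {
        string start = new Uri(startUrl).GetLeftPart(UriPartial.Query);
        var found = new List<string>();
        var seen = new HashSet<string> { start };
        var queue = new Queue<string>();
        queue.Enqueue(start);
        int visited = 0;

        while (queue.Count > 0 && visited < maxPages)
        {
            string url = queue.Dequeue();
            visited++;

            string html;
            try { html = fetch(url); }
            catch (Exception) { continue; } // skip pages that fail to download

            if (html.IndexOf(name, StringComparison.OrdinalIgnoreCase) >= 0)
                found.Add(url);

            foreach (string link in ExtractSameHostLinks(html, new Uri(url)))
                if (seen.Add(link)) queue.Enqueue(link); // Add is false for repeats
        }
        return found;
    }
}
```

To run it against a live site you could pass a delegate that wraps the WebClient call from above, for example `url => { using (var wc = new System.Net.WebClient()) return wc.DownloadString(url); }`. Keeping maxPages small while experimenting avoids hammering the site.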

Also, try this search on Google to see what it finds: site:www.somesite.com "John Doe"

James Lawruk
  • Hmm, between this and the Agility Pack I may be able to make it follow the various links on a landing page, check each one for the name, and save the matching links. – Andy Xufuris Feb 09 '12 at 22:00

Using C#, your best option for a crawler and parser (the two parts of your solution) would be the functionality exposed by the Html Agility Pack, which can be found on CodePlex.

Refer to this answer for an example usage scenario: "How to use HTML Agility pack"
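
For a feel of what that looks like, here is a small sketch assuming the HtmlAgilityPack NuGet package is referenced. The PageParser class and its two method names are invented for this example; only HtmlDocument, LoadHtml, DocumentNode, SelectNodes, GetAttributeValue, and InnerText come from the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

public static class PageParser
{
    // Returns the raw href value of every <a> element that has one.
    public static List<string> GetHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return hrefs;

        foreach (HtmlNode a in anchors)
            hrefs.Add(a.GetAttributeValue("href", ""));
        return hrefs;
    }

    // Checks the text content of the page (markup stripped) for the name, ignoring case.
    public static bool TextContainsName(string html, string name)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.InnerText.IndexOf(name, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

Searching InnerText rather than the raw HTML avoids false matches inside attribute values, although InnerText can still include the contents of script and style elements.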

Kane