5

I'm looking into making a web crawler/spider, but I need someone to point me in the right direction to get started.

Basically, my spider is going to search for audio files and index them.

I'm just wondering if anyone has any ideas for how I should do it. I've heard that doing it in PHP would be extremely slow. I know VB.NET, so could that come in handy?

I was thinking about using Google's filetype search to get links to crawl. Would that be OK?

asked by Belgin Fish, edited by the Tin Man

3 Answers

2

Here is a link to a tutorial on how to write a web crawler in Java: http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/ I'm sure if you Google it you can find ones for other languages.

answered by qw3n
  • If you make a web crawler in Java, does it have to be server-side? Right now I'm on shared hosting which doesn't allow Java, and I currently cannot afford to get a dedicated server or VPS. – Belgin Fish Jul 09 '10 at 03:04
  • No, this could be run on your home computer if you wanted to. – qw3n Jul 09 '10 at 03:46
2

In VB.NET you will need to get the HTML first, so use the WebClient class or HttpWebRequest and HttpWebResponse classes. There is plenty of info on how to use these on the interweb.

Then you will need to parse the HTML. I recommend using regular expressions for this.
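As a rough illustration (in Java, since that's the language of the other code in this thread), extracting links with a regex can look like the sketch below. The class name, sample HTML and pattern are made up for the example; in VB.NET you would do the same thing with a string fetched via WebClient.DownloadString and the System.Text.RegularExpressions.Regex class.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    public static void main(String[] args) {
        // stand-in for a page you have already downloaded (e.g. via WebClient in VB.NET)
        String html = "<a href=\"http://example.com/song.mp3\">song</a> "
                    + "<a href='/about.html'>about</a>";

        // crude pattern for href="..." or href='...'; enough to collect links to crawl,
        // not a general-purpose HTML parser
        Pattern href = Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']",
                Pattern.CASE_INSENSITIVE);
        Matcher m = href.matcher(html);
        while (m.find()) {
            System.out.println(m.group(1)); // queue the link for crawling, or index it if it's audio
        }
    }
}

For simply pulling out hyperlinks to follow, a pattern like this is usually enough; the caveat in the comments below is about trying to understand full HTML structure with regex.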

Your idea of using Google for a filetype search is a good one. I did a similar thing a few years ago to gather PDFs to test PDF indexing in SharePoint, which worked really well.

answered by Chris Diver, edited by the Tin Man
  • Thanks, any idea how I could insert data into my database from a desktop VB app? – Belgin Fish Jul 09 '10 at 03:08
  • Depends on the flavor of database. There is the `System.Data.SqlClient` namespace for SQL Server; for anything else you will need to look at the `System.Data.OleDb` namespace (a rough Java/JDBC sketch of the same idea follows these comments). It's better to use a console VB app if you want this to run unattended. – Chris Diver Jul 09 '10 at 03:14
  • 2
    In regards to parsing HTML with regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags Probably the single best SO answer ever. – rfusca Jul 09 '10 at 03:40
  • Thanks for the link, I suppose 'parse' was the wrong word to choose; he will just extract all relevant hyperlinks from the page, so the structure of the HTML doesn't matter. – Chris Diver Jul 09 '10 at 04:03
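Since the rest of the code in this thread is Java, here is a hedged sketch of the database insert Chris describes in the comments above, using JDBC as the Java analogue of `System.Data.SqlClient`. The connection string, table name and column name are all hypothetical, and it assumes a JDBC driver for your database is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class AudioIndexer {
    public static void main(String[] args) throws Exception {
        // connection string, table and column names are placeholders
        String url = "jdbc:sqlserver://localhost;databaseName=crawler;user=sa;password=secret";
        try (Connection con = DriverManager.getConnection(url);
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO audio_files (file_url) VALUES (?)")) {
            ps.setString(1, "http://example.com/song.mp3"); // a link the crawler found
            ps.executeUpdate();
        }
    }
}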
0

The pseudocode should look something like this:

Method spider(URL startURL) {
    Collection URLStore;          // can be an ArrayList of URLs still to visit
    push(startURL, URLStore);     // start with a known URL
    while (URLStore is not empty) {
        currURL = pop(URLStore);  // take a URL
        download the page at currURL;
        for each link URLx in the page that has not already been followed:
            push(URLx, URLStore); // queue it for crawling
    }
}

To read some data from a web page in Java you can do:

import java.io.*;
import java.net.URL;
public class PageReader {
    public static void main(String[] args) throws IOException {
        URL myURL = new URL("http://www.w3.org");
        BufferedReader in = new BufferedReader(new InputStreamReader(myURL.openStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null)   // you will get all content of the page
            System.out.println(inputLine);            // here you need to extract the hyperlinks
        in.close();
    }
}
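Putting the crawl loop and the page-reading snippet together, a minimal runnable sketch could look like the following; the class name, seed URL, `.mp3` filter and href regex are only placeholders for illustration.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AudioSpider {
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        Deque<String> urlStore = new ArrayDeque<>(); // URLs still to visit
        Set<String> seen = new HashSet<>();          // URLs already followed
        String start = "http://www.w3.org";          // placeholder seed URL
        urlStore.push(start);
        seen.add(start);

        while (!urlStore.isEmpty()) {
            String currURL = urlStore.pop();         // take a URL
            String page;
            try {
                page = download(currURL);            // download the page
            } catch (IOException e) {
                continue;                            // skip pages that fail to load
            }
            Matcher m = HREF.matcher(page);
            while (m.find()) {
                String link = m.group(1);
                if (link.endsWith(".mp3")) {
                    System.out.println("audio file: " + link); // index it here
                } else if (link.startsWith("http") && seen.add(link)) {
                    urlStore.push(link);             // not followed yet, queue it for crawling
                }
            }
        }
    }

    private static String download(String url) throws IOException {
        StringBuilder sb = new StringBuilder();
        BufferedReader in = new BufferedReader(new InputStreamReader(new URL(url).openStream()));
        String line;
        while ((line = in.readLine()) != null) sb.append(line).append('\n');
        in.close();
        return sb.toString();
    }
}

A real crawler would also resolve relative links, respect robots.txt and throttle its requests, but the overall shape is the same as the pseudocode above.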
answered by Memin, edited by the Tin Man