I have to collect many URLs from a website and then copy them into an Excel file, and I'm looking for an automatic way to do that. The website is structured as a main page with about 300 links, and inside each of those links there are 2 or 3 links that are interesting to me. Any suggestions?

giogix

4 Answers


If you want to develop your solution in Python, I can recommend the Scrapy framework.
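
For instance, a rough sketch of a spider for a structure like yours (with a recent Scrapy version; the spider name, start URL and selectors are placeholders, not taken from your site) could look like this:

import scrapy

class LinkSpider(scrapy.Spider):
    name = 'links'
    start_urls = ['http://yourMainPage.example.com']  # placeholder main page

    def parse(self, response):
        # Follow each of the ~300 links on the main page
        for href in response.css('a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_subpage)

    def parse_subpage(self, response):
        # Emit every link found on each sub-page
        for href in response.css('a::attr(href)').extract():
            yield {'page': response.url, 'link': response.urljoin(href)}

Running it with scrapy runspider spider.py -o links.csv gives you a CSV file you can open directly in Excel.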

As far as inserting the data into an Excel sheet is concerned, there are ways to do it directly; see for example Insert row into Excel spreadsheet using openpyxl in Python. You can also write the data to a CSV file and then import it into Excel.
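
For example, a minimal sketch with openpyxl (the list of URLs here is just a placeholder for whatever you scraped) could be:

from openpyxl import Workbook

urls = ['http://example.com/a', 'http://example.com/b']  # placeholder data

wb = Workbook()
ws = wb.active
ws.append(['URL'])      # header row
for url in urls:
    ws.append([url])    # one URL per row
wb.save('links.xlsx')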

piokuc

If the links are in the HTML, you can use Beautiful Soup. This has worked for me in the past.

import urllib2
from bs4 import BeautifulSoup

# Fetch the page and parse it with Beautiful Soup
page = 'http://yourUrl.com'
opened = urllib2.urlopen(page)
soup = BeautifulSoup(opened, 'html.parser')

# Print the href attribute of every <a> tag on the page
for link in soup.find_all('a'):
    print(link.get('href'))
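
To also reach the 2 or 3 links nested inside each of the ~300 pages, a sketch along the same lines (assuming the sub-pages can be fetched the same way; urlparse.urljoin resolves relative links, and the main URL is a placeholder) could follow every link one level deeper:

import urllib2
import urlparse
from bs4 import BeautifulSoup

main_url = 'http://yourUrl.com'  # placeholder main page
main_soup = BeautifulSoup(urllib2.urlopen(main_url), 'html.parser')

collected = []
for link in main_soup.find_all('a', href=True):
    # Resolve the link relative to the main page and fetch it
    sub_url = urlparse.urljoin(main_url, link['href'])
    sub_soup = BeautifulSoup(urllib2.urlopen(sub_url), 'html.parser')
    # Keep every link found on the sub-page
    for sub_link in sub_soup.find_all('a', href=True):
        collected.append(urlparse.urljoin(sub_url, sub_link['href']))

print(collected)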
MinimalMaximizer

Have you tried Selenium or urllib? urllib is faster than Selenium. See http://useful-snippets.blogspot.in/2012/02/simple-website-crawler-with-selenium.html for a simple website crawler built with Selenium.


You can use Beautiful Soup for parsing: http://www.crummy.com/software/BeautifulSoup/

More information is in the docs here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

I wouldn't suggest Scrapy, because you don't need it for the work you described in your question.

For example, this code uses the urllib2 library to open the Google homepage and find all links in it, returned as a list:

import urllib2
from bs4 import BeautifulSoup

# Download the page and parse it
data = urllib2.urlopen('http://www.google.com').read()
soup = BeautifulSoup(data, 'html.parser')

# Print every <a> tag found on the page
print(soup.find_all('a'))

For handling Excel files, take a look at http://www.python-excel.org
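
As a sketch, if you go with xlwt (one of the libraries listed there), writing the collected URLs into an .xls file could look like this (the URL list is a placeholder):

import xlwt

urls = ['http://example.com/a', 'http://example.com/b']  # placeholder data

workbook = xlwt.Workbook()
sheet = workbook.add_sheet('Links')

# Write one URL per row in the first column
for row, url in enumerate(urls):
    sheet.write(row, 0, url)

workbook.save('links.xls')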

Abhishek