I have to collect many URLs from a website and then copy them into an Excel file, and I'm looking for an automatic way to do that. The website is structured as a main page with about 300 links, and inside each of those links there are 2 or 3 links that are interesting to me. Any suggestions?

giogix

4 Answers


If you want to develop your solution in Python, I can recommend the Scrapy framework.
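
For instance, a rough sketch of a spider for a structure like yours (with a recent Scrapy version; the spider name, start URL and selectors are placeholders, not taken from your site) could look like this:

import scrapy

class LinkSpider(scrapy.Spider):
    name = 'links'
    start_urls = ['http://yourMainPage.example.com']  # placeholder main page

    def parse(self, response):
        # Follow each of the ~300 links on the main page
        for href in response.css('a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_subpage)

    def parse_subpage(self, response):
        # Emit every link found on each sub-page
        for href in response.css('a::attr(href)').extract():
            yield {'page': response.url, 'link': response.urljoin(href)}

Running it with scrapy runspider spider.py -o links.csv gives you a CSV file you can open directly in Excel.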

As far as inserting the data into an Excel sheet is concerned, there are ways to do it directly; see for example Insert row into Excel spreadsheet using openpyxl in Python. You can also write the data to a CSV file and then import it into Excel.
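
For example, a minimal sketch with openpyxl (the list of URLs here is just a placeholder for whatever you scraped) could be:

from openpyxl import Workbook

urls = ['http://example.com/a', 'http://example.com/b']  # placeholder data

wb = Workbook()
ws = wb.active
ws.append(['URL'])      # header row
for url in urls:
    ws.append([url])    # one URL per row
wb.save('links.xlsx')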

piokuc

If the links are in the HTML, you can use Beautiful Soup. This has worked for me in the past.

import urllib2
from bs4 import BeautifulSoup

# Fetch the page and parse it with Beautiful Soup
page = 'http://yourUrl.com'
opened = urllib2.urlopen(page)
soup = BeautifulSoup(opened, 'html.parser')

# Print the href attribute of every <a> tag on the page
for link in soup.find_all('a'):
    print(link.get('href'))
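
To also reach the 2 or 3 links nested inside each of the ~300 pages, a sketch along the same lines (assuming the sub-pages can be fetched the same way; urlparse.urljoin resolves relative links, and the main URL is a placeholder) could follow every link one level deeper:

import urllib2
import urlparse
from bs4 import BeautifulSoup

main_url = 'http://yourUrl.com'  # placeholder main page
main_soup = BeautifulSoup(urllib2.urlopen(main_url), 'html.parser')

collected = []
for link in main_soup.find_all('a', href=True):
    # Resolve the link relative to the main page and fetch it
    sub_url = urlparse.urljoin(main_url, link['href'])
    sub_soup = BeautifulSoup(urllib2.urlopen(sub_url), 'html.parser')
    # Keep every link found on the sub-page
    for sub_link in sub_soup.find_all('a', href=True):
        collected.append(urlparse.urljoin(sub_url, sub_link['href']))

print(collected)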
MinimalMaximizer

Have you tried Selenium or urllib? urllib is faster than Selenium. See http://useful-snippets.blogspot.in/2012/02/simple-website-crawler-with-selenium.html for a simple website crawler built with Selenium.


You can use Beautiful Soup for parsing: http://www.crummy.com/software/BeautifulSoup/

More information is in the docs here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

I wouldn't suggest Scrapy, because you don't need it for the work you described in your question.

For example, this code uses the urllib2 library to open the Google homepage and find all links in it, returned as a list:

import urllib2
from bs4 import BeautifulSoup

# Download the page and parse it
data = urllib2.urlopen('http://www.google.com').read()
soup = BeautifulSoup(data, 'html.parser')

# Print every <a> tag found on the page
print(soup.find_all('a'))

For handling Excel files, take a look at http://www.python-excel.org
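
As a sketch, if you go with xlwt (one of the libraries listed there), writing the collected URLs into an .xls file could look like this (the URL list is a placeholder):

import xlwt

urls = ['http://example.com/a', 'http://example.com/b']  # placeholder data

workbook = xlwt.Workbook()
sheet = workbook.add_sheet('Links')

# Write one URL per row in the first column
for row, url in enumerate(urls):
    sheet.write(row, 0, url)

workbook.save('links.xls')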

Abhishek