
I've been working on some Python code to get links to social media accounts from government websites, for research into how easily municipalities can be contacted. I've managed to adapt some code to work in Python 2.7; it prints all links to Facebook, Twitter, LinkedIn and Google+ present on a given input website. The issue I'm currently experiencing is that I'm not looking for links on just one web page, but on a list of about 200 websites that I have in an Excel file. I have no experience with importing these sorts of lists into Python, so I was wondering if anybody could take a look at the code and suggest a proper way to set all these web pages as the `base_url`, if possible.

import cookielib

import mechanize

base_url = "http://www.amsterdam.nl"

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.set_handle_redirect(True)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent',
              'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(base_url, timeout=10)

links = {}
for link in br.links():
    if link.url.find('facebook')>=0 or link.url.find('twitter')>=0 or link.url.find('linkedin')>=0 or link.url.find('plus.google')>=0:
        links[link.url] = {'count': 1, 'texts': [link.text]}

# printing
for link, data in links.iteritems():
    print "%s - %s - %s - %d" % (base_url, link, ",".join(data['texts']), data['count'])
Stefan

1 Answer


You mentioned that you have an Excel file with the list of all the websites, right? You can export the Excel file as a CSV file, which you can then read values from in your Python code.

Here's some more information regarding that.

Here's how to work directly with Excel files.
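
If you'd rather skip the CSV export, here's a minimal sketch using the xlrd library; the file name `urls.xls` and the assumption that the URLs sit one per row in the first column of the first sheet are mine, so adjust to your layout:

import xlrd

# Hypothetical file name; change the sheet and column indices to match your file
book = xlrd.open_workbook('urls.xls')
sheet = book.sheet_by_index(0)

# One URL per row in the first column; strip stray whitespace around each entry
links = [sheet.cell_value(row, 0).strip() for row in range(sheet.nrows)]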

You can do something along these lines:

import csv

links = []

with open('urls.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    # Simple example where only a single column of URLs is present.
    # csv.reader yields each row as a list, so flatten the rows into one
    # flat list of URLs and strip any stray whitespace around each entry
    for row in csv_reader:
        links.extend(url.strip() for url in row)
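
For reference, assuming `urls.csv` (the file name is just a placeholder) lists one URL per line, with no quotation marks:

http://www.amsterdam.nl
http://www.rotterdam.nl
http://www.denhaag.nl

this leaves `links` equal to `['http://www.amsterdam.nl', 'http://www.rotterdam.nl', 'http://www.denhaag.nl']`.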

Now `links` is a flat list of all the URLs. You can then loop over the list inside a function which fetches each page and scrapes the data.

def extract_social_links(links):
    for url in links:
        br = mechanize.Browser()
        cj = cookielib.LWPCookieJar()
        br.set_cookiejar(cj)
        br.set_handle_robots(False)
        br.set_handle_equiv(False)
        br.set_handle_redirect(True)
        br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
        br.addheaders = [('User-agent',
          'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
        page = br.open(url, timeout=10)

        # Use a fresh dict per page, and don't call it `links`,
        # which would shadow the function argument
        social_links = {}
        for link in br.links():
            if link.url.find('facebook') >= 0 or link.url.find('twitter') >= 0 or link.url.find('linkedin') >= 0 or link.url.find('plus.google') >= 0:
                social_links[link.url] = {'count': 1, 'texts': [link.text]}

        # print one line per social media link found on this page
        for social_link, data in social_links.iteritems():
            print "%s - %s - %s - %d" % (url, social_link, ",".join(data['texts']), data['count'])

As an aside, you should probably split your if conditions to make them more readable.
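
For example, a minimal sketch using a tuple of keywords and any() (the constant name is just illustrative):

SOCIAL_KEYWORDS = ('facebook', 'twitter', 'linkedin', 'plus.google')

if any(keyword in link.url for keyword in SOCIAL_KEYWORDS):
    social_links[link.url] = {'count': 1, 'texts': [link.text]}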

Bhargav
  • Okay, that makes sense, to just add all links to a list, thanks. One error I've encountered with your code is: `Traceback (most recent call last): File "C:\Users\Stefan\Documents\Research Project GRS 50806\Preexisting data\Test - kopie.py", line 11, in data.links(list(csv_reader)) NameError: name 'data' is not defined` Is there any way to define `data` so the syntax is valid? – Stefan Jan 11 '16 at 11:25
  • @StefanFörch my bad, when I edited the answer I did not edit all instances of the list `links`. I'm editing it now. That line there is to add the list of links to the empty list we previously defined to be used later. – Bhargav Jan 11 '16 at 11:28
  • Okay, thanks very much, the code that adds the items from the CSV file to a list seems to be working, since `print links` gives `[['http://www.amsterdam.nl/', ' http://www.rotterdam.nl/', ' http://www.denhaag.nl/', ' http://www.utrecht.nl']]`. However, when using the entire code, it just doesn't seem to run. I'm assuming this might have something to do with the link format; however, I don't see how the document should be formatted, e.g. should quotation marks be used, should it be a list with values in a single column, etc.? – Stefan Jan 11 '16 at 12:07
  • @StefanFörch did you call the function with the list of links? If so, try printing the URL being crawled in each iteration by adding a `print link` right after the for loop line. – Bhargav Jan 11 '16 at 13:37
  • What I've done is described in the code given here: https://gist.github.com/anonymous/f8ad503f32f282cc3489 The error given is `TypeError: expected string or buffer` – Stefan Jan 11 '16 at 14:08
  • Judging by your previous comment, I think you need to change `for link in links:` to `for link in links[0]:` and try again. – Bhargav Jan 11 '16 at 14:11
  • Or you can change the line `links.extend(list(csv_reader))` to `links = list(csv_reader)` – Bhargav Jan 11 '16 at 14:11
  • I've changed `for link in links:` to `for link in links[0]:`, and changing the number in the brackets allows me to check one specific webpage in the CSV file. Is there also a way to check all webpages at once, or at least without having to change the code in between? – Stefan Jan 12 '16 at 18:49
  • Have a look at this: https://gist.github.com/bIgBV/30ec444f5d7cf45bc616 . The thing is, make sure that the list `links` actually contains all the links itself, instead of being a nested list `[['url1', 'url2'....`. What you need to pass to the function `extract_social_links` is a list of links, not a list of lists of links. – Bhargav Jan 12 '16 at 18:53
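
(A minimal sketch of the flattening described in that last comment, assuming `links` still has the nested shape shown above:)

flat_links = [url.strip() for row in links for url in row]
extract_social_links(flat_links)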