I've been working on some Python code to extract links to social media accounts from government websites, for research into how easily municipalities can be contacted. I've managed to adapt some code to work in Python 2.7; it prints all links to Facebook, Twitter, LinkedIn and Google+ found on a given input website. The issue I'm currently running into is that I don't want to look for links on just one web page, but on a list of about 200 websites that I have in an Excel file. I have no experience with importing this sort of list into Python, so I was wondering if anybody could take a look at the code and suggest a proper way to loop over all these web pages as the base_url:
import cookielib
import mechanize

base_url = "http://www.amsterdam.nl"

# set up a mechanize browser with a cookie jar and a browser-like User-Agent
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.set_handle_redirect(True)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent',
                  'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

page = br.open(base_url, timeout=10)

# collect every link on the page that points to a social media profile
links = {}
for link in br.links():
    if ('facebook' in link.url or 'twitter' in link.url
            or 'linkedin' in link.url or 'plus.google' in link.url):
        if link.url in links:
            links[link.url]['count'] += 1
            links[link.url]['texts'].append(link.text)
        else:
            links[link.url] = {'count': 1, 'texts': [link.text]}

# print one line per social media link found
for link, data in links.iteritems():
    print "%s - %s - %s - %d" % (base_url, link, ",".join(data['texts']), data['count'])