
I'm trying to find a series of URLs (twitter links) in the source of a page and then put them into a list in a text document. The problem I have is that once I call .readlines() on the urlopen object, I get a grand total of 3-4 lines, each consisting of dozens of urls that I need to collect one-by-one. This is the snippet of my code where I try to rectify this:

page = html.readlines()
for line in page:
    ind_start = line.find('twitter')
    ind_end = line.find('</a>', ind_start + 1)
    while 'twitter' in line[ind_start:ind_end]:
        output.write(line[ind_start:ind_end] + "\n")
        ind_start = line.find('twitter', ind_start)
        ind_end = line.find('</a>', ind_start + 1)
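
For reference, one reason the loop as written never makes progress: `line.find('twitter', ind_start)` can return the same `ind_start` it was given, so the while loop repeats (or, if the first `find` returns -1, nothing is ever written). A minimal corrected sketch of the same string-scanning approach (illustrative only; a real HTML parser is still the better tool):

```python
line = '<a href="http://twitter.com/a">x</a> <a href="http://twitter.com/b">y</a>'
urls = []
ind_start = line.find('twitter')
while ind_start != -1:
    ind_end = line.find('</a>', ind_start + 1)
    if ind_end == -1:
        break
    urls.append(line[ind_start:ind_end])
    # key fix: resume the search *after* the current match, not at ind_start,
    # otherwise find() returns the same index forever
    ind_start = line.find('twitter', ind_end)
print(urls)  # → ['twitter.com/a">x', 'twitter.com/b">y']
```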

Unfortunately I can't extract any urls using this. Any advice?

Emir
  • Oh no! Text scraping is like Perl in 1995! (Read: don't do it.) Instead, use a modeling agent for the domain - in this case, HTML/DOM. I have heard of things like [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) .. –  Jan 27 '13 at 20:51

2 Answers


You can extract links using lxml and an XPath expression:

from lxml.html import parse

p = parse('http://domain.tld/path')
for link in p.xpath('.//a/@href'):
    if "twitter" in link:
        print link, "match 'twitter'"
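
The filter can also be pushed into the XPath expression itself with `contains()`, instead of a Python `if` (a sketch, assuming lxml is installed; `fromstring` is used here so the example runs on a literal string rather than a live URL):

```python
from lxml.html import fromstring

doc = fromstring('<a href="http://twitter.com/user">t</a>'
                 '<a href="http://example.com">e</a>')

# contains(@href, "twitter") keeps only matching links, all inside the query
links = doc.xpath('.//a[contains(@href, "twitter")]/@href')
print(links)  # → ['http://twitter.com/user']
```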

Using a regex here is not the best way: parsing HTML is a solved problem in 2013. See RegEx match open tags except XHTML self-contained tags

Gilles Quénot
  • That's why I edited my answer to include a BeautifulSoup solution :) - it's clear, easy to understand, and to use it you only need to know the name of one method - findAll ;) – Marek M. Jan 27 '13 at 21:40
  • When you learn XPath, you learn (IMHO) the best query language for both HTML and XML. This knowledge will serve you in other languages too. Moreover, XPath is easy for simple queries like this one. Search Google (here or elsewhere) and you will find tons of examples/tutorials. – Gilles Quénot Jan 28 '13 at 07:50

You could use the BeautifulSoup module:

from bs4 import BeautifulSoup

soup = BeautifulSoup('your html', 'html.parser')
elements = soup.findAll('a')

for el in elements:
    print el['href']
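
A sketch of filtering for the twitter links directly in the query (assumes `bs4` is installed; `find_all` accepts a callable for the `href` attribute filter):

```python
from bs4 import BeautifulSoup

html_doc = ('<a href="http://twitter.com/user">t</a>'
            '<a href="http://example.com">e</a>')
soup = BeautifulSoup(html_doc, 'html.parser')

# href=<callable> lets BeautifulSoup do the filtering for us
twitter_links = [a['href'] for a in
                 soup.find_all('a', href=lambda h: h and 'twitter' in h)]
print(twitter_links)  # → ['http://twitter.com/user']
```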

If you can't use BeautifulSoup, just use a regexp:

import re

expression = re.compile(r'http://\S+')
m = expression.search('your string')

if m:
    print 'match found!'

This would also match the urls within <img /> tags, but you can easily tweak my solution to only find urls within <a /> tags
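
A sketch of that tweak, restricting the pattern to the `href` attribute of `<a>` tags (the pattern is deliberately naive and will break on single-quoted or unquoted attributes - a real parser remains the safer choice):

```python
import re

html_text = ('<a href="http://twitter.com/user">t</a>'
             '<img src="http://example.com/pic.png" />')

# findall returns the captured group: the URL inside the <a> tag's href.
# The <img> URL is skipped because the pattern requires an <a> tag.
links = re.findall(r'<a\s+[^>]*href="(http://[^"]+)"', html_text)
print(links)  # → ['http://twitter.com/user']
```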

Marek M.
  • Consider placing the "best-practice" answer first or otherwise emphasizing it. (Regular expressions are little better than the original code.) –  Jan 28 '13 at 00:03