
NO library...

I'm trying to get all the link titles from a webpage. The code is as follows:

import re
import urllib.request

url="http://einstein.biz/"
m = urllib.request.urlopen(url)
msg = m.read()
titleregex=re.compile('<a\s*href=[\'|"].*?[\'"].*?>(.+?)</a>')
titles = titleregex.findall(str(msg))
print(titles)

The titles are

['Photo Gallery', 'Bio', 'Quotes', 'Links', 'Contact', 'official store', '\\xe6\\x97\\xa5\\xe6\\x9c\\xac\\xe8\\xaa\\x9e', '<img\\n\\t\\tsrc="http://corbisrightsceleb.122.2O7.net/b/ss/corbisrightsceleb/1/H.14--NS/0"\\n\\t\\theight="1" width="1" border="0" alt="" />']

This is not ideal; I would like to get only the following:

['Photo Gallery', 'Bio', 'Quotes', 'Links', 'Contact', 'official store']

How do I revise my code?

Cœur
  • Replace `(.+?)` in your re pattern with something like `([\w\s]+)` – kums Oct 31 '14 at 07:18
  • It is really hard to use regexes to parse HTML code. Regexes (and particularly Python regexes) do not cope well with nested structure. But [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) is a nice tool to parse HTML ... – Serge Ballesta Oct 31 '14 at 07:19
  • Obligatory link to [why you shouldn't parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – miles82 Oct 31 '14 at 07:37
  • @kums thanks! You are right, it works. Do you want to write an answer so I can accept it? – 3414314341 Oct 31 '14 at 07:57
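To illustrate kums's suggestion: `([\w\s]+)` only matches word characters and whitespace, so an anchor whose content is a nested tag cannot match. A minimal sketch against an invented inline HTML sample (not the real page markup):

```python
import re

# Invented sample: a plain text link plus an image-only anchor.
html = '<a href="/bio">Bio</a> <a href="/x"><img src="pix.png" /></a>'

# [\w\s]+ rejects '<' and '>', so the nested <img> anchor is skipped.
pattern = re.compile(r'<a\s*href=[\'"].*?[\'"].*?>([\w\s]+)</a>')
print(pattern.findall(html))  # ['Bio']
```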

3 Answers


I'd definitely look into BeautifulSoup, as @Serge mentioned. To make it more convincing, I've included code that does exactly what you need.

from bs4 import BeautifulSoup

soup = BeautifulSoup(msg, "html.parser")  # Feed BeautifulSoup your HTML (msg from your question).
for link in soup.find_all('a'):           # Look at all the <a> tags.
    print(link.string)                    # Print out each link's text.

returns

Photo Gallery
Bio
Quotes
Links
Contact
official store
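One caveat: `link.string` is `None` when an anchor's only content is another tag (such as the tracking `<img>` on this page), so a `None` can appear in the output. A minimal sketch of filtering those out, using an invented inline HTML sample:

```python
from bs4 import BeautifulSoup

# Invented inline sample standing in for the fetched page.
html = '<a href="/bio">Bio</a><a href="/t"><img src="p.gif"/></a>'
soup = BeautifulSoup(html, "html.parser")

# .string is None unless the tag contains exactly one string child,
# so filtering on it drops the image-only anchor.
titles = [a.string for a in soup.find_all("a") if a.string]
print(titles)  # ['Bio']
```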
Rohit

You should use BeautifulSoup when dealing with HTML or XML files.

>>> url="http://einstein.biz/"
>>> import urllib.request
>>> m = urllib.request.urlopen(url)
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(m, "html.parser")
>>> s = soup.find_all('a')
>>> [i.string for i in s]
['Photo Gallery', 'Bio', 'Quotes', 'Links', 'Contact', 'official store', '日本語', None]

Update:

>>> import re
>>> import urllib.request
>>> url="http://einstein.biz/"
>>> m = urllib.request.urlopen(url)
>>> msg = m.read()
>>> regex = re.compile(r'(?s)<a\s*href=[\'"].*?[\'"][^<>]*>([A-Za-z][^<>]*)</a>')
>>> titles = regex.findall(str(msg))
>>> print(titles)
['Photo Gallery', 'Bio', 'Quotes', 'Links', 'Contact', 'official store']
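The key changes in that pattern: `[A-Za-z]` forces the captured text to start with an ASCII letter (which excludes the Japanese title and the image anchor), and `[^<>]*` keeps nested tags out of the match. A sketch against an invented inline sample:

```python
import re

# Invented sample: a plain link, a non-Latin title, and an image anchor.
html = ('<a href="/bio">Bio</a>'
        '<a href="/jp">日本語</a>'
        '<a href="/t"><img src="p.gif" /></a>')

# [A-Za-z] requires an ASCII letter first; [^<>]* forbids nested tags.
regex = re.compile(r'(?s)<a\s*href=[\'"].*?[\'"][^<>]*>([A-Za-z][^<>]*)</a>')
print(regex.findall(html))  # ['Bio']
```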
Avinash Raj

I prefer lxml.html to BeautifulSoup; it supports XPath and CSS selectors.

import requests
import lxml.html

res = requests.get("http://einstein.biz/")
doc = lxml.html.fromstring(res.content)
links = doc.cssselect("a")
for l in links:
    print(l.text)
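The same idea works with an XPath query: `text()` returns only direct text nodes, so image-only anchors drop out automatically. A sketch against an invented inline sample:

```python
import lxml.html

# Invented inline sample standing in for the fetched page.
html = '<a href="/bio">Bio</a><a href="/t"><img src="p.gif"/></a>'
doc = lxml.html.fromstring(html)

# //a/text() selects the text children of every <a>; the image-only
# anchor has none, so it contributes nothing to the result.
print(doc.xpath("//a/text()"))  # ['Bio']
```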
shoma