
NO library...

I'm trying to get all the link titles from a webpage. The code is as follows:

import re
import urllib.request

url="http://einstein.biz/"
m = urllib.request.urlopen(url)
msg = m.read()
titleregex=re.compile('<a\s*href=[\'|"].*?[\'"].*?>(.+?)</a>')
titles = titleregex.findall(str(msg))
print(titles)

The titles are

['Photo Gallery', 'Bio', 'Quotes', 'Links', 'Contact', 'official store', '\\xe6\\x97\\xa5\\xe6\\x9c\\xac\\xe8\\xaa\\x9e', '<img\\n\\t\\tsrc="http://corbisrightsceleb.122.2O7.net/b/ss/corbisrightsceleb/1/H.14--NS/0"\\n\\t\\theight="1" width="1" border="0" alt="" />']

This is not ideal; I would like to get only the following:

['Photo Gallery', 'Bio', 'Quotes', 'Links', 'Contact', 'official store']

How do I revise my code?

Cœur
  • Replace `(.+?)` in your re pattern with something like `([\w\s]+)` – kums Oct 31 '14 at 07:18
  • It is really hard to use regexes to parse HTML code. Regexes (and particularly Python regexes) do not cope well with nested structure. But [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) is a nice tool to parse HTML ... – Serge Ballesta Oct 31 '14 at 07:19
  • Obligatory link to [why you shouldn't parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – miles82 Oct 31 '14 at 07:37
  • @kums thanks! You are right, it works. Do you want to write an answer so I can accept it? – 3414314341 Oct 31 '14 at 07:57
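To illustrate kums's suggestion: `([\w\s]+)` only matches word characters and whitespace, so an anchor whose content is a nested tag cannot match. A minimal sketch against an invented inline HTML sample (not the real page markup):

```python
import re

# Invented sample: a plain text link plus an image-only anchor.
html = '<a href="/bio">Bio</a> <a href="/x"><img src="pix.png" /></a>'

# [\w\s]+ rejects '<' and '>', so the nested <img> anchor is skipped.
pattern = re.compile(r'<a\s*href=[\'"].*?[\'"].*?>([\w\s]+)</a>')
print(pattern.findall(html))  # ['Bio']
```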

3 Answers


I'd definitely look into BeautifulSoup, as @Serge mentioned. To make it more convincing, I've included code that does exactly what you need.

from bs4 import BeautifulSoup

soup = BeautifulSoup(msg, "html.parser")  # Feed BeautifulSoup your HTML (msg from your question).
for link in soup.find_all('a'):           # Look at all the <a> tags.
    print(link.string)                    # Print out each link's text.

returns

Photo Gallery
Bio
Quotes
Links
Contact
official store
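One caveat: `link.string` is `None` when an anchor's only content is another tag (such as the tracking `<img>` on this page), so a `None` can appear in the output. A minimal sketch of filtering those out, using an invented inline HTML sample:

```python
from bs4 import BeautifulSoup

# Invented inline sample standing in for the fetched page.
html = '<a href="/bio">Bio</a><a href="/t"><img src="p.gif"/></a>'
soup = BeautifulSoup(html, "html.parser")

# .string is None unless the tag contains exactly one string child,
# so filtering on it drops the image-only anchor.
titles = [a.string for a in soup.find_all("a") if a.string]
print(titles)  # ['Bio']
```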
Rohit

You should use BeautifulSoup when dealing with HTML or XML files.

>>> url="http://einstein.biz/"
>>> import urllib.request
>>> m = urllib.request.urlopen(url)
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(m, "html.parser")
>>> s = soup.find_all('a')
>>> [i.string for i in s]
['Photo Gallery', 'Bio', 'Quotes', 'Links', 'Contact', 'official store', '日本語', None]

Update:

>>> import re
>>> import urllib.request
>>> url="http://einstein.biz/"
>>> m = urllib.request.urlopen(url)
>>> msg = m.read()
>>> regex = re.compile(r'(?s)<a\s*href=[\'"].*?[\'"][^<>]*>([A-Za-z][^<>]*)</a>')
>>> titles = regex.findall(str(msg))
>>> print(titles)
['Photo Gallery', 'Bio', 'Quotes', 'Links', 'Contact', 'official store']
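The key changes in that pattern: `[A-Za-z]` forces the captured text to start with an ASCII letter (which excludes the Japanese title and the image anchor), and `[^<>]*` keeps nested tags out of the match. A sketch against an invented inline sample:

```python
import re

# Invented sample: a plain link, a non-Latin title, and an image anchor.
html = ('<a href="/bio">Bio</a>'
        '<a href="/jp">日本語</a>'
        '<a href="/t"><img src="p.gif" /></a>')

# [A-Za-z] requires an ASCII letter first; [^<>]* forbids nested tags.
regex = re.compile(r'(?s)<a\s*href=[\'"].*?[\'"][^<>]*>([A-Za-z][^<>]*)</a>')
print(regex.findall(html))  # ['Bio']
```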
Avinash Raj

I prefer lxml.html to BeautifulSoup; it supports XPath and CSS selectors.

import requests
import lxml.html

res = requests.get("http://einstein.biz/")
doc = lxml.html.fromstring(res.content)
links = doc.cssselect("a")
for l in links:
    print(l.text)
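The same idea works with an XPath query: `text()` returns only direct text nodes, so image-only anchors drop out automatically. A sketch against an invented inline sample:

```python
import lxml.html

# Invented inline sample standing in for the fetched page.
html = '<a href="/bio">Bio</a><a href="/t"><img src="p.gif"/></a>'
doc = lxml.html.fromstring(html)

# //a/text() selects the text children of every <a>; the image-only
# anchor has none, so it contributes nothing to the result.
print(doc.xpath("//a/text()"))  # ['Bio']
```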
shoma