Parse HTML with Python

Question

I want to write a Python function that fetches a web page and extracts part of its content, for example the organization shown on the page.

In the HTML, the organization is University of Tokyo:

<tr class="odd">
  <th>Organization:</th>
  <td>University of Tokyo</td>
</tr>

How can I fetch the page content directly, without installing anything new? For example, from http://www.ip-adress.com/ip_tracer/157.123.22.11

@jesseslu Do you need to download the file? Or only parse and access it? — Rudolf Mühlbauer, Oct 11 '12 at 06:46
I think you will have a problem opening this website the way the others suggest. I added an answer that handles this. – root, Oct 11 '12 at 07:26

Rudolf Mühlbauer · Answer 1 · 2012-10-11T07:10:06.120

I like BeautifulSoup; it makes it easy to access data in HTML strings. How hard the job is depends on how the HTML is structured. If the elements carry 'id' and 'class' attributes, it is easy. If not, you have to rely on position, like "take the first div, the second list item, ...", which breaks easily if the page's HTML changes a lot.

To download the HTML, I quote the example from the BeautifulSoup docs:

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print

how can i directly get the website content without any new installation like get http://www.ip-adress.com/ip_tracer/157.123.22.11 — AntiGMO, Oct 11 '12 at 07:00

score 2 · Answer 2 · 2012-10-11T07:04:36.890


Use BeautifulSoup:

import bs4

html = """<tr class="odd">
  <th>Organization:</th>
  <td>University of Tokyo</td>
</tr>
"""
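# Name the parser explicitly so the result does not depend on which parsers are installed
soup = bs4.BeautifulSoup(html, "html.parser")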
univ = soup.tr.td.getText()
assert univ == u"University of Tokyo"
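
Since the question asks for a solution without any new installation, here is a minimal sketch that reads the same row using only the standard library. It assumes Python 3 (where the module is html.parser) and that the value always sits in the first <td> after the matching <th>. The names LabelValueParser and find_value are made up for this illustration and are not part of any library.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Collect the text of the first <td> that follows a <th> with a given label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.value = None          # set once the matching <td> has been read
        self._tag = None           # cell currently being read: "th", "td" or None
        self._label_seen = False   # True right after the matching <th> closed
        self._buf = []             # text pieces of the current cell

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        # Only keep text that appears inside a <th> or <td>
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buf).strip()
        if tag == "th":
            self._label_seen = (text == self.label)
        elif tag == "td" and self._label_seen and self.value is None:
            self.value = text
            self._label_seen = False
        self._tag = None


def find_value(html, label="Organization:"):
    """Return the cell text next to `label`, or None if the label is absent."""
    parser = LabelValueParser(label)
    parser.feed(html)
    parser.close()
    return parser.value


html = """<tr class="odd">
  <th>Organization:</th>
  <td>University of Tokyo</td>
</tr>
"""
print(find_value(html))
```

This is more code than the BeautifulSoup version, which is the usual trade-off for avoiding a third-party package.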

Edit:

If you need to read the HTML first, use urllib2:

import urllib2

html = urllib2.urlopen("http://example.com/").read()

edited Oct 11 '12 at 07:04

answered Oct 11 '12 at 06:50

how can i directly get the website content without any new installation like get http://www.ip-adress.com/ip_tracer/157.123.22.11 – AntiGMO Oct 11 '12 at 06:59
See my edit for how to read the contents. – Oct 11 '12 at 07:05
Don't use `urllib2`! Use `requests` instead. – avramov Oct 11 '12 at 07:42
@egasimus Requests is nice but it's not part of the Python Standard Library. – Oct 11 '12 at 07:43

root · Accepted Answer · 2012-10-11T17:09:12.287

0

You will get a 403 Forbidden error using urllib2.urlopen, because this website filters access by checking whether the request comes from a recognised user agent. So here's the full thing:

import urllib2
import lxml.html as lh

req = urllib2.Request("http://www.ip-adress.com/ip_tracer/157.123.22.11", headers={'User-Agent' : "Magic Browser"})
html = urllib2.urlopen(req).read()
doc = lh.fromstring(html)
print ''.join(doc.xpath('.//*[@class="odd"]')[-1].text_content().split())
>>> 
Organization:ZenithDataSystems

edited Oct 11 '12 at 17:09

answered Oct 11 '12 at 07:19

Hi, when I run it, it shows: import lxml.html as lh ImportError: No module named lxml.html – AntiGMO Oct 15 '12 at 03:10
What does lxml.html stand for? – AntiGMO Oct 15 '12 at 03:11
Thanks, after install lxml, it still has error Traceback (most recent call last): File "ext.py", line 2, in ? import lxml.html as lh File "/usr/lib64/python2.4/site-packages/lxml/html/__init__.py", line 42, in ? from lxml import etree ImportError: /usr/lib64/python2.4/site-packages/lxml/etree.so: undefined symbol: xmlMemDisplayLast – AntiGMO Oct 15 '12 at 08:25
yes, i'm using Python 2.4.3. using centos 5.5 – AntiGMO Oct 16 '12 at 01:47
Thanks. My company installed Python 3 on the system, but they say they can't remove the old version for some reason. The new Python 3 is at /usr/local/python3.2.3/bin/python3, so how can I run it? – AntiGMO Oct 17 '12 at 07:29
Hi, i run it using python3, but it shows [jesse@CLiMB log]$ /usr/local/python3.2.3/bin/python3 ext.py File "ext.py", line 6 print ''.join(doc.xpath('.//*[@class="odd"]')[-1].text_content().split()) ^ SyntaxError: invalid syntax – AntiGMO Oct 17 '12 at 07:57
In Python 3, print is a function, so use print(), as in print(''.join(doc.xpath('.//*[@class="odd"]')[-1].text_content().split())) – root Oct 17 '12 at 08:00
Thanks, it shows Traceback (most recent call last): File "ext.py", line 1, in import urllib2 ImportError: No module named urllib2 – AntiGMO Oct 17 '12 at 08:02
I have already change urllib2 to urllib but it shows Traceback (most recent call last): File "ext.py", line 3, in req= urllib.Requset("http:// www.ip-address.com/ip_tracer/157.123.22.11", headers={'User-Agent' : "Magic Browser"}) AttributeError: 'module' object has no attribute 'Requset' – AntiGMO Oct 17 '12 at 08:08
Thanks, but it also shows Traceback (most recent call last): File "ext.py", line 3, in req= urllib.Request("http:// www.ip-address.com/ip_tracer/157.123.22.11", headers={'User-Agent' : "Magic Browser"}) AttributeError: 'module' object has no attribute 'Request' – AntiGMO Oct 17 '12 at 08:13
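
On the two AttributeErrors above: in Python 3, urllib2 was split up, and Request and urlopen now live in urllib.request rather than in the top-level urllib package. The following is a minimal sketch of the download step under that assumption, written for a current Python 3 and using only the standard library. The helper names build_request and fetch are made up for this illustration, and the "Magic Browser" string is simply the value from the accepted answer.

```python
import urllib.request


def build_request(url, user_agent="Magic Browser"):
    """Return a Request carrying a User-Agent header; nothing is sent yet."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download `url` and return the response body decoded as text."""
    with urllib.request.urlopen(build_request(url)) as response:
        # Fall back to UTF-8 when the server does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Building the request needs no network access; fetch(url) would send it.
req = build_request("http://www.ip-adress.com/ip_tracer/157.123.22.11")
print(req.get_header("User-agent"))
```

Calling fetch(...) on the URL returns the page as a string, which can then be handed to whatever parser is available. Whether the site still accepts this user agent is not something the sketch can guarantee.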
