0

I want to write a Python function that fetches a web page and extracts part of its content, for example the organization shown on the page.

In the page's HTML, the organization is University of Tokyo:

<tr class="odd">
  <th>Organization:</th>
  <td>University of Tokyo</td>
</tr>

How can I get the website content directly, without installing anything new, for example from http://www.ip-adress.com/ip_tracer/157.123.22.11?

AntiGMO

3 Answers

3

I like BeautifulSoup; it makes it easy to access data in HTML strings. How hard the extraction is depends on how the HTML is structured. If the HTML uses 'id's and 'class'es, it is easy. If not, you have to rely on position, like "take the first div, then the second list item, ...", which breaks easily if the HTML of the page changes a lot.
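
The difference between the two approaches can be sketched with the standard library alone. This is only an illustration under assumptions: it uses xml.etree.ElementTree instead of BeautifulSoup, which works here only because the sample markup is well-formed, and the 'id' attributes, the extra rows, and the function names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the id attributes and the "Country"/"IP" rows are
# invented for illustration and are not taken from the real page.
BEFORE = """<table>
  <tr><th>Country:</th><td id="country">Japan</td></tr>
  <tr><th>Organization:</th><td id="organization">University of Tokyo</td></tr>
</table>"""

# The same table after a new row is added at the top.
AFTER = """<table>
  <tr><th>IP:</th><td id="ip">157.123.22.11</td></tr>
  <tr><th>Country:</th><td id="country">Japan</td></tr>
  <tr><th>Organization:</th><td id="organization">University of Tokyo</td></tr>
</table>"""


def by_id(markup):
    # Look the cell up by its id attribute; row order does not matter.
    return ET.fromstring(markup).find(".//td[@id='organization']").text


def by_position(markup):
    # Take the cell of the second row; this depends on the row order.
    return ET.fromstring(markup).findall(".//tr")[1].find("td").text


print(by_id(BEFORE), "|", by_id(AFTER))
print(by_position(BEFORE), "|", by_position(AFTER))
```

The lookup by 'id' returns the same cell for both tables, while the positional lookup silently returns a different cell once a row is added.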

To download the HTML, I quote the example from the BeautifulSoup docs:

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print
Rudolf Mühlbauer
  • How can I get the website content directly, without installing anything new, for example from http://www.ip-adress.com/ip_tracer/157.123.22.11? – AntiGMO Oct 11 '12 at 07:00
2

Use BeautifulSoup:

import bs4

html = """<tr class="odd">
  <th>Organization:</th>
  <td>University of Tokyo</td>
</tr>
"""
soup = bs4.BeautifulSoup(html, "html.parser")  # name the parser so the result does not depend on what is installed
univ = soup.tr.td.getText()
assert univ == u"University of Tokyo"

Edit:

If you need to read the HTML first, use urllib2:

import urllib2

html = urllib2.urlopen("http://example.com/").read()
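
If installing bs4 is not an option, which is the constraint the asker mentions, the same lookup can be approximated with the standard library. The following is a minimal sketch, assuming Python 3 (where the module is html.parser; in Python 2 it is HTMLParser) and assuming the value always sits in the td that follows the matching th. The class and function names are made up for this example.

```python
from html.parser import HTMLParser

HTML_SNIPPET = """<tr class="odd">
  <th>Organization:</th>
  <td>University of Tokyo</td>
</tr>
"""


class LabelValueParser(HTMLParser):
    """Collect the text of the first <td> that follows a <th> with a given label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.current_tag = None   # "th" or "td" while inside such a cell
        self.label_seen = False   # True once the matching <th> has been read
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self.current_tag = tag

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore whitespace between tags
        if self.current_tag == "th":
            self.label_seen = (text == self.label)
        elif self.current_tag == "td" and self.label_seen and self.value is None:
            self.value = text
            self.label_seen = False


def extract_value(html, label="Organization:"):
    # Returns None when the label is not found.
    parser = LabelValueParser(label)
    parser.feed(html)
    parser.close()
    return parser.value


print(extract_value(HTML_SNIPPET))
```

This is more code than the BeautifulSoup version and it is less forgiving of unusual markup, so it is mainly worth it when no third-party package can be installed.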
0

Calling urllib2.urlopen on this URL directly gives a 403 Access Forbidden error, because the website filters requests by checking whether they come from a recognised user agent. So you need to send a User-Agent header yourself. Here's the full thing:

import urllib2
import lxml.html as lh

req = urllib2.Request("http://www.ip-adress.com/ip_tracer/157.123.22.11", headers={'User-Agent' : "Magic Browser"})
html = urllib2.urlopen(req).read()
doc = lh.fromstring(html)
print ''.join(doc.xpath('.//*[@class="odd"]')[-1].text_content().split())
>>> 
Organization:ZenithDataSystems
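
Note that the output above has no spaces: ''.join(....split()) removes every run of whitespace, including the spaces inside the organization name. A small sketch of the difference, using a made-up cell text (the real page content is not shown in this thread, so the string below is an assumption):

```python
# Hypothetical text content of the table row; the real page is not shown here.
cell_text = "\n  Organization:\n  Zenith Data Systems\n"

# Joining with an empty string drops all whitespace, as in the answer's output.
squashed = "".join(cell_text.split())

# Joining with a single space keeps the words apart.
spaced = " ".join(cell_text.split())

print(squashed)
print(spaced)
```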
root
  • Hi, when I run it, it shows: import lxml.html as lh ImportError: No module named lxml.html – AntiGMO Oct 15 '12 at 03:10
  • What does lxml.html stand for? – AntiGMO Oct 15 '12 at 03:11
  • Thanks. After installing lxml, it still fails: Traceback (most recent call last): File "ext.py", line 2, in ? import lxml.html as lh File "/usr/lib64/python2.4/site-packages/lxml/html/__init__.py", line 42, in ? from lxml import etree ImportError: /usr/lib64/python2.4/site-packages/lxml/etree.so: undefined symbol: xmlMemDisplayLast – AntiGMO Oct 15 '12 at 08:25
  • Yes, I'm using Python 2.4.3 on CentOS 5.5. – AntiGMO Oct 16 '12 at 01:47
  • Thanks. My company installed Python 3 on the system, but they say the old version can't be removed for some reason. The new Python 3 is at /usr/local/python3.2.3/bin/python3, so how can I run it? – AntiGMO Oct 17 '12 at 07:29
  • Hi, I ran it using python3, but it shows: [jesse@CLiMB log]$ /usr/local/python3.2.3/bin/python3 ext.py File "ext.py", line 6 print ''.join(doc.xpath('.//*[@class="odd"]')[-1].text_content().split()) ^ SyntaxError: invalid syntax – AntiGMO Oct 17 '12 at 07:57
  • In Python 3, print is a function; use print(), as in print(''.join(doc.xpath('.//*[@class="odd"]')[-1].text_content().split())) – root Oct 17 '12 at 08:00
  • Thanks, now it shows: Traceback (most recent call last): File "ext.py", line 1, in import urllib2 ImportError: No module named urllib2 – AntiGMO Oct 17 '12 at 08:02
  • I have already changed urllib2 to urllib, but it shows: Traceback (most recent call last): File "ext.py", line 3, in req= urllib.Requset("http:// www.ip-address.com/ip_tracer/157.123.22.11", headers={'User-Agent' : "Magic Browser"}) AttributeError: 'module' object has no attribute 'Requset' – AntiGMO Oct 17 '12 at 08:08
  • Thanks, but it still shows: Traceback (most recent call last): File "ext.py", line 3, in req= urllib.Request("http:// www.ip-address.com/ip_tracer/157.123.22.11", headers={'User-Agent' : "Magic Browser"}) AttributeError: 'module' object has no attribute 'Request' – AntiGMO Oct 17 '12 at 08:13
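
The last errors in these comments come from the Python 3 module layout: urllib2 does not exist there, and Request and urlopen live in urllib.request rather than in urllib itself. Below is a minimal sketch of the download step for Python 3, assuming a reasonably recent Python 3 and assuming the page can be decoded as UTF-8. The function names are made up for this example, and the URL and User-Agent string are the ones from the answer above.

```python
import urllib.request

URL = "http://www.ip-adress.com/ip_tracer/157.123.22.11"


def build_request(url, user_agent="Magic Browser"):
    # In Python 3, Request is in urllib.request, not in urllib or urllib2.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent="Magic Browser"):
    # Performs the actual network call and returns the page as text.
    # UTF-8 is an assumption; undecodable bytes are replaced, not raised.
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_html(URL) then returns the HTML as a string, which can be passed to whichever parser is available.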