2

I need a small script in Python that reads a specific block from a web page.

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2

req = urllib2.Request('http://target.com')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page # This prints the whole page source, HTML tags included,
               # but I only need the section from <head> to </head>

# For example, suppose the source of http://target.com is:
# <html>
# <body>
# <head>
# ... need to read this section ...
# </head>
# ... page source ...
# </body>
# </html>

How can I read only that section?

moinudin
Ernie
  • Parse it using an HTML parser such as BeautifulSoup. You will also get *easy* suggestions such as doing it with a regex, but don't make it a habit. Parse it. – user225312 Dec 19 '10 at 18:10
  • You need to parse the HTML/XHTML for this (unless it is a quick one-off script to download something automatically from a site). – khachik Dec 19 '10 at 18:10
  • @sukhbir I second that, it's infinitely more pleasurable (and generally better) to use an HTML parser. – Rafe Kettler Dec 19 '10 at 18:11
  • You should know that the `<head>` and `</head>` tags are both optional. – Josh Lee Dec 19 '10 at 18:55
  • @sukhbir: And here's why: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – John Dec 19 '10 at 19:30

3 Answers

1

To parse HTML, we use a parser, such as BeautifulSoup.

Of course you can parse it using a regular expression, but that is something you should never do. Just because it works in some cases doesn't mean it is the standard or proper way to do it. If you are interested in knowing why, read this excellent answer here on SO.

Start with the BeautifulSoup tutorial and see how to parse the required information. It is pretty easy. We are not going to do it for you; that is for you to read and learn!

Just to give you a heads up, you have the_page which contains the HTML data.

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(the_page)

Now follow the tutorial and see how to get everything within the head tag.

Community
user225312
0

One solution would be to use the awesome Python library Beautiful Soup. It allows you to parse HTML/XML pretty easily, and it will try to help out when the documents are broken or invalid.
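
As a minimal sketch of that suggestion, assuming Python 3 and the `beautifulsoup4` package (imported as `bs4`, the successor of the `BeautifulSoup` module used elsewhere in this thread), the `<head>` element can be pulled out of a string of HTML roughly like this. The sample markup and the helper name `head_of` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page standing in for the downloaded source.
SAMPLE = """<html>
<head>
<title>Example Web Page</title>
</head>
<body><p>page source</p></body>
</html>"""


def head_of(html):
    """Return the <head> Tag of an HTML string, or None if there is none."""
    # "html.parser" selects Python's built-in parser, so no extra
    # parser library is needed beyond bs4 itself.
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("head")


head = head_of(SAMPLE)
print(head)               # the whole <head>...</head> element
print(head.title.string)  # just the text of the <title> tag
```

With the built-in parser, a page that has no `<head>` tag at all makes `head_of` return `None`, so it is worth checking the result before using it.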

mk.
0
from BeautifulSoup import BeautifulSoup
import urllib2

page = urllib2.urlopen('http://www.example.com')
soup = BeautifulSoup(page.read())
print soup.find('head')

outputs

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Example Web Page</title>
</head>
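
The snippet above is Python 2 with BeautifulSoup 3. If installing a third-party package is not an option, a rough Python 3 sketch using only the standard library's `html.parser` might look like the following. This is a different technique from the one in the answer, and the names `HeadExtractor` and `extract_head` are hypothetical. It rebuilds the markup it sees between `<head>` and `</head>`, so the result is a normalised copy and not the original bytes.

```python
from html.parser import HTMLParser


class HeadExtractor(HTMLParser):
    """Collect the markup found between <head> and </head>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_head = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif self.in_head:
            # Keep the start tag exactly as it was written in the source.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are stored once,
        # without a separate closing tag.
        if self.in_head:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif self.in_head:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.in_head:
            self.parts.append(data)


def extract_head(html):
    """Return the contents of <head> as a string ('' if there is no head)."""
    parser = HeadExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


# Hypothetical sample page standing in for the downloaded source.
SAMPLE = (
    "<html><head>\n"
    '<meta charset="utf-8" />\n'
    "<title>Example Web Page</title>\n"
    "</head><body><p>page source</p></body></html>"
)
print(extract_head(SAMPLE))
```

In Python 3 the page source itself would come from `urllib.request.urlopen(url).read()`, which returns bytes that need decoding (for example with `.decode("utf-8")`, assuming the page is UTF-8) before being passed to `extract_head`.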
moinudin