Read HEAD contents from HTML

Question

I need a small script in Python that reads a custom block from a web page.

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2

req = urllib2.Request('http://target.com')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page # This prints the whole page source with HTML tags, but
               # I only need the section from <head> to </head>

# For example, the source of http://target.com is:
# <html>
# <body>
# <head>
# ... need to read this section ...
# </head>
# ... page source ...
# </body>
# </html>

How can I read this custom section?

Parse it using a HTML parser such as BeautifulSoup. You will also get *easy* suggestions such as doing it with a regex, but don't make it a habit. Parse it. — user225312, Dec 19 '10 at 18:10
You need to parse HTML/xhtml for it (if this is not a fast-cooked script to download something automatically from a site once). — khachik, Dec 19 '10 at 18:10
@sukhbir I second that, it's infinitely more pleasurable (and generally better) to use an HTML parser. — Rafe Kettler, Dec 19 '10 at 18:11
@sukhbir: And here's why: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — John, Dec 19 '10 at 19:30

score 1 · Answer 1 · edited May 23 '17 at 12:18

To parse HTML, we use a parser, such as BeautifulSoup.

Of course you can parse it using a regular expression, but that is something you should never do. Just because it works for some cases doesn't mean it is the standard way of doing it or is the proper way of doing it. If you are interested in knowing why, read this excellent answer here on SO.
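As a small illustration of "works for some cases", here is a sketch in Python 3 using only the standard library. The pattern and both sample strings are made up for this example and are not from the question. A naive pattern finds a plain `<head>` but silently misses the same tag written in uppercase with an attribute.

```python
import re

# Hypothetical naive pattern: assumes a bare, lowercase <head> tag.
NAIVE_HEAD = re.compile(r"<head>(.*)</head>", re.DOTALL)

# Two made-up documents with the same head content.
simple = "<html><head><title>A</title></head><body></body></html>"
variant = '<html><HEAD profile="x"><title>A</title></HEAD><body></body></html>'


def naive_head(html):
    """Return the text between <head> and </head>, or None if the pattern misses."""
    match = NAIVE_HEAD.search(html)
    return match.group(1) if match else None


print(naive_head(simple))   # <title>A</title>
print(naive_head(variant))  # None
```

Each patch to the pattern (ignoring case, allowing attributes, skipping comments) handles one more variant. A parser copes with all of these without a hand-written pattern.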

Start with the BeautifulSoup tutorial and see how to parse the required information. It is pretty easy to do it. We are not going to do it for you, that is for you to read and learn!

Just to give you a heads up, you have the_page which contains the HTML data.

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(the_page)

Now follow the tutorial and see how to get everything within the head tag.

score 0 · Answer 2 · answered Dec 19 '10 at 18:28

One solution would be to use the Python library Beautiful Soup. It lets you parse HTML/XML pretty easily, and it tries to cope when the documents are broken or invalid.

mk.

score 0 · Accepted Answer · answered Dec 19 '10 at 18:45

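# Python 2 with BeautifulSoup 3, matching the code in the question
from BeautifulSoup import BeautifulSoup
import urllib2

page = urllib2.urlopen('http://www.example.com')
soup = BeautifulSoup(page.read())
print soup.find('head')  # the first <head> element, tags included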

outputs

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Example Web Page</title>
</head>
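
For readers on Python 3 who do not have Beautiful Soup installed, roughly the same result can be sketched with the standard library's `html.parser`. This is a different technique from the one used above. `HeadExtractor`, `extract_head` and the sample markup are hypothetical names and data introduced only for this sketch, and it returns the contents of the head without the enclosing `<head>` tags.

```python
from html.parser import HTMLParser


class HeadExtractor(HTMLParser):
    """Collect the markup found between <head> and </head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names, so <HEAD> also arrives as "head".
        if tag == "head":
            self.in_head = True
        elif self.in_head:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are kept as written.
        if self.in_head:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif self.in_head:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Text nodes; character references are already decoded here.
        if self.in_head:
            self.parts.append(data)


def extract_head(html):
    """Return everything inside the head element as one string."""
    parser = HeadExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = ('<html><head><meta charset="utf-8" />'
          '<title>Example Web Page</title></head>'
          '<body><p>ignored</p></body></html>')
print(extract_head(sample))
```

The sketch ignores comments and other constructs inside the head, so for messy real-world pages a dedicated library is still the more comfortable choice.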

moinudin

@Ernie Define what you mean by custom text? – moinudin Dec 19 '10 at 18:50
What if I need to read the text from `iv="Cont` to `charset=utf`, rather than the content between HTML tags? – Ernie Dec 19 '10 at 18:53
@Ernie That's very different. You can do it with regular expressions quite easily: `re.search(r'iv="Cont(.*)charset=utf', text).group(1)` – moinudin Dec 19 '10 at 18:56
@Ernie I think you can figure that out from what you've got. – moinudin Dec 19 '10 at 18:59
NameError: name 're' is not defined – Ernie Dec 19 '10 at 19:01
Try removing `import urllib2` from your code sample and rerun. What does this experiment suggest might be the solution to your `name 're' is not defined` error? – PaulMcG Dec 20 '10 at 08:26
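
A minimal sketch of where this comment thread ends up, in Python 3: the `NameError` goes away once `re` is imported. The sample `text` is the meta tag from the output above, and the two markers are read as the literal strings `iv="Cont` and `charset=utf` without surrounding quotes. That reading is an assumption about what Ernie meant.

```python
import re  # without this import, re.search raises NameError

# Sample input: the meta tag shown in the accepted answer's output.
text = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'

# Capture whatever sits between the two literal markers.
# The non-greedy .*? stops at the first "charset=utf" it meets.
match = re.search(r'iv="Cont(.*?)charset=utf', text)

# re.search returns None when nothing matches, so check before calling group().
between = match.group(1) if match else None
print(between)
```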
