How to get the content of an HTML page in Python

Question

I have downloaded a web page into an HTML file. What is the simplest way to get the content of that page? By content, I mean the strings that a browser would display.

To be clear:

Input:

<html><head><title>Page title</title></head>
       <body><p id="firstpara" align="center">This is paragraph <b>one</b>.
       <p id="secondpara" align="blah">This is paragraph <b>two</b>.
       </html>

Output:

Page title This is paragraph one. This is paragraph two.

Putting it together:

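# Beautiful Soup 3 import; under Beautiful Soup 4 the equivalent is
# `from bs4 import BeautifulSoup`.
from BeautifulSoup import BeautifulSoup
import re

def removeHtmlTags(page):
    # Regex approach: match a whole tag, allowing quoted attribute values
    # that may themselves contain '>' characters, and delete it.
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)

def removeHtmlTags2(page):
    # Parser approach: collect every text node in the document.
    soup = BeautifulSoup(page)
    return ''.join(soup.findAll(text=True))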

Python HTML removal
Extracting text from HTML file using Python
What is a light python library that can eliminate HTML tags? (and only text)
Remove HTML tags in AppEngine Python Env (equivalent to Ruby’s Sanitize)
RegEx match open tags except XHTML self-contained tags (famous don't use regex to parse html rant)

which tags are you talking about? could you be more specific? — SilentGhost, Mar 10 '10 at 12:59
are tags. I don't want them I need a actual string that displays on a browser. — Yin Zhu, Mar 10 '10 at 13:03

Oddthinking · Accepted Answer · 2010-03-10T13:51:06.197

12

Parse the HTML with Beautiful Soup.

To get all the text, without the tags, try:

''.join(soup.findAll(text=True))
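
If Beautiful Soup is not installed, the same "collect every text node" idea can be sketched with the standard library's html.parser module. This is a stand-in for the one-liner above, not Beautiful Soup itself; the names TextCollector and html_to_text are hypothetical, and the whitespace collapsing is an assumption about how closely the result should match what a browser shows.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every text node, ignoring the tags around it."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)


def html_to_text(page):
    parser = TextCollector()
    parser.feed(page)
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace, roughly as a browser would.
    return " ".join("".join(parser.chunks).split())


# The sample input from the question.
sample = """<html><head><title>Page title</title></head>
       <body><p id="firstpara" align="center">This is paragraph <b>one</b>.
       <p id="secondpara" align="blah">This is paragraph <b>two</b>.
       </html>"""

print(html_to_text(sample))
# Page title This is paragraph one. This is paragraph two.
```

For a page saved to disk, the file contents can be read into a string first and passed to html_to_text in the same way.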

edited Mar 10 '10 at 13:51

answered Mar 10 '10 at 12:35

Oddthinking

http://www.crummy.com/software/BeautifulSoup/documentation.html I don't see renderContents() function works here. I want to delete the tags. – Yin Zhu Mar 10 '10 at 12:47
@Yin Zhu - Ah, renderContents works on sub-parts, not the whole document. I replaced the technique with one snipped from the documentation. – Oddthinking Mar 10 '10 at 13:52
@Yin Zhu: renderContents occurs 6 times in the referenced documentation. Please use a web browser that supports page search. – S.Lott Mar 10 '10 at 13:52

the Tin Man · Answer 2 · 2013-03-05T17:36:12.330

Personally, I use lxml because it's a swiss-army knife...

from lxml import html

print(html.parse('http://someurl.at.domain').xpath('//body')[0].text_content())

This tells lxml to retrieve the page, locate the <body> tag, then extract and print all the text.

I do a lot of page parsing and a regex is the wrong solution most of the time, unless it's a one-time-only need. If the author of the page changes their HTML you run a good risk of your regex breaking. A parser is a lot more likely to continue working.

The big problem with a parser is learning how to access the sections of the document you are after, but there are a lot of XPath tools you can use inside your browser that simplify the task.
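
For comparison, the "locate <body>, then collect its text" step can be sketched without lxml, using only the standard library. This is a stand-in, not the lxml API: BodyTextParser and body_text are hypothetical names, and skipping the contents of script and style elements is an added assumption about what counts as text a browser would display.

```python
from html.parser import HTMLParser


class BodyTextParser(HTMLParser):
    """Collect text inside <body>, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside <body> and outside script/style.
        if self.in_body and self.skip_depth == 0:
            self.chunks.append(data)


def body_text(page):
    parser = BodyTextParser()
    parser.feed(page)
    parser.close()
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(parser.chunks).split())


page = """<html><head><title>Page title</title>
<style>p { color: red }</style></head>
<body><p>This is paragraph <b>one</b>.</p>
<script>var x = 1;</script>
<p>This is paragraph <b>two</b>.</p>
</body></html>"""

print(body_text(page))
# This is paragraph one. This is paragraph two.
```

The title is left out here because it sits in <head>, which matches the //body selection in the one-liner above.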

score 2 · Answer 3 · edited Aug 03 '12 at 14:19

You want to look at "Extracting data from HTML documents" in Dive into Python, because here it does (almost) exactly what you want.

edited Aug 03 '12 at 14:19

Bill the Lizard

answered Mar 10 '10 at 13:15

Pratik Deoghare

Links are broken – winklerrr Oct 16 '18 at 08:40

score 1 · Answer 4 · answered Mar 10 '10 at 12:49

The best modules for this task are lxml or html5lib; Beautiful Soup is IMHO no longer worth using. And for recursive structures, regular expressions are definitely the wrong method.

answered Mar 10 '10 at 12:49

Christian Hausknecht

Would you like to explain why Beautiful Soup is no longer worth using? – Oddthinking Mar 10 '10 at 13:52
Seconded. What changed about HTML that made Beautiful Soup irrelevant? It abstracts away a lot of the issues with imperfect HTML. – Tom Mar 10 '10 at 13:59

score -2 · Answer 5 · answered Mar 10 '10 at 12:46

If I understand your question correctly, this can be done simply with the urlopen function of urllib. Use it to open a URL and read the response, which will be the HTML code of that page.

answered Mar 10 '10 at 12:46

Ankit

you're not getting it right, OP says: *I have downloaded the web page into an html file.* – SilentGhost Mar 10 '10 at 12:49

score -3 · Answer 6 · answered Mar 10 '10 at 12:34

The quickest way to get a usable sample of what a browser would display is to remove any tags from the HTML and print the rest. This can, for example, be done using Python's re module.
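
A minimal sketch of that rough approach, with its limits shown rather than hidden. The pattern and the name strip_tags_rough are assumptions for illustration; this is deliberately not a full HTML parser.

```python
import re

# Rough pattern: anything from "<" up to the next ">".
TAG = re.compile(r"<[^>]+>")


def strip_tags_rough(page):
    # Delete everything that looks like a tag and keep the rest unchanged.
    return TAG.sub("", page)


print(strip_tags_rough("<p>This is paragraph <b>one</b>.</p>"))
# This is paragraph one.

# Limits: style rules survive as text, and entities stay encoded.
print(strip_tags_rough("<style>p { color: red }</style><p>Fish &amp; chips</p>"))
# p { color: red }Fish &amp; chips
```

The second call shows why the result is only a rough sample: a browser would not display the style rule, and it would render the entity as "&".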

answered Mar 10 '10 at 12:34

Alexander Gessler

this cannot be done using regex. please, don't confuse people. – SilentGhost Mar 10 '10 at 12:36
Please explain. I'm not talking about a perfect solution, just about a rough way to get the contents in acceptable quality (I'm aware that the approach is limited). Removing tags is just looking for `<..> and `, so why exactly is it not possible using regexes? – Alexander Gessler Mar 10 '10 at 12:50
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – SilentGhost Mar 10 '10 at 12:56
"Now you have two problems" http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html – Oddthinking Mar 10 '10 at 14:04
