Regex for extracting all regular text from HTML in Python

Question

How do I extract everything that is not an HTML tag from a partial HTML text?

That is, if I have something of the type:

<div>Hello</div><h3><div>world</div></h3>

I want to extract ['Hello','world']

I thought about the Regex:

>[a-zA-Z0-9]+<

but it will not match special characters or Chinese or Hebrew characters, which I need.

You don't. Do not use `regex` for HTML. Use an (X)HTML Parser like BeautifulSoup. — g.d.d.c, Feb 07 '13 at 19:19
If you want to strip HTML tags: http://stackoverflow.com/questions/3662142/how-to-remove-tags-from-a-string-in-python-using-regular-expressions-not-in-ht — Fabien Sa, Feb 07 '13 at 19:20
The key question is how complex is your html. Are there container tags like ` — georg, Feb 07 '13 at 19:38

score 3 · Answer 1 · edited May 23 '17 at 12:18

You should look at something like the post "regular expression to extract text from HTML".

From that post:

You can't really parse HTML with regular expressions. It's too complex. RE's won't handle everything HTML allows: some markup will work in a browser as proper text, but might baffle a naive RE.

You'll be happier and more successful with a proper HTML parser. Python folks often use something like Beautiful Soup to parse HTML and strip out tags and scripts.

Also, browsers, by design, tolerate malformed HTML. So you will often find yourself trying to parse HTML which is clearly improper, but happens to work okay in a browser.

You might be able to parse bad HTML with RE's. All it requires is patience and hard work. But it's often simpler to use someone else's parser.
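As a minimal sketch of the parser-based approach the quoted post recommends, here is one way to collect the text between tags using only `html.parser` from Python's standard library. The names `TextExtractor` and `extract_text` are hypothetical helpers made up for this illustration, and the sample input is the one from the question. Note that this simple version also collects the contents of `<script>` and `<style>` elements, which a real page would probably want filtered out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects every non-empty run of text found outside of tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # The parser calls this for text between tags, whatever the script
        # (Latin, Hebrew, Chinese, ...), so no character class is needed.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    return parser.chunks


print(extract_text("<div>Hello</div><h3><div>world</div></h3>"))
# ['Hello', 'world']
print(extract_text("<p>שלום</p><p>你好</p>"))
```

Because the parser only reports text nodes, non-ASCII content comes through unchanged, which was the sticking point with `>[a-zA-Z0-9]+<`.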

also http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Guillaume Algis, Feb 07 '13 at 19:19

score 1 · Answer 2 · piokuc · edited Feb 07 '13 at 19:43

As Avi already pointed out, this is too complex a task for regular expressions. Use get_text from BeautifulSoup or clean_html from nltk to extract text from your HTML.

from bs4 import BeautifulSoup
# Name the parser explicitly; bs4 warns when it has to guess one.
clean_text = BeautifulSoup(html, "html.parser").get_text()

or

import nltk
# Only works in NLTK 2.x: NLTK 3 removed clean_html, and calling it raises NotImplementedError.
clean_text = nltk.clean_html(html)

Another option, thanks to GuillaumeA, is to use pyquery:

from pyquery import PyQuery
# .text() returns the text content; PyQuery(html) alone is just the parsed document.
clean_text = PyQuery(html).text()

It must be said that the above-mentioned HTML parsers will do the job with varying levels of success if the HTML is not well formed, so you should experiment and see what works best for your input data.

edited Feb 07 '13 at 19:43

answered Feb 07 '13 at 19:20

piokuc


I'd personally recommend [pyquery](http://packages.python.org/pyquery/). It is easy to use and blazing fast. – Guillaume Algis Feb 07 '13 at 19:23
How exactly would you extract the text from html using pyquery? – piokuc Feb 07 '13 at 19:30
`>>> from pyquery import PyQuery` `>>> d = PyQuery('<div>Hello</div><h3><div>world</div></h3>')` `>>> d.text()` `'Hello world'` – Guillaume Algis Feb 07 '13 at 19:36

score -1 · Answer 3 · answered Feb 07 '13 at 19:23

I am not familiar with Python, but the following regular expression can help you.

<\s*(\w+)[^/>]*>

where,

<: starting character

\s*: it may have whitespaces before tag name (ugly but possible).

(\w+): tags can contain letters and numbers (h1). Well, \w also matches '_', but it does not hurt I guess. If curious use ([a-zA-Z0-9]+) instead.

[^/>]*: anything except > and / until closing >

>: closing >
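To see what this pattern does in Python, here is a small sketch with the standard `re` module, run on the sample from the question. The variable names are made up for the illustration. The pattern matches opening tags only: a closing tag such as `</div>` is skipped because `\w+` cannot start at `/`. So on its own it locates tags rather than extracting the text between them.

```python
import re

# The pattern from this answer, unchanged.
TAG_RE = re.compile(r"<\s*(\w+)[^/>]*>")

sample = "<div>Hello</div><h3><div>world</div></h3>"

# With a single capture group, findall returns the captured tag names.
tag_names = TAG_RE.findall(sample)
print(tag_names)  # ['div', 'h3', 'div']

# Removing the matches leaves the closing tags behind.
leftover = TAG_RE.sub("", sample)
print(leftover)  # Hello</div>world</div></h3>
```

Getting from here to `['Hello', 'world']` would need a second pattern for closing tags at the very least, which is the kind of piecemeal work the other answers suggest leaving to a parser.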
