Python html processing

Question

I have an HTML file with Russian text. How can I get all the words in the text, without HTML tags, special symbols, etc.?

Example:

<html>...<body>...<div id='text'>Foo bar! Foo, bar.</div></body></html>

I need:

['Foo','bar','Foo','bar']

I tried NLTK, but it does not support Russian words.

http://stackoverflow.com/questions/753052/strip-html-from-strings-in-python — reclosedev, Feb 10 '12 at 16:04
possible duplicate of [Removing HTML tags from a unicode string in Python](http://stackoverflow.com/questions/3224358/removing-html-tags-from-a-unicode-string-in-python) — Steven Rumbalski, Feb 10 '12 at 16:27

score 4 · Answer 1 · answered Feb 10 '12 at 16:05

Definitely try BeautifulSoup, it supports Unicode.

UncleZeiv


While I support this answer (+1), I feel that I should warn that BeautifulSoup is basically deprecated. I've used it before and love it, but there isn't as much official support for it anymore – inspectorG4dget Feb 10 '12 at 16:09
Not true -- as of yesterday, BeautifulSoup 4 is out in beta. – grifaton Feb 10 '12 at 16:26

score 4 · Answer 2 · answered Feb 10 '12 at 16:09

I use the lxml library to parse XML/HTML. lxml works well with any Unicode data.

Ivan Kolodyazhny


`lxml.html.fromstring(s).text_content()` But the OP wants to split the text into words too. – reclosedev Feb 10 '12 at 16:14
That would be a second stage of HTML processing. Splitting words is generally a simple task, but it can be very language-specific, so I prefer to do it as a separate stage after parsing the HTML. – Ivan Kolodyazhny Feb 10 '12 at 16:54
Could you provide an example of the HTML you need to parse? – Ivan Kolodyazhny Feb 10 '12 at 19:56
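
The second stage mentioned in these comments, splitting already-extracted text into words, can be sketched with the standard `re` module. In Python 3, `\w` matches Unicode letters by default, so Cyrillic needs no extra flags. The helper name `split_words` is made up for illustration, and treating a "word" as a run of letters (no digits, no underscore) is an assumption rather than something the question specifies.

```python
import re

# Assumption: a "word" is a run of Unicode letters.
# [^\W\d_] means "a \w character that is neither a digit nor an underscore".
WORD_RE = re.compile(r"[^\W\d_]+")

def split_words(text):
    """Return the runs of letters found in text, in document order."""
    return WORD_RE.findall(text)

print(split_words("Foo bar! Foo, bar."))
print(split_words("Привет, мир! Как дела?"))
```

Hyphenated words and words with apostrophes are split into pieces by this pattern, which may or may not be what a given language needs.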

score 0 · Answer 3 · answered Feb 11 '12 at 19:34

Use lxml. It can strip tags, elements, and more:

from urllib.request import urlopen  # Python 3 replacement for urllib2.urlopen

from lxml import etree

URL = 'http://stackoverflow.com/questions/9230675/python-html-processing'

html = urlopen(URL).read()
tree = etree.fromstring(html, parser=etree.HTMLParser())

tree.xpath('//script')
# [<Element script at 102f831b0>,
#  ...
#  <Element script at 102f83ba8>]

tree.xpath('//style')
# [<Element style at 102f83c58>]

tags_to_strip = ['script', 'style']
etree.strip_elements(tree, *tags_to_strip)

tree.xpath('//style')
# []

tree.xpath('//script')
# []

body = tree.xpath('//body')
body = body[0]

text = ' '.join(body.itertext())
tokens = text.split()
# [u'Stack',
#  u'Exchange',
#  u'log',
#  u'in',
#  ...
#  u'Stack',
#  u'Overflow',
#  u'works',
#  u'best',
#  u'with',
#  u'JavaScript',
#  u'enabled']

With Russian text, you get tokens that look like this:

# [u'\xd1\x8d\xd1\x84\xd1\x84\xd0\xb5\xd0\xba\xd1\x82\xd1\x8b\xe2\x80\xa6',
#  u'\xd0\x9c\xd0\xb0\xd1\x80\xd0\xba',
#  ...
#  u'\xd0\x9c\xd0\xb0\xd0\xb9\xd0\xb5\xd1\x80']

Error handling is left as homework.
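
The Russian tokens shown above look like UTF-8 bytes that were decoded one byte per character (as Latin-1). That suggests the parser guessed the wrong encoding, not that Cyrillic itself is a problem. Here is a minimal standard-library sketch of that diagnosis, assuming the page really was UTF-8. The helper name `repair_mojibake` is made up for illustration.

```python
def repair_mojibake(token):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return token unchanged if that does not apply."""
    try:
        # Map each character back to the byte it came from, then decode properly.
        return token.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable as Latin-1 bytes, or those bytes are not valid UTF-8.
        return token

# The second token from the output above.
broken = "\xd0\x9c\xd0\xb0\xd1\x80\xd0\xba"
print(repair_mojibake(broken))
```

Repairing after the fact is a workaround. With lxml it is probably cleaner to tell the parser the encoding up front, for example `etree.HTMLParser(encoding='utf-8')`, or to decode the downloaded bytes before parsing.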

score 0 · Answer 4 · answered Feb 23 '12 at 02:36

Use a regex to remove the tags. NLTK is about language analysis (nouns vs. verbs) and word meaning (semantics), not string removal and pattern matching, although I can see how someone may be confused.

Here is a removal function using a regex:

import re

def remove_html_tags(data):
    # Non-greedy: match from each '<' to the nearest following '>'.
    p = re.compile(r'<.*?>')
    return p.sub('', data)
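
A tag-stripping regex leaves the contents of `<script>` and `<style>` in the output and does not decode entities such as `&amp;`. As an alternative to the regex (not the approach above), here is a sketch using the standard library's `html.parser`. The class name `TextExtractor` and the function `html_to_text` are made up for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities like &amp; for us.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only while we are outside script/style elements.
        if not self._skip_depth:
            self.parts.append(data)

def html_to_text(markup):
    """Return the visible text of an HTML string, text nodes joined by spaces."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.parts)

sample = ("<html><head><style>p{}</style></head><body>"
          "<div id='text'>Foo bar! Foo, bar.</div>"
          "<script>var x = '<b>';</script></body></html>")
print(html_to_text(sample))
```

The result is plain text that still contains punctuation. Splitting it into words is a separate step.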
