Is there something like NLTK that comes with python, and that does not have to be installed?

Question

I'm using NLTK to strip tags and leave only the text in an HTML file.

NLTK installs in seconds on my Linux computer, but on Windows it's a pain to install, and if I'm having trouble with it, I know my client, who lives in a different country, won't be able to install the nltk module either.

What is a SIMPLE alternative that ships with Python and doesn't need to be installed? I need this as part of a script.

why would you want to use NLTK (natural language processing TK) for parsing html? — root, Oct 20 '12 at 14:06
possible duplicate of [Strip html from strings in python](http://stackoverflow.com/questions/753052/strip-html-from-strings-in-python) — Kate Gregory, Oct 20 '12 at 14:17

AKX · Answer 1 · 2012-10-20T14:34:25.007

1

Was the question "How to remove HTML tags from a string?"

import re

def strip_tags(s):
    # Remove anything that looks like a tag: "<", one or more non-">" characters, then ">".
    return re.sub(r"<[^>]+>", "", s)

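Also, for future reference, when a package is hard to install on Windows, you'll just want Christoph Gohlke's Python Extensions for Windows page, which provides unofficial prebuilt Windows binaries for many Python packages.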

EDIT: Fixed the regexp. D:

Double edit: inspired by the comments, here's an abomination.

def strip_tags(s):
    return re.sub(r"""</?\w+(\s*([^=]+=(?P<q>['"]).+?(?P=q))|\s*\w+(=\w+)?)*>""", "", s)

edited Oct 20 '12 at 14:34

answered Oct 20 '12 at 13:49

AKX

1

`strip_tags("
This doesn't work
")` We can play this game all day -- HTML isn't a regular language, and so you [shouldn't parse it with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – DSM Oct 20 '12 at 13:54
Crap, typo in the original regexp, @DSM. – AKX Oct 20 '12 at 14:06
Either way the revised regexp _will_ strip tags, leaving only the plain-text content. Not sure if that's what OP wanted, but. – AKX Oct 20 '12 at 14:07
AKX: okay, you want to play? Let's play. :^) `"""
Your revision doesn't work either.
"""`. – DSM Oct 20 '12 at 14:15
Heh - yeah, I knew that'd come up. The above `strip_tags` works for the majority of HTML one might see in the wild, but you're right, not everything. – AKX Oct 20 '12 at 14:20
I can't object too strongly -- I've done similar things myself when I just wanted stuff to work quickly -- but inevitably on larger data samples I wind up with something that wasn't escaped that should have been, or a missing tag somewhere that breaks everything. :-/ – DSM Oct 20 '12 at 14:21
@DSM: I added a terrible thing. – AKX Oct 20 '12 at 14:34
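
For comparison with the regex approach discussed in these comments, here is a minimal sketch that uses the HTML parser from the Python 3 standard library (`html.parser`; on Python 2 the module was named `HTMLParser`). The names `TextExtractor` and `strip_tags_parser` are hypothetical, chosen only for illustration. It is a sketch under those assumptions, not a complete solution: for example, it also keeps the contents of `<script>` and `<style>` elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into "&"
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for text found between tags
        self.chunks.append(data)

    def get_text(self):
        return "".join(self.chunks)


def strip_tags_parser(html):
    """Return the text content of an HTML string, with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()
```

Unlike the simple `<[^>]+>` pattern, this should cope with a `>` inside a quoted attribute value, because the parser tokenizes the attributes instead of stopping at the first `>`.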

score 0 · Answer 2 · answered Oct 20 '12 at 13:54

You could try:

import xml.etree.ElementTree as ET

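# ET.parse() needs a path to a well-formed XML/XHTML file; getroot() returns the top-level element
root = ET.parse('whatever').getroot()
# Keep the non-empty leading text of every descendant element; list() makes this a list on Python 3 too
text = list(filter(None, ((el.text or '').strip() for el in root.findall('.//*'))))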

Then what you do with text is up to you.
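
A minimal runnable sketch of the same idea, assuming the markup is well-formed XHTML passed in as a string (the helper name `text_from_xhtml` is hypothetical). It uses `itertext()` rather than `el.text`, so it also picks up text that follows a child element (the "tail"), which `el.text` alone would miss. Ordinary, non-well-formed HTML will likely raise `ET.ParseError` here.

```python
import xml.etree.ElementTree as ET


def text_from_xhtml(markup):
    """Return the non-empty text fragments of a well-formed XML/XHTML string."""
    root = ET.fromstring(markup)  # raises ET.ParseError on malformed input
    # itertext() yields each element's text and tail in document order
    return [t.strip() for t in root.itertext() if t.strip()]


fragments = text_from_xhtml("<div><p>Hello <b>world</b>!</p><p>Bye</p></div>")
print(" ".join(fragments))
```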

Jon Clements
