-3

I'm using NLTK to strip the tags from an HTML file and keep only the text.

NLTK installs in seconds on my Linux computer, but installing it on Windows is a pain. If I'm having trouble with it, my client, who lives in a different country, won't be able to install the nltk module either.

What is a SIMPLE alternative that ships with Python and doesn't need to be installed? I need this as part of a script.

Kate Gregory
user1718373
  • why would you want to use NLTK (natural language processing TK) for parsing html? – root Oct 20 '12 at 14:06
  • possible duplicate of [Strip html from strings in python](http://stackoverflow.com/questions/753052/strip-html-from-strings-in-python) – Kate Gregory Oct 20 '12 at 14:17

2 Answers

1

If the question is "How do I remove HTML tags from a string?", a regular expression will do for simple input:

import re
def strip_tags(s):
    return re.sub("<[^>]+>", "", s)
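
A quick usage sketch of the function above, assuming Python 3 (the sample strings are made up for illustration). It shows the common case working, that character entities are left untouched unless you decode them separately with `html.unescape`, and one input where the simple pattern goes wrong: a `>` inside a quoted attribute value.

```python
import html
import re

def strip_tags(s):
    # Remove anything that looks like <...>; this is not a real HTML parser.
    return re.sub("<[^>]+>", "", s)

# Common case: simple, well-behaved markup.
print(strip_tags("<p>Hello <b>world</b></p>"))  # Hello world

# Entities such as &amp; survive the regex; decode them as a second step.
print(html.unescape(strip_tags("<p>Fish &amp; chips</p>")))  # Fish & chips

# Known limitation: the match stops at the first '>', even inside quotes,
# so part of the attribute leaks into the output.
print(repr(strip_tags('<a title="a > b">link</a>')))  # ' b">link'
```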

Also, for future reference, Christoph Gohlke's Python Extensions for Windows page is the place to look when a package is hard to install on Windows.

EDIT: Fixed the regexp. D:

Double edit: inspired by the comments, here's an abomination.

def strip_tags(s):
    return re.sub(r"""</?\w+(\s*([^=]+=(?P<q>['"]).+?(?P=q))|\s*\w+(=\w+)?)*>""", "", s)
AKX
  • `strip_tags(" This doesn't work ")` We can play this game all day -- HTML isn't a regular language, and so you [shouldn't parse it with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – DSM Oct 20 '12 at 13:54
  • Crap, typo in the original regexp, @DSM. – AKX Oct 20 '12 at 14:06
  • Either way the revised regexp _will_ strip tags, leaving only the plain-text content. Not sure if that's what OP wanted, but. – AKX Oct 20 '12 at 14:07
  • AKX: okay, you want to play? Let's play. :^) `""" Try harder! :> Your revision doesn't work either. """`. – DSM Oct 20 '12 at 14:15
  • Heh - yeah, I knew that'd come up. The above `strip_tags` works for the majority of HTML one might see in the wild, but you're right, not everything. – AKX Oct 20 '12 at 14:20
  • I can't object too strongly -- I've done similar things myself when I just wanted stuff to work quickly -- but inevitably on larger data samples I wind up with something that wasn't escaped that should have been, or a missing tag somewhere that breaks everything. :-/ – DSM Oct 20 '12 at 14:21
  • @DSM: I added a terrible thing. – AKX Oct 20 '12 at 14:34
0

You could try:

import xml.etree.ElementTree as ET

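# ET.parse() needs well-formed XML; it raises ParseError on sloppy HTML
root = ET.parse('whatever').getroot()
# Collect the non-empty text of every descendant element
text = list(filter(None, ((el.text or '').strip() for el in root.findall('.//*'))))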

Then what you do with text is up to you.
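
Since ElementTree only accepts well-formed XML, another option that ships with Python is `html.parser.HTMLParser`, which is more forgiving of real-world HTML. The following is a minimal sketch, assuming Python 3; the names `TextExtractor` and `html_to_text` are made up for illustration, and skipping `script`/`style` content is a design choice rather than a requirement.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content, ignoring tags and script/style bodies."""

    def __init__(self):
        # convert_charrefs=True decodes entities like &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join the pieces and collapse runs of whitespace into single spaces
    return " ".join("".join(parser.parts).split())

sample = ('<p>Hello <b>world</b> &amp; <a title="a > b">link</a></p>'
          '<script>var x = 1;</script>')
print(html_to_text(sample))  # Hello world & link
```

To process a file, read it into a string first and pass that string to `html_to_text`.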

Jon Clements