python re prevent stripping whitespaces

Question

I'm not sure why the following Python code also removes whitespace, but it does. Could someone please explain how I could do this without that happening? Thank you!

import re

text = html  # html holds the source markup and is defined elsewhere
rules = [
    { r'>\s+' : u'>'},
    { r'\s+' : u' '},
    { r'\s*<br\s*/?>\s*' : u'\n'},
    { r'</(div)\s*>\s*' : u'\n'},
    { r'</(p|h\d)\s*>\s*' : u'\n\n'},
    { r'<head>.*<\s*(/head|body)[^>]*>' : u'' },
    { r'<a\s+href="([^"]+)"[^>]*>.*</a>' : r'\1' },
    { r'[ \t]*<[^<]*?/?>' : u'' },
    { r'^\s+' : u'' }
]
# Apply each substitution in order; every rule is a one-entry dict.
for rule in rules:
    for (k, v) in rule.items():
        regex = re.compile(k)
        text = regex.sub(v, text)
print(text)

You really shouldn't try to parse HTML using regexes. It will all end in tears. — Wooble, Apr 25 '12 at 12:52
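As a rough illustration of the parser-based route this comment points toward, here is a minimal sketch using the standard library's html.parser (Python 3). The class name TextExtractor and the sample markup are hypothetical, chosen only for this example; it is one possible approach, not a drop-in replacement for the rules in the question.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags around them (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags; whitespace is left untouched.
        self.parts.append(data)


extractor = TextExtractor()
extractor.feed("<p>Hello <b>big</b> world</p>")  # sample input, not from the question
extractor.close()
text = "".join(extractor.parts)
print(text)
```

Because the parser hands over the text nodes unchanged, the spaces between words survive without any whitespace rule having to protect them.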

score 1 · Accepted Answer · answered Apr 25 '12 at 12:49

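As you can read in the docs: http://docs.python.org/library/re.html

The \s sequence matches any whitespace character (spaces, tabs, and newlines). So the bottom rule, ^\s+, removes all whitespace at the start of the text, and every other rule containing \s consumes whatever whitespace it matches.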
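A small sketch of what \s covers, and of one way to narrow the match, might help here. The sample string is made up for illustration, and using [ \t] in place of \s is only one possible choice (the question's own rule 8 already uses that class).

```python
import re

sample = "  first line\n\tsecond   line\n"  # made-up input for illustration

# \s matches spaces, tabs, and newlines alike.
matched = re.findall(r"\s", "a b\tc\nd")

# Without re.MULTILINE, ^ anchors only at the start of the string,
# so ^\s+ strips leading whitespace once and leaves the rest alone.
stripped_start = re.sub(r"^\s+", "", sample)

# [ \t]+ restricts the match to spaces and tabs, so newlines survive.
collapsed = re.sub(r"[ \t]+", " ", sample)

print(repr(stripped_start))
print(repr(collapsed))
```

Swapping \s for an explicit character class is the simplest way to control which kinds of whitespace a rule is allowed to consume.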

answered Apr 25 '12 at 12:49

Wolph


score 0 · Answer 2 · answered Apr 25 '12 at 12:51

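In addition to WoLpH's answer, your first five regexes each end with some variant of \s, and their replacement strings contain no trailing whitespace other than newlines (the second one collapses each run to a single space). The whitespace those patterns match is therefore dropped from the output.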
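To make that effect concrete, here is a short sketch that applies two of the question's rules to a made-up sample, followed by one hypothetical variant that keeps the spaces between words. The variant is an assumption about what the asker wants, not a tested fix for real-world HTML.

```python
import re

html = "<p>Hello <b>big</b> world</p>"  # made-up sample, not from the question

# Rule 1 from the question: whitespace after '>' is matched and dropped.
after_rule1 = re.sub(r">\s+", ">", html)

# Rule 8 from the question: the leading [ \t]* also swallows the space
# in front of each tag, so the words end up glued together.
after_tags = re.sub(r"[ \t]*<[^<]*?/?>", "", after_rule1)

# Hypothetical variant: remove only the tags themselves, then collapse
# runs of spaces and tabs to one space so word boundaries survive.
kept = re.sub(r"<[^<]*?/?>", "", html)
kept = re.sub(r"[ \t]+", " ", kept)

print(after_tags)
print(kept)
```

The difference comes down to whether whitespace sits inside the matched pattern: anything the pattern matches is replaced, so whitespace that should survive has to stay outside the match or be put back by the replacement string.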

answered Apr 25 '12 at 12:51

mgilson

