
I am creating a Python program that crawls and indexes a site. When I run my current code I get the error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 0: character maps to <undefined>

I'm not sure why this error is occurring, but I believe it is due to my regular expressions. I decode the text and then run it through multiple regexes to remove links, brackets, hex values, etc.

if isinstance(page_contents, bytes):  # bytes to string
    c = page_contents.decode('utf-8')
else:
    c = page_contents
if isinstance(c, bytes):
    print('page not converted to string')

## the regex route
c = re.sub(r'\\n|\\r|\\t', ' ', c)  # get rid of literal \n, \r, \t sequences
c = re.sub(r"\\'", "'", c)  # replace \' with '
c = re.sub(r'<script.*?script>', ' ', c, flags=re.DOTALL)  # get rid of scripts
c = re.sub(r'<!\[CDATA\[.*?\]\]', ' ', c, flags=re.DOTALL)  # get rid of CDATA ?redundant
c = re.sub(r'<link.*?link>|<link.*?>', ' ', c, flags=re.DOTALL)  # get rid of links
c = re.sub(r'<style.*?style>', ' ', c, flags=re.DOTALL)  # get rid of styles
c = re.sub(r'<.*?>', ' ', c, flags=re.DOTALL)  # get rid of HTML tags
c = re.sub(r'\\x..', ' ', c)  # get rid of literal hex escape sequences
c = re.sub(r'<!--|-->', ' ', c, flags=re.DOTALL)  # get rid of comment delimiters
c = re.sub(r'<|>', ' ', c)  # get rid of stray angle brackets
c = re.sub(r'&.*?;|#.*?;', ' ', c)  # get rid of HTML entities
page_text = re.sub(r'\s+', ' ', c)  # replace runs of whitespace with a single space

I then split the document up into individual words, which are then sorted and processed. The problem occurs when I print the results: the loop prints the data for the first URL (document) fine, but when it moves on to the second, the error is raised.
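
For reference, the offending character can be reproduced in isolation (this assumes a Windows console, where `sys.stdout` uses the cp1252 'charmap' codec):

# U+200B (ZERO WIDTH SPACE) survives the UTF-8 decode, but cp1252
# has no mapping for it, so encoding it for the console fails:
'\u200b'.encode('cp1252')
# UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 0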

docids.append(url)
docid = str(docids.index(url))

##### stemming and other processing goes here #####
# page_text is the initial content, transformed to words
words = page_text
#   Send document to stemmer
stemmed_doc = stem_doc(words)

# add the vocab counts and postings
for word in stemmed_doc.split():
    if word in vocab:
        vocab[word] += 1
    else:
        vocab[word] = 1
    if word not in postings:
        postings[word] = [docid]
    elif docid not in postings[word]:
        postings[word].append(docid)

    print('make_index3: docid=', docid, ' word=', word, ' count=', vocab[word], ' postings=', postings[word])
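
One way to keep a debug print like this from crashing on characters the console codec cannot represent is to replace them instead of raising (a sketch; `reconfigure` assumes Python 3.7+, and on older versions setting the `PYTHONIOENCODING` environment variable has a similar effect):

import sys

# Substitute '?' for characters the console codec (e.g. cp1252) cannot
# represent, instead of raising UnicodeEncodeError inside print().
sys.stdout.reconfigure(errors='replace')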

I would like to know whether this error is due to an incorrect regex or whether something else is going on.

Solved

I added the expression

c = re.sub(r'[\W_]+', ' ', c)

which replaces all non-alphanumeric characters with a space.
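
A quick check of why this also cures the print error: U+200B is not a word character, so `\W` matches it and the substitution strips it before anything reaches `print` (Python 3):

import re

# The zero-width space falls under \W, so it is replaced by a space.
print(re.sub(r'[\W_]+', ' ', 'index\u200bterm'))  # 'index term'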

  • http://stackoverflow.com/a/17551962/1172714 – dsh Nov 04 '15 at 14:26
  • Please check [this answer of mine](http://stackoverflow.com/a/33128359/3832970), I hope it will help. – Wiktor Stribiżew Nov 04 '15 at 14:28
  • Are you trying to parse HTML with regex? That's generally not a great idea - *use an HTML parser.* – jonrsharpe Nov 04 '15 at 14:42
  • [Insert link to The Famous Answer here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Roberto Nov 04 '15 at 14:47
  • Other comments here are correct that your approach of using regexes to "sanitize" the page contents is fundamentally flawed. But your problem here isn't with the regex, it's with how you convert the bytes to a string. Not all web pages will use UTF-8. Instead you need to parse the `Content-Type` header (which can be overridden in a `<meta>` tag) to determine the correct encoding. – Daniel Pryden Nov 04 '15 at 15:19
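
Putting the last two comments together (decode using the charset the server declares, then extract text with a real HTML parser), here is a minimal stdlib-only sketch; the `TextExtractor` name is made up for illustration, and `url` is the page being crawled, as in the question:

from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

resp = urlopen(url)
# Use the charset from the Content-Type header, falling back to UTF-8.
charset = resp.headers.get_content_charset() or 'utf-8'
html = resp.read().decode(charset, errors='replace')

extractor = TextExtractor()
extractor.feed(html)
extractor.close()
page_text = ' '.join(' '.join(extractor.parts).split())  # collapse whitespace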

2 Answers


The problem you are getting seems to be with encoding, not with the regex. Have you tried changing

c = page_contents.decode('utf-8')

and using another encoding, for example:

c = page_contents.decode('latin-1')

?
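
For what it's worth, latin-1 maps every byte value to a code point, so the decode step itself can never raise with it, though multi-byte UTF-8 text will come out garbled:

raw = 'café'.encode('utf-8')   # b'caf\xc3\xa9'
print(raw.decode('latin-1'))   # 'cafÃ©' -- no error, but not the intended text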

nsm
  • The correct encoding to use will be part of the HTTP response, either in the `Content-Type` header or in a `<meta>` tag. Simply guessing a different encoding isn't any better. – Daniel Pryden Nov 04 '15 at 15:13
  • ok, I was just stating that the problem reported was not with the regex but with encoding, and suggesting a way to check that, not giving a solution. – nsm Nov 04 '15 at 15:18
  • Yep... you are correct about that (I just added a similar comment above). – Daniel Pryden Nov 04 '15 at 15:20
This worked; it replaces all non-alphanumeric characters with a space:

c = re.sub(r'[\W_]+', ' ', c)