I am creating a Python program that crawls and indexes a site. When I run my current code I get this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 0: character maps to <undefined>
I'm not sure why this error is occurring, but I believe it is due to my regular expressions. I decode the text, then run it through multiple regular expressions to remove all links, brackets, hex values, etc.
if isinstance(page_contents, bytes):  # bytes to string
    c = page_contents.decode('utf-8')
else:
    c = page_contents
if isinstance(c, bytes):
    print(' page not converted to string')
## the regex route
c = re.sub(r'\\n|\\r|\\t', ' ', c) # get rid of escaped newlines, carriage returns, tabs
c = re.sub(r"\\'", "'", c) # replace \' with '
c = re.sub(r'<script.*?script>', ' ', c, flags=re.DOTALL) # get rid of scripts
c = re.sub(r'<!\[CDATA\[.*?\]\]', ' ', c, flags=re.DOTALL) # get rid of CDATA ?redundant
c = re.sub(r'<link.*?link>|<link.*?>', ' ', c, flags=re.DOTALL) # get rid of links
c = re.sub(r'<style.*?style>', ' ', c, flags=re.DOTALL) # get rid of styles
c = re.sub(r'<.*?>', ' ', c, flags=re.DOTALL) # get rid of HTML tags
c = re.sub(r'\\x..', ' ', c) # get rid of hex values
c = re.sub(r'<!--|-->', ' ', c, flags=re.DOTALL) # get rid of leftover comment delimiters
c = re.sub(r'<|>', ' ', c) # get rid of stray angle brackets
c = re.sub(r'&.*?;|#.*?;', ' ', c) # get rid of HTML entities
page_text = re.sub(r'\s+', ' ', c) # replace runs of whitespace with a single space
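As a point of comparison for the regex route above, here is a minimal sketch of the same cleanup using the standard library's `html.parser` instead. The names `TextExtractor` and `html_to_text` are made up for illustration and are not part of my crawler; the sketch assumes the page has already been decoded to a `str`.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script>/<style> bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()
```

Comments are dropped because `handle_comment` is not overridden. Note that this only changes how the markup is stripped; characters such as `'\u200b'` in the page text would still be passed through unchanged.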
I then split the document into individual words, which are then sorted and processed. The problem occurs when I print the results: the loop prints the data for the first URL (document) without issue, but the error is raised when it moves on to the second.
docids.append(url)
docid = str(docids.index(url))
##### stemming and other processing goes here #####
# page_text is the initial content, transformed to words
words = page_text
# Send document to stemmer
stemmed_doc = stem_doc(words)
# add the vocab counts and postings
for word in stemmed_doc.split():
    if word in vocab:
        vocab[word] += 1
    else:
        vocab[word] = 1
    if word not in postings:
        postings[word] = [docid]
    elif docid not in postings[word]:
        postings[word].append(docid)
    print('make_index3: docid=', docid, ' word=', word, ' count=', vocab[word], ' postings=', postings[word])
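The counting and postings logic above can also be written with `collections`. This is only a sketch under the assumption that `vocab` maps each term to its total count and `postings` maps each term to the list of docids containing it; `add_to_index` is a hypothetical name, and the input strings stand in for the output of `stem_doc`.

```python
from collections import Counter, defaultdict

vocab = Counter()             # term -> total occurrences across all documents
postings = defaultdict(list)  # term -> docids of documents containing the term


def add_to_index(docid, stemmed_doc):
    """Hypothetical helper: update vocab and postings for one stemmed document."""
    for word in stemmed_doc.split():
        vocab[word] += 1
        if docid not in postings[word]:
            postings[word].append(docid)


# Stand-in documents; in the crawler these would come from stem_doc(words)
add_to_index("0", "crawl index crawl")
add_to_index("1", "index site")
```

`Counter` and `defaultdict` remove the need for the explicit "is the key already there" branches, but the resulting data has the same shape as before.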
I would like to know whether this error is due to an incorrect regex or whether something else is going on.
Solved
I added the expression
c = re.sub(r'[\W_]+', ' ', c)
which replaces every run of non-alphanumeric characters (including underscores) with a single space.
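As far as I can tell, the regexes were not the real cause. `'\u200b'` is a zero-width space that was in the page text, and the `UnicodeEncodeError` is raised when `print` tries to encode it for a console whose code page has no such character. The `[\W_]+` substitution works because it happens to strip that character, but letters from other scripts count as word characters in Python 3 and could still trigger the same error. Below is a small sketch of that behaviour, assuming the console encoding is something like cp1252 (my guess from the 'charmap' message); `encodable` and `printable` are hypothetical helper names.

```python
import sys

text = "\u200bhello"  # zero-width space followed by ordinary text


def encodable(s, encoding):
    """Return True if every character of s can be encoded with `encoding`."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def printable(s, encoding=None):
    """Replace characters the target encoding cannot represent with '?'.

    Defaults to the encoding of sys.stdout, falling back to UTF-8.
    """
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return s.encode(encoding, errors="replace").decode(encoding)


# Safe to print regardless of the console encoding
print(printable(text))
```

Another option, which I have not tried in this program, would be to make the output stream itself UTF-8, for example by setting the `PYTHONIOENCODING` environment variable or calling `sys.stdout.reconfigure(encoding='utf-8')` on Python 3.7 or later.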