Whilst searching for a text classification method, I came across this Python code, which was used in the pre-processing step:
import re
from nltk.corpus import stopwords

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
    text: a string
    return: modified initial string
    """
    text = text.lower()  # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace each character matched by REPLACE_BY_SPACE_RE with a space
    text = BAD_SYMBOLS_RE.sub('', text)  # delete each character matched by BAD_SYMBOLS_RE
    text = text.replace('x', '')
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # remove stopwords from text
    return text
I then tested this section of the code to understand its syntax and purpose:
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
text = '[0a;m]'
BAD_SYMBOLS_RE.sub(' ', text)
# returns ' 0a m ' whilst I thought it would return ' ; '
Question: why didn't the code replace 0, a, and m, although 0-9a-z was specified inside the [ ]? Why did it replace ; although that character wasn't specified?
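To make the behaviour concrete, here is a small sketch comparing three patterns on the same input. The variable names (negated, listed, anchored) are my own and only illustrative. It assumes nothing beyond the standard re module: a ^ placed first inside [ ] negates the character class, whereas a ^ outside [ ] anchors the match to the start of the string.

```python
import re

# Inside [...], a leading ^ negates the class: match any character NOT listed.
negated = re.compile(r'[^0-9a-z #+_]')
# Without the ^, the class matches exactly the listed characters.
listed = re.compile(r'[0-9a-z #+_]')
# Outside [...], ^ anchors the match to the start of the string.
anchored = re.compile(r'^[0-9a-z #+_]')

text = '[0a;m]'

negated_result = negated.sub(' ', text)    # '[', ';' and ']' become spaces
listed_result = listed.sub(' ', text)      # '0', 'a' and 'm' become spaces
anchored_result = anchored.sub(' ', text)  # first character is '[', so nothing matches

print(repr(negated_result))   # ' 0a m '
print(repr(listed_result))    # '[  ; ]'
print(repr(anchored_result))  # '[0a;m]'
```

So the result I expected corresponds to the pattern without the ^, not to the one in the original code.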
Edit to avoid being marked as a duplicate:
My perceptions of the code are:
- The line BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]') is confusing. Including the characters #, +, and _ inside the [ ] made me think the line was trying to remove the characters in the list (because no word in an English dictionary would contain the bad characters #, +, or _, I believe?). Consequently, I interpreted the ^ as the start of a string (instead of negation). Hence the original post (which was kindly answered by Tim Pietzcker and Raymond Hettinger). The two lines REPLACE_BY_SPACE_RE and BAD_SYMBOLS_RE should have been combined into one, such as
REMOVE_PUNCT = re.compile('[^0-9a-z]')
text = REMOVE_PUNCT.sub('', text)
- I also think the code text = text.replace('x', '') (which was meant to remove the IDs that were masked as XXX-XXXX.... in the raw data) will lead to bad outcomes; for example, the word next will become net.
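A rough sketch of both points above, on a made-up sample string. REMOVE_PUNCT is the pattern I proposed; REMOVE_PUNCT_KEEP_SPACE and MASKED_ID_RE are hypothetical names and patterns of my own, and MASKED_ID_RE assumes that a masked ID ends up as a standalone token made only of two or more x characters. This is an illustration under those assumptions, not a recommended pipeline.

```python
import re

# Pattern proposed above: delete everything except digits and lowercase letters.
REMOVE_PUNCT = re.compile(r'[^0-9a-z]')
# Hypothetical variant that also keeps spaces, so word boundaries survive.
REMOVE_PUNCT_KEEP_SPACE = re.compile(r'[^0-9a-z ]')
# Hypothetical pattern: whole tokens consisting only of two or more x's.
MASKED_ID_RE = re.compile(r'\bx{2,}\b')

sample = 'next call xxx-xxxx, thanks'  # made-up example text

# The proposed pattern also deletes spaces, gluing the words together.
merged = REMOVE_PUNCT.sub('', sample)

# Keeping the space in the class preserves the tokens.
kept = REMOVE_PUNCT_KEEP_SPACE.sub('', sample)

# str.replace removes every x, including the one inside "next".
naive = ' '.join(kept.replace('x', '').split())

# The targeted pattern removes only the standalone run of x's.
targeted = ' '.join(MASKED_ID_RE.sub('', kept).split())

print(merged)    # nextcallxxxxxxxthanks
print(kept)      # next call xxxxxxx thanks
print(naive)     # net call thanks
print(targeted)  # next call thanks
```

Note that my combined pattern, as written, also strips the spaces, so it would need a space inside the [ ] to keep the words separate.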
Additional questions:
Are my perceptions reasonable?
Should numbers/digits be removed from text?
Could you please recommend a general strategy (or code) for pre-processing English text for classification?