Generalised method to clean data for text classification

Question

Whilst searching for a text classification method, I came across this Python code which was used in the pre-processing step

import re
from nltk.corpus import stopwords  # needs the NLTK 'stopwords' corpus to be downloaded

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
        text: a string 
        return: modified initial string
    """
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace each character matched by REPLACE_BY_SPACE_RE with a space
    text = BAD_SYMBOLS_RE.sub('', text) # delete each character matched by BAD_SYMBOLS_RE
    text = text.replace('x', '')
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # remove stopwords from text
    return text

OP

I then tested this section of code to understand the syntax and its purpose

BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
text = '[0a;m]'
BAD_SYMBOLS_RE.sub(' ', text)
# returns ' 0a m ' whilst I thought it would return '   ;  '

Question: why didn't the code replace 0, a, and m although 0-9a-z was specified inside the [ ]? Why did it replace ; although that character wasn't specified?

Edit to avoid being marked as a duplicate:

My perceptions of the code are:

The line BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]') is confusing. Including the characters #, +, and _ inside the [ ] made me think the line was trying to remove the characters in the list (because no word in an English dictionary would contain those bad characters #+_, I believe?). Consequently, I interpreted the ^ as the start of a string (instead of negation). Hence the original post (which was kindly answered by Tim Pietzcker and Raymond Hettinger). The two lines REPLACE_BY_SPACE_RE and BAD_SYMBOLS_RE should have been combined into one, such as

REMOVE_PUNCT = re.compile('[^0-9a-z]')
text = REMOVE_PUNCT.sub('', text)

I also think the code text = text.replace('x', '') (which was meant to remove the IDs that were masked as XXX-XXXX.... in the raw data) will lead to a bad outcome; for example, the word next will become net.
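Both concerns above can be checked with a short sketch. The sample sentence and the `MASKED_ID_RE` pattern are made up for illustration; the pattern assumes that masked IDs always appear as runs of two or more `x` characters, optionally joined by hyphens, which may not hold for the real data.

```python
import re

sample = "next payment for account xxx-xxxx is due"

# Removing every 'x' also damages ordinary words such as "next".
naive = sample.replace('x', '')
print(repr(naive))  # 'net payment for account - is due'

# A single class like [^0-9a-z] replaced by '' also deletes the spaces,
# so neighbouring words are glued together.
glued = re.sub(r'[^0-9a-z]', '', 'hello, world')
print(repr(glued))  # 'helloworld'

# Hypothetical alternative: only remove whole tokens made of x-runs.
MASKED_ID_RE = re.compile(r'\bx{2,}(?:-x{2,})*\b')
cleaned = ' '.join(MASKED_ID_RE.sub('', sample).split())
print(repr(cleaned))  # 'next payment for account is due'
```

The word boundaries (`\b`) keep the `x` inside "next" from matching, because it sits between two other word characters.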

Additional questions:

Are my perceptions reasonable?
Should numbers/digits be removed from text?
Could you please recommend an overall/general strategy/code for text pre-processing for (English) text classification?

The `^` at the beginning of your character class means match any character **not** in this class — Nick, Jan 28 '20 at 06:42

score 2 · Answer 1 · answered Jan 28 '20 at 06:44

Here's some documentation about character classes.

Basically, [abc] means "any one of a, b, or c" whereas [^abc] means "any character that is not a, b, or c".

So your regex operation removes every non-digit, non-letter character except space, #, + and _ from the string, which explains the result you're getting.
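A minimal sketch of the difference, using the string from the question. The variable names `listed` and `not_listed` are only illustrative.

```python
import re

text = '[0a;m]'

# Plain class: matches any one of the listed characters.
listed = re.compile(r'[0-9a-z #+_]')
# Negated class: matches any one character that is NOT listed.
not_listed = re.compile(r'[^0-9a-z #+_]')

kept = listed.findall(text)          # characters the negated class leaves alone
replaced = not_listed.findall(text)  # characters the negated class matches
result = not_listed.sub(' ', text)

print(kept)          # ['0', 'a', 'm']
print(replaced)      # ['[', ';', ']']
print(repr(result))  # ' 0a m '
```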

Tim Pietzcker


Thanks, @Tim Pietzcker. I overlooked the meaning of `^`, thinking it referred to the start of a string! Could you please confirm that text.replace('x', '') would lead to an unintended outcome before I accept your answer? – Nemo Jan 28 '20 at 07:57

Yes, outside of a character class, `^` does mean "start of string". And your suspicion about what happens with `text.replace('x', '')` is correct. – Tim Pietzcker Jan 28 '20 at 08:03
Hi Tim. Sorry I had to reverse my acceptance as there were additional questions. – Nemo Jan 28 '20 at 11:25

Raymond Hettinger · Answer 2 · 2020-01-28T07:01:53.107

General rules

Square brackets define a character class, which matches exactly one character from the set listed inside them.

Roughly [xyz] is a short-cut for (x|y|z) but without creating a group.

Likewise [a-z] is a short-cut for (a|b|c|...|y|z).
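As a rough check of that equivalence, the sketch below compares a class with the corresponding alternation on a made-up sample string. Both find the same characters, but only the parenthesised form creates a capture group.

```python
import re

text = 'lazy fox'

# Both patterns match one character at a time: x, y, or z.
class_matches = [m.group(0) for m in re.finditer(r'[xyz]', text)]
alt_matches = [m.group(0) for m in re.finditer(r'(x|y|z)', text)]
print(class_matches, alt_matches)  # ['z', 'y', 'x'] ['z', 'y', 'x']

# Only the alternation in parentheses produces a group.
class_groups = re.search(r'[xyz]', text).groups()
alt_groups = re.search(r'(x|y|z)', text).groups()
print(class_groups, alt_groups)  # () ('z',)
```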

The interpretation of character sets can be a little tricky. The start and end points get converted to their ordinal positions and the range of matching characters is inferred from there. For example [A-z] converts A to 65 and z to 122, so everything from 65 to 122 is included. That means that it also matches characters like ^ which convert to 94. It also means that characters like ö won't match because that converts to 246 which is outside the range.
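The ordinal values mentioned above can be inspected with `ord`. This is a small sketch; the names `wide` and `extras` are illustrative only.

```python
import re

o_umlaut = '\u00f6'  # the character ö as a single code point

print(ord('A'), ord('z'), ord('^'), ord(o_umlaut))  # 65 122 94 246

wide = re.compile(r'[A-z]')
matches_caret = bool(wide.fullmatch('^'))          # 94 lies inside 65..122
matches_o_umlaut = bool(wide.fullmatch(o_umlaut))  # 246 lies outside 65..122
print(matches_caret, matches_o_umlaut)  # True False

# The non-letters that [A-z] picks up sit between 'Z' (90) and 'a' (97).
extras = [chr(c) for c in range(ord('Z') + 1, ord('a'))]
print(extras)  # ['[', '\\', ']', '^', '_', '`']
```

Writing `[A-Za-z]` instead avoids the six unintended characters.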

Another interesting form of character class uses the ^ to invert the selection. For example, [^a-z] means "any character not in the range from a to z".
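The position of the caret decides its meaning, which is easy to mix up. A small sketch on a made-up string: as the first character inside the brackets it negates the class, elsewhere inside the brackets it is a literal caret, and outside the brackets it anchors the match to the start of the string.

```python
import re

text = 'a^b'

negated = re.findall(r'[^a-z]', text)   # any character that is not a-z
literal = re.findall(r'[a-z^]', text)   # a-z or a literal caret
anchored = re.findall(r'^[a-z]', text)  # one a-z character at the start

print(negated)   # ['^']
print(literal)   # ['a', '^', 'b']
print(anchored)  # ['a']
```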

The full details are in the "character sets" section of the re docs.

Specific Problem

In the OP's example, BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]'), the caret ^ at the beginning inverts the class, so the pattern matches every character except the listed ones.

That is why the code didn't replace 0, a, and m although 0-9a-z was specified inside the [ ]. Essentially, it treated the specified characters as good characters.

Hope this helps :-)

I'm learning new information (about ordinal positions)! Thanks, Raymond. I *always* feel bewildered when interpreting a regular expression as it has so many rules of which some of them seem overlapping! — Nemo, Jan 28 '20 at 07:05
@Nemo You're not the only one :-) Very few people know all of the symbols and all of the rules. — Raymond Hettinger, Jan 28 '20 at 07:58
