Regex conflict for certain characters (ISO-8859-1 Windows-1252)

Question

All - I'm trying to perform a regex substitution on a bunch of science data, converting certain special symbols into ASCII-friendly strings. For example, I want to replace 'µ' (UTF-8 \xc2\xb5) with the string 'micro', and '±' with '+/-'. I cooked up a Python script to do this, which looks like this:

import re
def stripChars(string):
    outString = (re.sub(r'\xc2\xb5+','micro', string)) #Metric 'micro (10^-6)' (Greek 'mu') letter
    outString = (re.sub(r'\xc2\xb1+','+/-', outString)) #Scientific 'Plus-Minus' symbol
    return outString

However, for these two specific characters, I'm getting strange results. I dug into it a bit, and it looks like I'm suffering from the bug described here, in which certain characters come out wrong because they are UTF-8 data being interpreted as Windows-1252 (or ISO 8859-1).

I grepped the relevant data and found the erroneous result there as well (e.g. the 'µ' appears as 'Âµ'). However, elsewhere in the same data set there are entries in which the same symbol is displayed correctly. This may be due to a bug in the system which collected the data in the first place. The real weirdness is that my current code seems to catch only the incorrect version, letting the correct one pass through.

In any case, I'm really stuck on how to proceed. I need to be able to come up with a series of regex substitutions which will catch both the correct and incorrect versions of these characters, but the identifier for the correct version is failing in this case.

I must admit, I'm still fairly junior at programming, and anything more than the most basic regex is still like black magic to me. This problem seems a bit more intractable than any I've had to tackle before, and that's why I'm bringing it here to get some more eyes on it.

Thanks!

What Python version are you using Python2.x or Python 3.x. It matters when it comes to non ascii processing... — Serge Ballesta, Jul 18 '18 at 07:40
The use of raw strings such as `r'\xc2\xb5+'` seems wrong — you want the actual characters, not backslashes and such, right? — Tom Zych, Jul 18 '18 at 07:42
@TomZych as per the utf-8 cheat sheet I was looking at, that string is the hex representation of the Greek 'mu' (aka the 'micro symbol' in metric). When I simply copied and pasted the actual symbol into the regex, it didn't work at all. — spinflip36, Jul 18 '18 at 17:24
Whoops, my mistake. Python 2.7 `re` docs: *Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser*, including `\x`. So your strings are correct. Are you sure your input file is encoded as UTF-8? I’ve tried your code and it works for me. — Tom Zych, Jul 18 '18 at 18:07

score 3 · Answer 1 · answered Jul 18 '18 at 18:47

If your input data is encoded as UTF-8, your code should work. Here’s a complete program that works for me. It assumes the input is UTF-8 and simply operates on the raw bytes, not converting to or from Unicode. Note that I removed the + from the end of each input regex; that would accept one or more of the last character, which you probably didn’t intend.

import re

def stripChars(s):
    s = (re.sub(r'\xc2\xb5', 'micro', s)) # micro
    s = (re.sub(r'\xc2\xb1', '+/-', s)) # plus-or-minus
    return s

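# The with statement closes both files even if an error occurs.
with open('data') as f_in, open('output', 'w') as f_out:
    for line in f_in:
        print(type(line))  # debug: shows that each line is a byte string
        line = stripChars(line)
        f_out.write(line)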
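
To see what the trailing + changes, here is a small check. It is written for Python 3, where byte patterns need the b prefix (an assumption, since the program above targets Python 2.7), and the sample bytes are made up for illustration.

```python
import re

# Hypothetical sample: a UTF-8 micro sign followed by a stray 0xB5 byte.
sample = b'5 \xc2\xb5\xb5m'

# With '+', the quantifier binds only to the last byte (0xB5), so the
# stray byte is swallowed into the same match.
with_plus = re.sub(rb'\xc2\xb5+', b'micro', sample)

# Without '+', exactly one two-byte sequence is replaced and the stray
# byte is left in place.
without_plus = re.sub(rb'\xc2\xb5', b'micro', sample)

print(with_plus)
print(without_plus)
```

In a mixed-encoding file, that stray 0xB5 could itself be a Latin1 micro sign, so silently dropping it is probably not what you want.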

If your data is encoded some other way (see for example this question for how to tell), this version will be more useful. You can specify any encoding for input and output. It decodes to internal Unicode on reading, acts on that when replacing, then encodes on writing.

import codecs
import re

encoding_in = 'iso8859-1'
encoding_out = 'ascii'

def stripChars(s):
    s = (re.sub(u'\u00B5', 'micro', s)) # micro
    s = (re.sub(u'\u00B1', '+/-', s)) # plus-or-minus
    return s

# The with statement closes both files even if encoding fails midway.
with codecs.open('data-8859', 'r', encoding_in) as f_in, \
     codecs.open('output', 'w', encoding_out) as f_out:
    for uline in f_in:
        uline = stripChars(uline)
        f_out.write(uline)

Note that it will raise an exception if it tries to write non-ASCII data with an ASCII encoding. The easy way to avoid this is to just write UTF-8, but then you may not notice uncaught characters. You can catch the exception and do something graceful. Or you can let the program crash and update it for the character(s) you’re missing.
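
A minimal sketch of the "catch the exception" option, written for Python 3 (an assumption; the program above is Python 2 style). The helper name to_ascii is hypothetical. It reports the first character that cannot be written as ASCII, so you know which replacement is still missing.

```python
def to_ascii(text):
    """Return (ascii_bytes, None) on success, or (None, bad_char) on failure."""
    try:
        return text.encode('ascii'), None
    except UnicodeEncodeError as exc:
        # exc.object is the string being encoded and exc.start is the
        # index of the first character that could not be encoded.
        return None, exc.object[exc.start]


for line in ['5 microm +/- 1', '20 \u00b0C']:
    encoded, bad_char = to_ascii(line)
    if bad_char is None:
        print(encoded)
    else:
        print('unhandled character: U+%04X' % ord(bad_char))
```

From here you could log the line and skip it, or collect the unhandled characters and extend stripChars to cover them.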

score 2 · Answer 2 · edited Jul 19 '18 at 07:11

OK, since you are using Python 2, you read the file as byte strings, and your code should successfully translate all UTF-8 encoded occurrences of µ (U+00B5) or ± (U+00B1).

This is consistent with what you later say:

my current code only catches the incorrect version, letting the correct one pass through

This is in fact exactly what should happen. Let us first look at what happens for µ. µ is u'\u00b5'; it is encoded in UTF-8 as '\xc2\xb5' and in Latin1 or cp1252 as '\xb5'. As 'Â' is U+00C2, its Latin1 or cp1252 code is 0xc2. That means that a µ character correctly encoded in UTF-8 will read as Âµ on a Windows-1252 system. And when it looks correct, it is because it is not UTF-8 encoded but Latin1 encoded.
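
The byte values above can be checked directly. This is a small sketch in Python 3 syntax (an assumption; in Python 2 the literals would need a u prefix), and the variable names are only for illustration.

```python
mu = '\u00b5'  # MICRO SIGN

utf8_bytes = mu.encode('utf-8')      # two bytes: 0xC2 0xB5
latin1_bytes = mu.encode('latin-1')  # one byte: 0xB5
cp1252_bytes = mu.encode('cp1252')   # same single byte as Latin1

# Reading the UTF-8 bytes with the wrong codec yields two characters:
# U+00C2 (A with circumflex) followed by U+00B5 (micro sign).
misread = utf8_bytes.decode('cp1252')

print(utf8_bytes, latin1_bytes, cp1252_bytes)
print(misread)
```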

It looks like you are trying to process a file in which some parts are UTF-8 encoded while others are Latin1 (or cp1252) encoded. You really should try to fix that in the system that collects the data, because it can cause trouble that is hard to recover from.
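
If you want to find out which lines are affected, one rough heuristic is to try decoding each line as UTF-8 and treat failures as Latin1/cp1252. This is a sketch in Python 3, not something taken from your data; the function name guess_encoding and the sample lines are hypothetical. A line that mixes both encodings would be reported as cp1252, and pure ASCII lines count as UTF-8.

```python
def guess_encoding(raw_line):
    """Return 'utf-8' if the bytes decode cleanly as UTF-8, else 'cp1252'."""
    try:
        raw_line.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        # A lone 0xB5 or 0xB1 byte is not valid UTF-8, so Latin1/cp1252
        # encoded symbols end up here.
        return 'cp1252'


# Hypothetical sample lines, one per encoding.
lines = [b'10 \xc2\xb5m\n', b'10 \xb5m\n']
for raw in lines:
    print(guess_encoding(raw), raw)
```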

The good news is that it can be fixed here because you only want to process two non-ASCII characters: replace the UTF-8 version as you already do, and then replace the Latin1 version in a second pass. The code could be (no need for regexes here):

def stripChars(string):
    outString = string.replace('\xc2\xb5','micro') #Metric 'micro (10^-6)' (Greek 'mu') letter in utf-8
    outString = outString.replace('\xb5','micro') #Metric 'micro (10^-6)' (Greek 'mu') letter in Latin1
    outString = outString.replace('\xc2\xb1','+/-') #Scientific 'Plus-Minus' symbol in utf-8
    outString = outString.replace('\xb1','+/-') #Scientific 'Plus-Minus' symbol in Latin1
    return outString
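
The order of the two passes matters: the UTF-8 sequence must be replaced first, otherwise the Latin1 pass eats its second byte and leaves a stray 0xC2 behind. A small sketch in Python 3 bytes syntax (an assumption; the function above is Python 2), with hypothetical function names and only the micro sign shown:

```python
def strip_micro(data):
    # UTF-8 form first, then the single-byte Latin1 form.
    data = data.replace(b'\xc2\xb5', b'micro')
    return data.replace(b'\xb5', b'micro')


def strip_micro_wrong_order(data):
    # Latin1 form first: this also hits the second byte of the UTF-8 form.
    data = data.replace(b'\xb5', b'micro')
    return data.replace(b'\xc2\xb5', b'micro')


# Hypothetical sample with one UTF-8 and one Latin1 micro sign.
sample = b'10 \xc2\xb5m and 10 \xb5m'
print(strip_micro(sample))
print(strip_micro_wrong_order(sample))
```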

For reference, Latin1 (AKA ISO-8859-1) maps each byte to the Unicode character with the same value, for all Unicode characters below 256. Windows code page 1252 (cp1252 in Python) is a Windows variation of the Latin1 encoding in which some bytes normally unused in Latin1 are used for characters with higher code points. For example, € (U+20AC) is encoded as '\x80' in cp1252, while it does not exist at all in Latin1.
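
The difference between the two encodings can be verified with a few lines. This is a sketch in Python 3 syntax (an assumption), and the variable names are only illustrative.

```python
euro = '\u20ac'  # EURO SIGN

# cp1252 assigns the euro sign to byte 0x80.
cp1252_euro = euro.encode('cp1252')

# Latin1 has no euro sign at all, so encoding fails.
try:
    euro.encode('latin-1')
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

# Latin1 maps every byte value 0..255 to the code point with the same number.
all_bytes = bytes(range(256))
latin1_is_identity = all_bytes.decode('latin-1') == ''.join(chr(i) for i in range(256))

print(cp1252_euro, latin1_has_euro, latin1_is_identity)
```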
