All - I'm trying to run some regex substitutions over a bunch of science data, converting certain special symbols into ASCII-friendly strings. For example, I want to replace 'µ' (UTF-8 \xc2\xb5) with the string 'micro', and '±' (UTF-8 \xc2\xb1) with '+/-'. I cooked up a Python script to do this, which looks like this:
import re

def stripChars(string):
    outString = re.sub(r'\xc2\xb5+', 'micro', string)  # Metric 'micro (10^-6)' (Greek 'mu') letter
    outString = re.sub(r'\xc2\xb1+', '+/-', outString)  # Scientific 'Plus-Minus' symbol
    return outString
However, for these two specific characters, I'm getting strange results. I dug into it a bit, and it looks like I'm suffering from the bug described here, in which certain characters come out wrong because they are UTF-8 data being interpreted as Windows-1252 (or ISO 8859-1).
I grepped the relevant data, and found that it returns the erroneous result there as well (e.g. the 'µ' appears as 'Âµ'). However, elsewhere in the same data set there are entries in which the same symbol is displayed correctly. This may be due to a bug in the system which collected the data in the first place. The real weirdness is that my current code seems to catch only the incorrect version, letting the correct one pass through.
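To illustrate what I think is going on, here is a minimal sketch that reproduces the garbling. It assumes Python 3 and that the wrong codec is Windows-1252 (`cp1252`); I have not confirmed this is what the collecting system actually did.

```python
# Illustration only: decode the UTF-8 bytes of 'µ' with the wrong codec.
micro = "\u00b5"                        # 'µ' (MICRO SIGN)
utf8_bytes = micro.encode("utf-8")      # two bytes: 0xC2 0xB5
garbled = utf8_bytes.decode("cp1252")   # each byte becomes its own character

print(repr(utf8_bytes))  # b'\xc2\xb5'
print(garbled)           # Âµ
```

If that is right, the "incorrect" entries contain two characters ('Â' followed by 'µ') once the wrong codec has been applied, while the "correct" ones contain just 'µ'.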
In any case, I'm really stuck on how to proceed. I need to be able to come up with a series of regex substitutions which will catch both the correct and incorrect versions of these characters, but the identifier for the correct version is failing in this case.
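For reference, this is roughly the shape of what I'm after, as an untested-on-real-data sketch. It assumes Python 3 and that the input has already been decoded to `str`; `strip_chars_sketch` and `REPLACEMENTS` are names I made up for illustration. Each pattern makes the stray 'Â' optional so that both the garbled and the correct form match.

```python
import re

# Hypothetical table: (pattern, ASCII replacement).
# '\u00c2?' optionally matches the stray 'Â' left behind by the mis-decoding.
REPLACEMENTS = [
    (re.compile("\u00c2?\u00b5"), "micro"),  # 'µ' or 'Âµ'
    (re.compile("\u00c2?\u00b1"), "+/-"),    # '±' or 'Â±'
]

def strip_chars_sketch(text):
    """Replace both the correct and the garbled forms of each symbol."""
    for pattern, replacement in REPLACEMENTS:
        text = pattern.sub(replacement, text)
    return text

print(strip_chars_sketch("5 \u00b5m \u00b1 2"))          # correct forms
print(strip_chars_sketch("5 \u00c2\u00b5m \u00c2\u00b1 2"))  # garbled forms
```

One caveat I can see: this would also swallow a legitimate 'Â' that happens to sit directly before a 'µ' or '±', which I assume does not occur in my data.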
I must admit, I'm still fairly new to programming, and anything more than the most basic regex is still like black magic to me. This problem seems a bit more intractable than any I've had to tackle before, which is why I'm bringing it here to get some more eyes on it.
Thanks!