Currently I'm writing a script that searches a .txt file for any measurement in micrometers. These text documents commonly use the mu symbol "µ" which is where the fun begins.
import re

p = re.compile(r'\d+\.\d+\s?\-?[uUµ][mM]')
file = open("text_to_be_searched.txt").read()
file = file.decode("utf-8")
match = re.findall(p, file)
if match == []:
    print "No matches found"
else:
    for i in range(len(match)):
        match[i] = match[i].replace("\n", "")  # cleans up line breaks
        print match[i]  # prints correctly
    print match  # prints incorrectly
In the above code, iterating through the list prints the values nicely to the console.
1.06 µm
10.6 µm
3.8 µm
However, if I try to print the list, it displays them incorrectly.
[u'1.06 \xb5m', u'10.6 \xb5m', u'3.8 \xb5m']
Why does the print command display the iterated values correctly but the entire list incorrectly?
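A minimal sketch of what seems to be going on, assuming the usual Python behavior that printing a list builds its text from the repr() of each element, while printing a single string uses the string itself. In Python 2 the repr() of a unicode string escapes non-ASCII characters, which would account for the \xb5. The sample values and the format_list helper are hypothetical, and the code is written to run on both Python 2 and 3:

```python
from __future__ import print_function  # same print syntax on Python 2 and 3

# Sample data standing in for the `match` list; \xb5 is the micro sign.
values = [u"1.06 \xb5m", u"10.6 \xb5m", u"3.8 \xb5m"]

# str() of a list is assembled from repr() of each element, not str().
as_list = str(values)
from_reprs = "[" + ", ".join(repr(v) for v in values) + "]"
print(as_list == from_reprs)  # True


def format_list(items):
    """Hypothetical helper: render a list using the items' own text."""
    return u"[" + u", ".join(items) + u"]"


# Joining the strings directly avoids the escaped repr() form.
print(format_list(values))
```

The list still holds the right characters in either case. Only the way it is displayed differs.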
EDIT: Thanks to @BoarGules, and others.
I found that
match[i] = match[i].replace("µ", "u")
returned errors:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb5' in position 5: ordinal not in range(128)
Python complains because the µ symbol falls outside the 128 characters of ASCII, as explained on Joel on Software.
But by simply telling Python that the symbol is Unicode, using the u prefix:
match[i] = match[i].replace(u"µ", "u")
we get a more readable result:
[u'1.06 um', u'10.6 um', u'3.8 um']
It's a step in the right direction at least.
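One possible refinement, sketched under the assumption that some files might use the Greek letter mu (U+03BC) rather than the micro sign (U+00B5), since the two look identical but are distinct characters. The to_ascii_units helper and the sample list are hypothetical, and the regex character class would also need the Greek mu added for such values to be matched in the first place:

```python
from __future__ import print_function  # same print syntax on Python 2 and 3

MICRO_SIGN = u"\xb5"    # U+00B5 MICRO SIGN
GREEK_MU = u"\u03bc"    # U+03BC GREEK SMALL LETTER MU


def to_ascii_units(text):
    """Hypothetical helper: replace either mu character with a plain 'u'."""
    return text.replace(MICRO_SIGN, u"u").replace(GREEK_MU, u"u")


# Sample data: the second entry uses the Greek mu instead of the micro sign.
match = [u"1.06 \xb5m", u"10.6 \u03bcm", u"3.8 \xb5m"]
cleaned = [to_ascii_units(m) for m in match]
print(cleaned)
```

Keeping every literal as a Unicode string (the u prefix) means Python 2 never has to fall back on its implicit ASCII conversion.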