Currently I'm writing a script that searches a .txt file for any measurement in micrometers. These text documents commonly use the mu symbol "µ", which is where the fun begins.

import io
import re

p = re.compile(ur'\d+\.\d+\s?-?[uUµ][mM]')
with io.open("text_to_be_searched.txt", encoding="utf-8") as f:
    text = f.read()  # io.open decodes to unicode while reading

match = p.findall(text)
if not match:
    print "No matches found"
else:
    for i in range(len(match)):
        match[i] = match[i].replace("\n", "") #cleans up line breaks
        print match[i] #prints correctly

print match #prints incorrectly

In the above code, iterating through the list prints the values nicely to the console.

1.06 µm
10.6 µm
3.8 µm

However, if I try to print the list, it displays them incorrectly.

[u'1.06 \xb5m', u'10.6 \xb5m', u'3.8 \xb5m']

Why does the print command display the iterated values correctly but the entire list incorrectly?

EDIT: Thanks to @BoarGules, and others.

I found that

match[i] = match[i].replace("µ", "u")

returned errors:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xb5' in position 5: ordinal not in range(128)

Python complains because the Unicode symbol isn't within the original 128 ASCII characters, as explained on Joel on Software.

But by simply telling it that the symbol was unicode:

match[i] = match[i].replace(u"µ", "u")

We get a more readable result.

[u'1.06 um', u'10.6 um', u'3.8 um']

It's a step in the right direction at least.
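
For anyone who wants to try this, here is a minimal sketch of the same replacement. The helper name `to_ascii_unit` is hypothetical, and I spell the micro sign as the escape `\u00b5` so the snippet does not depend on the source file's encoding. It should run on both Python 2 and Python 3:

```python
# Hypothetical helper: swap the micro sign (U+00B5) for a plain ASCII "u".
MICRO_SIGN = u"\u00b5"  # the same character as the literal µ


def to_ascii_unit(measurement):
    # Both arguments to replace() are unicode strings, so Python 2 has no
    # reason to attempt an implicit ASCII conversion.
    return measurement.replace(MICRO_SIGN, u"u")


matches = [u"1.06 \u00b5m", u"10.6 \u00b5m", u"3.8 \u00b5m"]
cleaned = [to_ascii_unit(m) for m in matches]
print(cleaned)
```

On Python 2 the printed list still carries the `u` prefixes, as in the result above; on Python 3 the prefixes are gone.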

tameless
  • Does this answer your question? https://stackoverflow.com/a/21968640/6814540 – W2a Jun 08 '17 at 19:47
  • It gave me something to go on. `print sys.stdout.encoding` gave me "cp1252". I'm not sure why, since I have `# -*- coding: utf-8 -*-` in the header. – tameless Jun 08 '17 at 20:08
  • All that `# -*- coding: utf-8 -*-` does is make Python interpret the source file as UTF-8. I guess the cmd's encoding is cp1252 in your case. Try searching for how to change it. Anyway, my best guess is that Python cannot interpret a list as unicode. – W2a Jun 08 '17 at 20:18
  • The list is *not* displayed incorrectly. When you print a list, its `repr()` form is used, including the `repr()` form of all of its elements. In Python 2, unicode strings represent non-ASCII characters as escape sequences. Try `print repr(u'1 µm')` – this is intentional. If you don't like the escapes, switch to Python 3. – lenz Jun 08 '17 at 20:39
  • @J.Doe, cmd is a shell. It doesn't have an encoding, and it's not related to the problem. You're talking about the console that python.exe is attached to. Python 3.6 added support for writing Unicode to the console. For older versions and Python 2, you can install and enable the `win_unicode_console` module. – Eryk Sun Jun 08 '17 at 21:55

1 Answer

This is not really incorrect:

[u'1.06 \xb5m', u'10.6 \xb5m', u'3.8 \xb5m']

It is the way you would have to type the list manually into your program. If you tried to do this:

['1.06 µm','10.6 µm','3.8 µm']

you would get a source encoding error (unless you put an encoding comment at the top of your program).

It is just a different representation of the same data. Recall that a list is a data structure. You can't actually print it as it is in memory because that is just a bunch of bytes. It has to be turned into a string, something resembling program code, before it can be printed. The interpreter does a generic job: it uses the repr() of each element, so it has to display the difference between normal str-type strings and unicode strings (hence the u'...') and it has to escape characters outside of the ASCII character set. If it didn't do that it would be much less useful.
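
To see this for yourself, here is a small sketch. The list contents are just sample values, and `\u00b5` is the escape for µ so the snippet does not depend on the file encoding. As far as I can tell it behaves the same on Python 2 and Python 3:

```python
items = [u"1.06 \u00b5m", u"10.6 \u00b5m"]

# The text shown for a list is assembled from the repr() of each element.
from_reprs = "[" + ", ".join(repr(x) for x in items) + "]"
print(str(items) == from_reprs)  # prints True

# The repr() of an element is not the element itself: it adds quotes
# (and, in Python 2, the u prefix and \x escapes).
print(repr(items[0]) == items[0])  # prints False
```

So printing a single element and printing the whole list go through two different conversions, which is why they look different.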

If you have fixed ideas about how the list should be displayed then you need to format it yourself for output.
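
One possible way to do that, as a rough sketch: join the strings yourself and print the result. The function name `format_measurements` and the separator are my own choices for illustration, not anything the language prescribes:

```python
def format_measurements(values, sep=u", "):
    # Join the strings themselves rather than their repr() forms, so no
    # quotes, u prefixes, or \x escapes end up in the result.
    return sep.join(values)


matches = [u"1.06 \u00b5m", u"10.6 \u00b5m", u"3.8 \u00b5m"]
line = format_measurements(matches)
print(line)
```

Whether the µ then shows up properly depends on the console being able to encode it, which is a separate matter from how the list is formatted.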

BoarGules
  • I found that `match[i] = match[i].replace("µ", "u")` returned errors: `UnicodeEncodeError: 'ascii' codec can't encode character u'\xb5' in position 5: ordinal not in range(128)`. But by simply telling it that the symbol was unicode, `match[i] = match[i].replace(u"µ", "u")`, we get a more readable result: `[u'1.06 um', u'10.6 um', u'3.8 um']`. It's a step in the right direction at least. – tameless Jun 09 '17 at 18:48
  • Stop trying to make `print(mylist)` do what you expect. It won't, at least not always. You have to take control of the formatting yourself if you don't like the default. Try this: `print ', '.join(e for e in mylist)` – BoarGules Jun 09 '17 at 20:51