python: replacing a regular character with an unicode one using regex re.sub

Question

I have a simple problem that is driving me crazy, and it seems to be due to how Python handles Unicode characters.

I have a LaTeX table stored on disk (very similar to http://www.jwe.cc/downloads/table.tex), and I want to apply a regex to it so that hyphens - (\u2212) are replaced by en dashes – (Alt+0150 or \u2013).

I am using the following function, which performs two different regex substitutions.

import re
import glob

def mychanger(fileName):
  with open(fileName,'r') as file:
    str = file.read()
    str = str.decode("utf-8")
    str = re.sub(r"((?:^|[^{])\d+)\u2212(\d+[^}])","\\1\u2013\\2", str).encode("utf-8")
    str = re.sub(r"(^|[^0-9])\u2212(\d+)","\\1\u2013\\2", str).encode("utf-8")
  with open(fileName,'wb') as file:
    file.write(str)

myfile = glob.glob("C://*.tex")
for file in myfile: mychanger(file)

Unfortunately, this does not change anything.

It does work if I use a non-Unicode character like $ instead of \u2013, which suggests the regex itself is correct.

I am lost here, I tried using re.sub(ur"((?:^|[^{])\d+)\u2212(\d+[^}])","\\1\u2013\\2", str).encode("utf-8") but it still does not change anything.

What is wrong here? Thanks!

@cᴏʟᴅsᴘᴇᴇᴅ I get `TypeError: coercing to Unicode: need string or buffer, list found` — ℕʘʘḆḽḘ, Jul 10 '17 at 13:18
Noobie, why are you passing a list? It seems the code you post here isn't the code you're running. — cs95, Jul 10 '17 at 13:39
@cᴏʟᴅsᴘᴇᴇᴅ please see update. may come from `glob` — ℕʘʘḆḽḘ, Jul 10 '17 at 13:44
@Qeek still getting this `TypeError: coercing to Unicode: need string or buffer, list found`. — ℕʘʘḆḽḘ, Jul 10 '17 at 13:49
Try to encode the string only when all substitutions are done. — Qeek, Jul 10 '17 at 14:24
@Qeek, thanks! but I tried that, namely `str = str.decode("utf-8") str = re.sub(r"((?:^|[^{])\d+)" + "\u2212" + r"(\d+[^}])",u"\\1\u2013\\2", str).encode("utf-8")` — ℕʘʘḆḽḘ, Jul 10 '17 at 14:39
@Qeek still no changes. Can you possibly try using the latex table available on the link I posted? That stuff is driving me nuts!!!!! :( thanks again for helping — ℕʘʘḆḽḘ, Jul 10 '17 at 14:40
_DashPunctuation_ hyphen-minus is `-` (`\u002D`) differs from _MathSymbol_ Minus Sign `−` (`\u2212`) . — JosefZ, Jul 10 '17 at 14:54
@JosefZ still no change. does that work for you? if yes, can you please post a solution? — ℕʘʘḆḽḘ, Jul 10 '17 at 15:31

Alastair McCormack · Accepted Answer · 2017-07-10T17:33:23.487

Your example file actually contains HYPHEN-MINUS (U+002D) not U+2212.
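One way to check which dash a file really contains is to ask Python for the Unicode name of each dash-like character. The following is a minimal sketch; the helper name `describe_dashes` and the list of candidate characters are illustrative assumptions, not part of the original code.

```python
import unicodedata

# Dash-like characters worth checking (an illustrative, non-exhaustive list):
# HYPHEN-MINUS, HYPHEN, FIGURE DASH, EN DASH, EM DASH, MINUS SIGN
DASH_LIKE = u"\u002d\u2010\u2012\u2013\u2014\u2212"

def describe_dashes(text):
    """Return (code point, Unicode name) for each distinct dash-like character in text."""
    seen = sorted(set(ch for ch in text if ch in DASH_LIKE))
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in seen]

# Hypothetical table row typed with the ordinary keyboard hyphen
sample = u"0.12 & -0.34 & 5-7 \\\\"
print(describe_dashes(sample))  # [('U+002D', 'HYPHEN-MINUS')]
```

Running this on the decoded contents of the .tex file should show whether the regex needs to target U+002D or U+2212.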

Even if it did contain the right characters, you're hitting all the n00b issues of Python 2.x Unicode:

Decoding and encoding inline. In fact you encode twice!
Use of a Unicode escape (\u2212) in a string that is not a Unicode string
Unnecessary use of r raw modifier
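The points above can be illustrated with a small in-memory sketch (the sample text and variable names are made up for illustration). It shows that a raw, non-Unicode literal keeps `\u2212` as six separate characters, and that decoding once on input and encoding once on output is enough.

```python
import re

# In a raw (non-Unicode) literal the escape is not interpreted:
raw_literal = r"\u2212"      # backslash, 'u', '2', '2', '1', '2'
unicode_literal = u"\u2212"  # the single character MINUS SIGN
print(len(raw_literal), len(unicode_literal))  # 6 1

# Decode once at the input boundary, work on text, encode once at the output boundary.
raw_bytes = u"3\u22125".encode("utf-8")   # stands in for bytes read from a file
text = raw_bytes.decode("utf-8")
text = re.sub(u"(\\d)\u2212(\\d)", u"\\1\u2013\\2", text)
result = text.encode("utf-8")
```

Calling `.encode("utf-8")` a second time on `result` is where the question's code goes wrong: the value is already bytes at that point, so on Python 2 it triggers an implicit ASCII decode, and on Python 3 bytes objects have no `encode` method at all.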

My advice is to remove all decodes and encodes and allow Python to do it for you. The io module backports the Python 3.x behaviour and decodes files for you. I've also renamed str to my_str to avoid conflicts with Python's own str class.

import re
import glob
import io

def mychanger(fileName):
    with io.open(fileName,'r', encoding="utf-8") as file:
        my_str = file.read()

        my_str = re.sub(u"((?:^|[^{])\d+)\u002d(\d+[^}])", u"\\1\u2013\\2", my_str)
        my_str = re.sub(u"(^|[^0-9])\u002d(\d+)",          u"\\1\u2013\\2", my_str)

    with io.open(fileName, 'w', encoding="utf-8") as file:
        file.write(my_str)

myfile = glob.glob("C://*.tex")

for file in myfile: mychanger(file)
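For readers on Python 3, where `str` is already Unicode, the same two substitutions can be tried on an in-memory string before touching any files. This is only a sketch: the function name `replace_hyphens` and the sample input are hypothetical, and the patterns are copied from the code above with raw-string literals.

```python
import re

EN_DASH = u"\u2013"

def replace_hyphens(text):
    """Apply the two substitutions from the answer to an already-decoded string."""
    # A hyphen between two digit runs; the pattern also requires one more
    # character (anything but "}") after the second run.
    text = re.sub(r"((?:^|[^{])\d+)-(\d+[^}])", r"\1" + EN_DASH + r"\2", text)
    # A hyphen that is not preceded by a digit and is followed by digits.
    text = re.sub(r"(^|[^0-9])-(\d+)", r"\1" + EN_DASH + r"\2", text)
    return text

print(replace_hyphens(u"5-7 and -3"))  # 5–7 and –3
```

Keeping the regex logic in a function that takes and returns text makes it easy to test separately from the file reading and writing.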

For a thorough explanation of Python 2.x and Unicode, see How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte"

Thanks! Kind of make sense I hit the noobie issues dont you think :D — ℕʘʘḆḽḘ, Jul 10 '17 at 17:25
It's kinda unavoidable in Python 2 when you're copy-pasting code :) Oh and when the name fits ;) — Alastair McCormack, Jul 10 '17 at 17:26
