
I have a simple problem that is driving me crazy, and it seems to be due to how Python handles Unicode characters.

I have a LaTeX table stored on my disk (very similar to http://www.jwe.cc/downloads/table.tex), and I want to apply some regexes to it so that hyphens - (\u2212) are replaced by en dashes (Alt+0150 or \u2013).

I am using the following function, which performs two different regex substitutions.

import re
import glob

def mychanger(fileName):
  with open(fileName,'r') as file:
    str = file.read()
    str = str.decode("utf-8")
    str = re.sub(r"((?:^|[^{])\d+)\u2212(\d+[^}])","\\1\u2013\\2", str).encode("utf-8")
    str = re.sub(r"(^|[^0-9])\u2212(\d+)","\\1\u2013\\2", str).encode("utf-8")
  with open(fileName,'wb') as file:
    file.write(str)

myfile = glob.glob("C://*.tex")
for file in myfile: mychanger(file)  

Unfortunately, this does not change anything.

It does work if I use a non-Unicode character like $ instead of \u2013, which suggests the regex itself is correct.

I am lost here. I also tried re.sub(ur"((?:^|[^{])\d+)\u2212(\d+[^}])","\\1\u2013\\2", str).encode("utf-8"), but it still does not change anything.

What is wrong here? Thanks!

ℕʘʘḆḽḘ

1 Answer


Your example file actually contains HYPHEN-MINUS (U+002D) not U+2212.
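One way to check which dash-like characters a piece of text really contains is to count them and print their Unicode names. This is only a sketch: the helper name `count_dashes`, the set of candidate characters, and the sample string are illustrative assumptions, not taken from the file in the question.

```python
import unicodedata
from collections import Counter

# Candidate characters that all look like a short horizontal stroke
# (illustrative selection, not an exhaustive list).
DASH_LIKE = {u"\u002d", u"\u2010", u"\u2012", u"\u2013", u"\u2014", u"\u2212"}

def count_dashes(text):
    """Count each dash-like character found in an already-decoded string."""
    return Counter(ch for ch in text if ch in DASH_LIKE)

# Hypothetical sample containing one HYPHEN-MINUS and one MINUS SIGN.
sample = u"1990-2000 & \u22125"
for ch, n in sorted(count_dashes(sample).items()):
    print("U+%04X %s: %d" % (ord(ch), unicodedata.name(ch), n))
```

Running the same function on the decoded contents of the real .tex file shows which code point the regex has to target.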

Even if it did contain the right characters, you're hitting all the n00b issues of Python 2.x Unicode:

  1. Decoding and encoding inline. In fact you encode twice!
  2. Use of a Unicode escape (\u2212) outside a Unicode string
  3. Unnecessary use of the r raw-string modifier
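Point 2 can be illustrated with a small sketch. In Python 2 a plain "..." literal is a byte string, so \u2212 inside it is not an escape at all; the example below uses explicit bytes literals so that it behaves the same way on Python 3. The variable names are illustrative only.

```python
# A Unicode string: \u2212 is an escape for a single code point.
minus = u"\u2212"
# Its UTF-8 encoding is three bytes.
utf8 = minus.encode("utf-8")
# In a byte string, \u2212 is just six ASCII bytes: backslash, u, 2, 2, 1, 2.
not_escaped = br"\u2212"

print(len(minus))           # length of the Unicode string
print(repr(utf8))           # the encoded bytes
print(len(not_escaped))     # length of the byte string
print(not_escaped in utf8)  # whether the byte pattern can be found in the data
```

A byte pattern spelled this way cannot be found in UTF-8 encoded data, which is consistent with the substitution silently doing nothing.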

My advice is to remove all decodes and encodes and allow Python to do it for you. The io module backports the Python 3.x behaviour and decodes files for you. I've also renamed str to my_str to avoid conflicts with Python's own str class.

import re
import glob
import io

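def mychanger(fileName):
    # io.open decodes the file to Unicode on read
    with io.open(fileName,'r', encoding="utf-8") as file:
        my_str = file.read()

        # Unicode patterns: U+002D is HYPHEN-MINUS, U+2013 is EN DASH
        my_str = re.sub(u"((?:^|[^{])\d+)\u002d(\d+[^}])", u"\\1\u2013\\2", my_str)
        my_str = re.sub(u"(^|[^0-9])\u002d(\d+)",          u"\\1\u2013\\2", my_str)

    # io.open encodes the Unicode string back to UTF-8 on write
    with io.open(fileName, 'w', encoding="utf-8") as file:
        file.write(my_str)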

myfile = glob.glob("C://*.tex")

for file in myfile: mychanger(file)
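To try the two substitutions without touching any files, they can be pulled into a function that works on an already-decoded string. This is a sketch under some assumptions: the name `replace_hyphens` and the sample input are hypothetical, and the patterns are written as raw strings with a literal "-" (equivalent to \u002d), which is safe here because the patterns themselves contain no \u escape.

```python
import re

EN_DASH = u"\u2013"

def replace_hyphens(text):
    """Apply the two substitutions above to an already-decoded string."""
    # Number ranges such as 1990-2000, skipping ranges wrapped in braces
    text = re.sub(r"((?:^|[^{])\d+)-(\d+[^}])", u"\\1" + EN_DASH + u"\\2", text)
    # A hyphen-minus directly before a number and not preceded by a digit
    text = re.sub(r"(^|[^0-9])-(\d+)", u"\\1" + EN_DASH + u"\\2", text)
    return text

# Hypothetical table fragment: a range, a negative number, a LaTeX row end.
sample = u"1990-2000 & -5 \\\\"
result = replace_hyphens(sample)
print(repr(result))
```

Keeping the string transformation separate from the file handling makes it easier to confirm the regexes do what is intended before rewriting the .tex files in place.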

For a thorough explanation of Python 2.x and Unicode, see How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte"

Alastair McCormack