regex: replace hyphens with en-dashes with re.sub

Question

I am using a small function to loop over files so that any hyphens - get replaced by en-dashes – (alt + 0150).

The function I use adds some regex flavor to a solution from a related question (how to replace a character INSIDE the text content of many files automatically?)

def mychanger(fileName):
  with open(fileName,'r') as file:
    str = file.read()
    str = str.decode("utf-8")
    str = re.sub(r"[^{]{1,4}(-)","–", str).encode("utf-8")
  with open(fileName,'wb') as file:
    file.write(str)

I used the regular expression [^{]{1,4}(-) because the search is actually performed on latex regression tables and I only want to replace the hyphens that occur around numbers.

To be clear: I want to replace all hyphens EXCEPT in cases where we have genuine latex code such as \cmidrule(lr){2-4}.

In this case there is a { close to the hyphen (within 3-4 characters at most) and to its left. This hyphen must not be changed into an en-dash, otherwise the LaTeX code will break.
I think the fact that the { sits to the left of the hyphen matters for writing the exception correctly in regex. Indeed, in a regression table you can have things like -0.062\sym{***} (that is, a { shortly to the right of the hyphen), and in that case I do want to replace the hyphen.

A typical line in my table is

variable    &   -2.061\sym{***}&       4.032\sym{**}   &       1.236         \\
            &      (-2.32)         &   (-2.02)         &      (-0.14)

However, my regex does not appear to be correct. For instance, (-1.2) gets its hyphen replaced but loses the opening parenthesis, giving –1.2).

What is the problem here? Thanks!

Please show us all instances where you _don't_ want a replacement to happen. — Tim Biegeleisen, Jul 10 '17 at 01:54
Can you provide a sample of your data and where it fails to match/matches erroneously? — zwer, Jul 10 '17 at 01:54
Hi thanks! @TimBiegeleisen @zwer can only think of situations like `\cmidrule(lr){2-8}` where there is a `{` close to the hyphen and to the left of it. Indeed, in a regression table you can have things like `-0.062\sym{***}` — ℕʘʘḆḽḘ, Jul 10 '17 at 01:57
@Noobie I attempted an answer below. Not clean at all, but let's see if we can iterate on this until we get something which solves your actual problem. — Tim Biegeleisen, Jul 10 '17 at 02:43

Tim Biegeleisen · Accepted Answer · 2017-07-10T02:50:20.050

2

I can offer the following two step replacement:

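import re

# "$" stands in for the en-dash so the replacements are easy to spot.
# Raw strings keep the LaTeX backslashes and regex escapes literal.
str = r"-1 Hello \cmidrule(lr){2-4} range 1-5 other stuff a-5"
# Step 1: ranges like 1-5 that are not inside curly braces
str = re.sub(r"((?:^|[^{])\d+)-(\d+[^}])", r"\1$\2", str)
# Step 2: a hyphen in front of a number, preceded by a non-digit or the start
str = re.sub(r"(^|[^0-9])-(\d+)", r"\1$\2", str)
print(str)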

The first replacement targets all ranges which are not of the LaTeX form {1-9}, i.e. are not contained within curly braces. The second replacement targets all numbers preceded by a non-digit character or by the start of the string.
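For readers who want en-dashes rather than the "$" placeholder, here is a minimal Python 3 sketch of the same two substitutions. The helper name `hyphens_to_en_dashes` and the sample string are made up for illustration; the patterns are the ones from this answer, and the en-dash is written as the escape `\u2013` so the source file's own encoding does not matter.

```python
import re

EN_DASH = "\u2013"  # U+2013 EN DASH

def hyphens_to_en_dashes(text):
    """Apply the answer's two substitutions, emitting an en-dash instead of "$"."""
    # Step 1: ranges such as 1-5 that are not wrapped in curly braces.
    text = re.sub(r"((?:^|[^{])\d+)-(\d+[^}])",
                  r"\g<1>" + EN_DASH + r"\g<2>", text)
    # Step 2: a hyphen directly before a number, preceded by a non-digit
    # character or by the start of the string.
    text = re.sub(r"(^|[^0-9])-(\d+)",
                  r"\g<1>" + EN_DASH + r"\g<2>", text)
    return text

# Hypothetical sample mixing LaTeX code with table-like values.
sample = r"-1 Hello \cmidrule(lr){2-4} range 1-5 (-2.32) -0.062\sym{***}"
result = hyphens_to_en_dashes(sample)
print(result)
```

In this sample the `{2-4}` inside `\cmidrule` keeps its hyphen, while the hyphens in front of numbers and inside the bare range are replaced. Other layouts (for example a range at the very end of the text) may need the patterns adjusted.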

edited Jul 10 '17 at 02:50

answered Jul 10 '17 at 02:42

Tim Biegeleisen

amazing!!, but does that work with en-dashes? (not dollar signs) ultimately I want to have en-dashes instead of hyphens. Thanks!!! – ℕʘʘḆḽḘ Jul 10 '17 at 02:50
@Noobie I couldn't get the demo working with en dashes due to some encoding issue. Just replace dollar sign with en dash and it should work. Anyway for the purposes of a demo, I think dollar sign makes the replacements clearer and easier to read. – Tim Biegeleisen Jul 10 '17 at 02:51
let me try again. updated my question to provide more details on the text input. thanks again! – ℕʘʘḆḽḘ Jul 10 '17 at 02:53
I am getting unicode errors `UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 807: ordinal not in range(128)` – ℕʘʘḆḽḘ Jul 10 '17 at 02:57
you can actually find an example table here http://www.jwe.cc/downloads/table.tex this one does not contain the exception but otherwise all the tables look alike – ℕʘʘḆḽḘ Jul 10 '17 at 02:58
This is a Python encoding problem, and I won't be of much help there. – Tim Biegeleisen Jul 10 '17 at 02:58
maybe using the unicode representation? u2212 is unicode character for minus and u2013 for en-dash. please look at the last question here https://stackoverflow.com/questions/44996829/how-to-replace-a-character-inside-the-text-content-of-many-files-automatically – ℕʘʘḆḽḘ Jul 10 '17 at 03:02
I tried something like `str = re.sub(r"((?:^|[^{])\d+)\u2212(\d+[^}])","\\1\u2013\\2", str).encode("utf-8")` but this does not seem to be the correct syntax.... – ℕʘʘḆḽḘ Jul 10 '17 at 03:04
@Noobie See here: https://stackoverflow.com/questions/393843/python-and-regular-expression-with-unicode – Tim Biegeleisen Jul 10 '17 at 04:11
@Noobie You need to put some effort in here. I gave you what appears to be a valid pattern, you can handle the rest. – Tim Biegeleisen Jul 10 '17 at 10:59
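
On the encoding error discussed in these comments: byte 0xe2 is the first byte of a UTF-8 encoded en-dash, so one likely cause is Python 2 implicitly decoding a byte string as ASCII, for example when `.encode("utf-8")` is called on text that has already been encoded. A minimal sketch, assuming Python 3 and UTF-8 files, that keeps everything as text and only encodes on write; the function name `replace_hyphens_in_file` and the temporary `table.tex` are hypothetical:

```python
import re
import tempfile
from pathlib import Path

EN_DASH = "\u2013"  # U+2013 EN DASH

def replace_hyphens_in_file(path, encoding="utf-8"):
    """Rewrite one file in place using the two patterns from the answer."""
    path = Path(path)
    # Decode once on read; from here on we only handle str, never bytes.
    text = path.read_text(encoding=encoding)
    text = re.sub(r"((?:^|[^{])\d+)-(\d+[^}])",
                  r"\g<1>" + EN_DASH + r"\g<2>", text)
    text = re.sub(r"(^|[^0-9])-(\d+)",
                  r"\g<1>" + EN_DASH + r"\g<2>", text)
    # Encode once on write.
    path.write_text(text, encoding=encoding)

# Small demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "table.tex"
    table.write_text("\\cmidrule(lr){2-4}\n & (-2.32) & -0.062\\sym{***} \\\\\n",
                     encoding="utf-8")
    replace_hyphens_in_file(table)
    print(table.read_text(encoding="utf-8"))
```

Decoding on read and encoding on write, with no `.decode`/`.encode` calls in between, avoids the implicit ASCII conversion altogether.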

score 1 · Answer 2 · answered Jul 10 '17 at 02:01

re.sub replaces the entire match. In this case that includes the non-{ characters preceding your -. You can wrap that part in parentheses to create group \1 and include it in your substitution (you also don't need parentheses around the -):

re.sub(r"([^{]{1,4})-",r"\1–", str)
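
A small sketch of the difference, using "X" as a visible stand-in for the en-dash (as in the comments below). The variable names are made up for illustration. It also shows the two limitations raised in the comments: the pattern still rewrites the hyphen inside `{2-4}`, and it skips a hyphen at the very start of the string because at least one character must precede it.

```python
import re

sample = "(-1.2)"

# Original pattern: the whole match "(-" is replaced, so "(" is lost.
whole_match = re.sub(r"[^{]{1,4}(-)", "X", sample)

# Grouped prefix: "\1" puts the preceding characters back.
kept_prefix = re.sub(r"([^{]{1,4})-", r"\1X", sample)

print(whole_match)  # X1.2)
print(kept_prefix)  # (X1.2)

# Limitation 1: "2" is a non-{ character, so the LaTeX range is still changed.
latex = re.sub(r"([^{]{1,4})-", r"\1X", r"\cmidrule(lr){2-4}")
print(latex)  # \cmidrule(lr){2X4}

# Limitation 2: a leading hyphen has nothing before it and is left alone.
leading = re.sub(r"([^{]{1,4})-", r"\1X", r"-0.079\sym{***}")
print(leading)  # -0.079\sym{***}
```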

answered Jul 10 '17 at 02:01

Adam S

It's working for me in a python3 repl. Can you explain what you mean @TimBiegeleisen? – Adam S Jul 10 '17 at 02:04
I think what @TimBiegeleisen wants to say is that this does not replace the hyphen only, is that correct? – ℕʘʘḆḽḘ Jul 10 '17 at 02:06
It should (I'm replacing the ascii hyphen with an X for emphasis): `In [18]: str = "(-1.2)" In [19]: re.sub(r"([^{]{1,4})-",r"\1X", str) Out[19]: '(X1.2)'` – Adam S Jul 10 '17 at 02:07
thanks Adam for your help. but `re.sub(r"([^{]{1,4})-",r"\1X", '-0.079\sym{***}')` gives `'-0.079\\sym{***}'` which will break the latex code – ℕʘʘḆḽḘ Jul 10 '17 at 02:10
This will replace LaTEX code. [See the demo here](http://rextester.com/YLEZL80905). I think a correct regex would be much more complex than this answer. – Tim Biegeleisen Jul 10 '17 at 02:12
thanks @TimBiegeleisen for your help as well. Do you mind posting a solution then? – ℕʘʘḆḽḘ Jul 10 '17 at 02:12
@AdamS I have updated my question for more clarity. I feel there was some misunderstanding – ℕʘʘḆḽḘ Jul 10 '17 at 02:15
