
I am using a small function to loop over files so that every hyphen (-) gets replaced by an en dash (–, Alt+0150).

The function adds a regex to the solution from a related question (how to replace a character INSIDE the text content of many files automatically?).

import re

def mychanger(fileName):
  with open(fileName,'r') as file:
    str = file.read()
    str = str.decode("utf-8")
    str = re.sub(r"[^{]{1,4}(-)","–", str).encode("utf-8")
  with open(fileName,'wb') as file:
    file.write(str)

I used the regular expression [^{]{1,4}(-) because the search is performed on LaTeX regression tables, and I only want to replace the hyphens that occur around numbers.

To be clear: I want to replace all hyphens EXCEPT those in genuine LaTeX code such as \cmidrule(lr){2-4}.

  • In this case there is a { to the left of the hyphen and close to it (within 3-4 characters at most). This hyphen must not be changed into an en dash, otherwise the LaTeX code will break.

  • I think it matters that the exclusion only looks to the left of the hyphen. In a regression table you can have things like -0.062\sym{***} (that is, a { close to the hyphen but on its right), and in that case I do want to replace the hyphen.

A typical line in my table is

variable    &   -2.061\sym{***}&       4.032\sym{**}   &       1.236         \\
            &      (-2.32)         &   (-2.02)         &      (-0.14)    

However, my regex does not appear to be correct. For instance, (-1.2) is turned into –1.2, dropping the parenthesis.

What is the problem here? Thanks!

ℕʘʘḆḽḘ
  • Please show us all instances where you _don't_ want a replacement to happen. – Tim Biegeleisen Jul 10 '17 at 01:54
  • Can you provide a sample of your data and where it fails to match/matches erroneously? – zwer Jul 10 '17 at 01:54
  • Hi thanks! @TimBiegeleisen @zwer can only think of situations like `\cmidrule(lr){2-8}` where there is a `{` close to the hyphen and to the left of it. Indeed, in a regression table you can have things like `-0.062\sym{***}` – ℕʘʘḆḽḘ Jul 10 '17 at 01:57
  • @Noobie I attempted an answer below. Not clean at all, but let's see if we can iterate on this until we get something which solves your actual problem. – Tim Biegeleisen Jul 10 '17 at 02:43

2 Answers


I can offer the following two step replacement:

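import re

# Raw string so that \c in \cmidrule is not treated as an escape sequence
str = r"-1 Hello \cmidrule(lr){2-4} range 1-5 other stuff a-5"
# Both substitutions work on text; no .encode() between them
str = re.sub(r"((?:^|[^{])\d+)-(\d+[^}])","\\1$\\2", str)
str = re.sub(r"(^|[^0-9])-(\d+)","\\1$\\2", str)
print(str)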

The first replacement targets all ranges that are not of the LaTeX form {1-9}, i.e. that are not enclosed in curly braces. The second replacement targets every hyphen that is followed by a number and preceded by either a non-digit character or the start of the string.
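For reference, here is a minimal sketch of the same two replacements with an en dash instead of the dollar sign. It assumes Python 3, where strings are already Unicode, and that the target character is U+2013; the helper name `hyphens_to_en_dashes` is made up for illustration.

```python
import re

EN_DASH = "\u2013"  # assumption: the wanted character is U+2013 (en dash)

def hyphens_to_en_dashes(text):
    """Hypothetical helper: apply the two replacements with an en dash."""
    # Step 1: numeric ranges such as 1-5 that are not wrapped in {...}
    text = re.sub(r"((?:^|[^{])\d+)-(\d+[^}])", r"\1" + EN_DASH + r"\2", text)
    # Step 2: a hyphen followed by digits and preceded by a non-digit
    # character or by the start of the string
    text = re.sub(r"(^|[^0-9])-(\d+)", r"\1" + EN_DASH + r"\2", text)
    return text

sample = r"-1 Hello \cmidrule(lr){2-4} range 1-5 other stuff a-5"
converted = hyphens_to_en_dashes(sample)
print(converted)
```

When the text comes from a file, opening it with `open(path, encoding="utf-8")` in Python 3 returns decoded text directly, so no manual `.decode()` or `.encode()` call should be needed around the substitutions.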

Demo

Tim Biegeleisen
  • amazing!!, but does that work with en-dashes? (not dollar signs) ultimately I want to have en-dashes instead of hyphens. Thanks!!! – ℕʘʘḆḽḘ Jul 10 '17 at 02:50
  • @Noobie I couldn't get the demo working with en dashes due to some encoding issue. Just replace dollar sign with en dash and it should work. Anyway for the purposes of a demo, I think dollar sign makes the replacements clearer and easier to read. – Tim Biegeleisen Jul 10 '17 at 02:51
  • let me try again. updated my question to provide more details on the text input. thanks again! – ℕʘʘḆḽḘ Jul 10 '17 at 02:53
  • I am getting unicode errors `UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 807: ordinal not in range(128)` – ℕʘʘḆḽḘ Jul 10 '17 at 02:57
  • you can actually find an example table here http://www.jwe.cc/downloads/table.tex this one does not contain the exception but otherwise all the tables look alike – ℕʘʘḆḽḘ Jul 10 '17 at 02:58
  • This is a Python encoding problem, and I won't be of much help there. – Tim Biegeleisen Jul 10 '17 at 02:58
  • maybe using the unicode representation? u2212 is unicode character for minus and u2013 for en-dash. please look at the last question here https://stackoverflow.com/questions/44996829/how-to-replace-a-character-inside-the-text-content-of-many-files-automatically – ℕʘʘḆḽḘ Jul 10 '17 at 03:02
  • I tried something like `str = re.sub(r"((?:^|[^{])\d+)\u2212(\d+[^}])","\\1\u2013\\2", str).encode("utf-8")` but this does not seem to be the correct syntax.... – ℕʘʘḆḽḘ Jul 10 '17 at 03:04
  • @Noobie See here: https://stackoverflow.com/questions/393843/python-and-regular-expression-with-unicode – Tim Biegeleisen Jul 10 '17 at 04:11
  • @Noobie You need to put some effort in here. I gave you what appears to be a valid pattern, you can handle the rest. – Tim Biegeleisen Jul 10 '17 at 10:59

re.sub replaces the entire match. In this case that includes the non-{ characters preceding your -. You can wrap that part in parentheses to create a \1 group and include it in your substitution (you also don't need the parentheses around the -):

re.sub(r"([^{]{1,4})-",r"\1–", str)
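To make the difference visible, here is a small sketch comparing the pattern from the question with the grouped version on the `(-1.2)` example. It assumes Python 3 and writes the en dash as `"\u2013"`; the variable names are only for illustration.

```python
import re

EN_DASH = "\u2013"  # assumption: the wanted character is U+2013 (en dash)
cell = "(-1.2)"

# Pattern from the question: the "(" is part of the match, so it is replaced too.
without_group = re.sub(r"[^{]{1,4}(-)", EN_DASH, cell)
# Grouped pattern: \1 puts the characters before the hyphen back.
with_group = re.sub(r"([^{]{1,4})-", r"\1" + EN_DASH, cell)

print(without_group)
print(with_group)

# Caveat: in "{2-4}" the character right before the hyphen is "2", which is
# not a "{", so the grouped pattern still matches there.
braces = re.sub(r"([^{]{1,4})-", r"\1" + EN_DASH, "{2-4}")
print(braces)
```

The last call suggests that this pattern alone may not protect ranges such as `\cmidrule(lr){2-4}`, so it is worth checking the result on a real table before running it over many files.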
Adam S