Replace some diacritics with perl regex

Question

I want to replace some of the diacritics in a file with their ASCII equivalents. Note that I don't want to remove all of them: only those that occur before the first "@" character on each line.

In the simplified version of the file below (a.glo), there are four "é" to replace with "e": the ones before the "@" in the entries "Ass. plén.", "Ch. réun.", "éd." and "préc.". The (probably ugly) regex I use is:

(\\glossaryentry\{(\w|\s|\.)*)(é|è|ê|ë|É|È|Ê|Ë|ē)+

and it works in online testers such as www.regex101.com and in Notepad++.

But nothing is changed when I type in the Windows command line:

perl -pi -i.bak -e "s/(\\glossaryentry\{(\w|\s|\.)*)(é|è|ê|ë|É|È|Ê|Ë|ē)+/$1e/g" a.glo

(FWIW, on my system, Perl is v5.20.2.)

a.glo:

\glossaryentry{AHRF@ {\memgloterm{AHRF}}{\memglodesc{Annales historiques de la Révolution française}} {\memgloref{}}|memjustarg}{1}

\glossaryentry{Ass. plén.@ {\memgloterm{Ass. plén.}}{\memglodesc{Assemblée plénière}} {\memgloref{}}|memjustarg}{1}

\glossaryentry{Ch. réun.@ {\memgloterm{Ch. réun.}}{\memglodesc{Chambres réunies}} {\memgloref{}}|memjustarg}{1}

\glossaryentry{chron.@ {\memgloterm{chron.}}{\memglodesc{chronique}} {\memgloref{}}|memjustarg}{1}

\glossaryentry{Circ. min.@ {\memgloterm{Circ. min.}}{\memglodesc{Circulaire ministérielle}} {\memgloref{}}|memjustarg}{1}

\glossaryentry{éd.@ {\memgloterm{éd.}}{\memglodesc{édition, édité par}} {\memgloref{}}|memjustarg}{1}

\glossaryentry{Int J Semiot Law@ {\memgloterm{Int J Semiot Law}}{\memglodesc{International Journal for the Semiotics of Law - Revue internationale de sémiotique juridique}} {\memgloref{}}|memjustarg}{1}

\glossaryentry{Oxford J Legal Studies@ {\memgloterm{Oxford J Legal Studies}}{\memglodesc{Oxford Journal of Legal Studies}} {\memgloref{}}|memjustarg}{1}

\glossaryentry{préc.@ {\memgloterm{préc.}}{\memglodesc{précité}} {\memgloref{}}|memjustarg}{1}

\glossaryentry{Rev. adm.@ {\memgloterm{Rev. adm.}}{\memglodesc{Revue administrative}} {\memgloref{}}|memjustarg}{1}

See [How to convert letters with accents, umlauts, etc to their ASCII counterparts in Perl?](http://stackoverflow.com/questions/11058211/how-to-convert-letters-with-accents-umlauts-etc-to-their-ascii-counterparts-in). Have you tried using single quotes instead of double quotes on the command line? — Håkon Hægland, Aug 09 '15 at 14:48
Works fine on Linux with single quotes: `perl -pe 's/(\\glossaryentry\{(\w|\s|\.)*)(é|è|ê|ë|É|È|Ê|Ë|ē)+/$1e/g' a.glo`. I saved the file `a.glo` using UTF-8 encoding. — Håkon Hægland, Aug 09 '15 at 14:59
I got an error ("'\s' is not recognized as an internal or external command") with single quotes. Yes, `a.glo` is UTF-8. — Carg, Aug 09 '15 at 15:08
Yes, maybe double quotes are needed on Windows. Have you tried using the `utf8` pragma? Add a `-Mutf8` option to the command line. — Håkon Hægland, Aug 09 '15 at 15:16
If I change the replace argument by removing the $1, the file is changed but the result is weird… `perl -pi -i.bak -e "s/(\\glossaryentry\{(\w|\s|\.)*)(é|è|ê|ë|É|È|Ê|Ë|e)+/e/g" a.glo` — Carg, Aug 09 '15 at 15:26
Try reducing the file to only two letters, `eé`, and run `perl -Mutf8 -pe "s/(é|è|ê|ë|É|È|Ê|Ë|ē)+/$1e/g" a.glo`. What do you get? — Håkon Hægland, Aug 09 '15 at 15:37
`Malformed UTF-8 character (unexpected non-continuation byte 0x7c, immediately after start byte 0xcb) at -e line 1. ee├®` — Carg, Aug 09 '15 at 15:49
OK, interesting. Let's first try removing the non-ASCII characters from the command line: `perl -Mutf8 -pe "s/(e)+/$1e/g" a.glo`. Do you get the same error message? — Håkon Hægland, Aug 09 '15 at 15:53
OK, you did not get an error message. That means the error is probably on the command line and not in the file. I suspect that the characters on the command line are ASCII encoded and not UTF-8. Can you check that the terminal window you are using is set to UTF-8 encoding? — Håkon Hægland, Aug 09 '15 at 15:59
According to `chcp` (https://technet.microsoft.com/fr-fr/library/bb490874.aspx), it's code page 850, Multilingual (Latin I). After changing it to 65001, I get exactly the same error. Would I have the same problem if I didn't use a one-liner? — Carg, Aug 09 '15 at 16:29
Let's continue this discussion in [chat](https://chat.stackoverflow.com/rooms/86551/replace-some-diacritics-with-perl-regex) — Håkon Hægland, Aug 09 '15 at 16:54
I think you've found the problem. It works with a Perl .pl script, so the Windows command line is the source of the issue: it displays the characters correctly but fails to interpret them correctly. The workaround is to avoid the one-liner, but a real solution is still to be found (changing chcp to 65001 does not seem to be enough). — Carg, Aug 09 '15 at 17:07

score 2 · Answer 1 · answered Aug 09 '15 at 17:31

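I tried this on a Windows box, and it works.
I think, though, that the file has to be opened in its correct encoding.
I saved your text sample as ANSI text, and wrote the accented characters as `\x{...}` escapes so that no non-ASCII character appears on the command line.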

perl -pi -i.bak -e "s/(\\glossaryentry\{[\w\s.]*)[\x{E9}\x{E8}\x{EA}\x{EB}\x{C9}\x{C8}\x{CA}\x{CB}\x{113}]+/$1e/g" a.glo

 # (\\glossaryentry\{[\w\s.]*)[\x{E9}\x{E8}\x{EA}\x{EB}\x{C9}\x{C8}\x{CA}\x{CB}\x{113}]+

 (                             # (1 start)
      \\ glossaryentry \{
      [\w\s.]* 
 )                             # (1 end)
 [\x{E9}\x{E8}\x{EA}\x{EB}\x{C9}\x{C8}\x{CA}\x{CB}\x{113}]+
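
For a file saved as UTF-8 (as in the question), the script workaround mentioned in the comments could look roughly like the sketch below. It is not the one-liner above: instead of a single regex, it cuts each line at the first "@" and transliterates only the part before it, so every accented "e" in the key is handled, not just the first run. The function name `ascii_key` and the two-argument calling convention are hypothetical, the character list is the same subset used in the regex, and mapping uppercase variants to "E" (rather than "e", as the one-liner does) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: replace accented "e" variants with ASCII letters,
# but only in the part of the line that precedes the first "@".
sub ascii_key {
    my ($line) = @_;
    my $at = index($line, '@');
    return $line if $at < 0;              # no "@": leave the line untouched
    my $head = substr($line, 0, $at);     # text before the first "@"
    # Same code points as in the regex; uppercase -> "E" is an assumption.
    $head =~ tr/\x{E9}\x{E8}\x{EA}\x{EB}\x{113}\x{C9}\x{C8}\x{CA}\x{CB}/eeeeeEEEE/;
    return $head . substr($line, $at);    # "@" and the rest are kept as-is
}

# Assumed usage: two file names on the command line (input, output).
if (@ARGV == 2) {
    my ($in, $out) = @ARGV;
    # Decode and encode explicitly, so the console code page plays no role.
    open my $ifh, '<:encoding(UTF-8)', $in  or die "Cannot read $in: $!";
    open my $ofh, '>:encoding(UTF-8)', $out or die "Cannot write $out: $!";
    print {$ofh} ascii_key($_) while <$ifh>;
    close $ofh or die "Cannot close $out: $!";
    close $ifh;
}
```

Saved under a name such as `strip_accents.pl` (hypothetical), it would be run as `perl strip_accents.pl a.glo a.new.glo`. Because the script contains only ASCII and the accented characters are written as code-point escapes, neither the command line nor the script file's own encoding should affect the result.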
