I'm trying to do the following:

LC_CTYPE=C sed 's/|/¦/g' t.txt > new_t.txt

The command runs, but when I open the new file, each replacement shows up with an additional character: "A¦". Why is that?

Enlico
Javier Muñoz
  • Depends on how you typed the ¦ character and how you are viewing the file. I'm guessing your command line represented that as UTF-8 whereas you are apparently using something else (Latin-1?) to view the file (though strictly speaking that should give you `Â¦`, not `A¦`). Perhaps see also https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors – tripleee Jan 13 '21 at 13:11
  • Regarding your question as it was before I edited it: [**do not use signature, taglines, or greetings**](https://stackoverflow.com/help/behavior). – Enlico Jan 13 '21 at 13:11
  • This is almost certainly a duplicate, but I fail to find one which is very specific to `sed` on macOS. – tripleee Jan 13 '21 at 13:31

1 Answer

When you typed

LC_CTYPE=C sed 's/|/¦/g' t.txt > new_t.txt

your shell was probably configured to pass the command itself as UTF-8, so you in fact replaced the single byte 0x7C (U+007C) with the two bytes 0xC2 0xA6, which is the correct UTF-8 encoding of U+00A6.
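
To see those bytes directly, here is a minimal sketch that runs the same substitution on a one-line sample and dumps the result as hex. The sample input `a|b` and the variable names `bar` and `out_hex` are only for illustration; the replacement is written with octal escapes so it does not depend on how your terminal encodes a typed ¦.

```shell
# UTF-8 bytes for U+00A6, written as octal escapes so the result does not
# depend on how the terminal encodes a typed character.
bar=$(printf '\302\246')

# Same substitution as in the question, applied to a one-line sample input,
# then dumped as hex with all whitespace removed.
out_hex=$(printf 'a|b\n' | LC_CTYPE=C sed "s/|/$bar/g" | od -An -tx1 | tr -d '[:space:]')

echo "$out_hex"   # 61c2a6620a: a, the two bytes C2 A6, b, newline
```

The single byte 7c is gone and c2 a6 sits in its place, so `sed` itself has not added anything beyond the UTF-8 encoding of ¦.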

What you then did is unclear, but somehow you ended up examining the file in an encoding other than UTF-8, which displays the two bytes as two separate characters: the string you report seeing.
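
The effect can probably be reproduced by deliberately decoding the two bytes as a single-byte encoding. The sketch below assumes the viewer used ISO-8859-1 (Latin-1), which is a guess, and that `iconv` is installed; `mojibake_hex` is just an illustrative variable name.

```shell
# Treat the UTF-8 bytes C2 A6 as if they were ISO-8859-1, then re-encode the
# resulting two characters as UTF-8 and dump them as hex.
mojibake_hex=$(printf '\302\246' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d '[:space:]')

echo "$mojibake_hex"   # c382c2a6: U+00C2 (Â) followed by U+00A6 (¦)
```

Under that assumption, one character has become two: 0xC2 is read as Â and 0xA6 as ¦.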

The correct workaround is to examine the file in a correctly configured program which supports UTF-8.
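
If you want to confirm that the file itself is fine before blaming the viewer, one option is to ask `iconv` to decode it as UTF-8 and check whether that succeeds. This is only a sketch: the helper `is_utf8` and the temporary sample file are made up for the example, and you would pass `new_t.txt` instead.

```shell
# Succeeds only if the whole file decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Stand-in for new_t.txt: "a", the UTF-8 bytes for U+00A6, "b", newline.
sample=$(mktemp "${TMPDIR:-/tmp}/utf8check.XXXXXX")
printf 'a\302\246b\n' > "$sample"

if is_utf8 "$sample"; then
    status="valid UTF-8"
else
    status="not valid UTF-8"
fi
echo "$status"

rm -f "$sample"
```

If the check passes, the remaining step is to make the editor or terminal open the file as UTF-8.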

tripleee