258

I'm trying to replace a string in a Makefile on Mac OS X for cross-compiling to iOS. The string has embedded double quotes. The command is:

sed -i "" 's|"iphoneos-cross","llvm-gcc:-O3|"iphoneos-cross","clang:-Os|g' Configure

And the error is:

sed: RE error: illegal byte sequence

I've tried escaping the double quotes, commas, dashes, and colons with no joy. For example:

sed -i "" 's|\"iphoneos-cross\"\,\"llvm-gcc\:\-O3|\"iphoneos-cross\"\,\"clang\:\-Os|g' Configure

I'm having a heck of a time debugging the issue. Does anyone know how to get sed to print the position of the illegal byte sequence? Or does anyone know what the illegal byte sequence is?

jww
  • 4
    Illegal byte sequence sounds like something you get when feeding 8-bit ascii to something that expects utf-8. – Klas Lindbäck Oct 08 '13 at 08:08
  • 74
    Can you try: `LC_CTYPE=C && LANG=C && sed command` – anubhava Oct 08 '13 at 08:09
  • 10
    Thanks folks. It was the `LANG` thing. Sigh.... – jww Oct 08 '13 at 08:10
  • Did anyone know how to determine the start of the sequence being flagged as invalid? `sed -v` caused an error in the command, and the `man` pages did not discuss the topic. – jww Oct 09 '13 at 01:57
  • Can someone enlighten me how the command line shown can be valid, with that empty argument after the `-i`? – user2719058 Nov 04 '13 at 18:09
  • 7
    @user2719058: BSD `sed` (as also used on OS X) requires `-i ''` (separate, empty-string option-argument) for in-place updating without a backup file; with GNU `sed`, only `-i` by itself works - see http://stackoverflow.com/a/40777793/45375 – mklement0 Dec 14 '16 at 18:50
  • 6
    Plus one for the LANG thing. Good grief, that's obscure, non-obvious and surprisingly difficult to research. – Spudley Apr 15 '19 at 14:34
  • 1
    `LC_CTYPE=C LANG=C sed command` should work as well – forzagreen Apr 21 '21 at 13:15

8 Answers

389

A sample command that exhibits the symptom: sed 's/./@/' <<<$'\xfc' fails, because byte 0xfc is not a valid UTF-8 char.
Note that, by contrast, GNU sed (Linux, but also installable on macOS) simply passes the invalid byte through, without reporting an error.

Using the formerly accepted answer is an option if you don't mind losing support for your true locale (if you're on a US system and you never need to deal with foreign characters, that may be fine.)

However, the same effect can be had ad-hoc for a single command only:

LC_ALL=C sed -i "" 's|"iphoneos-cross","llvm-gcc:-O3|"iphoneos-cross","clang:-Os|g' Configure

Note: What matters is an effective LC_CTYPE setting of C, so LC_CTYPE=C sed ... would normally also work, but if LC_ALL happens to be set (to something other than C), it will override individual LC_*-category variables such as LC_CTYPE. Thus, the most robust approach is to set LC_ALL.
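
The precedence described in this note can be sketched as a small shell function. This is only an illustration: effective_ctype is a hypothetical helper name, and the fallback to C when nothing is set is an assumption (POSIX leaves that default to the implementation).

```shell
# Hypothetical helper: print the locale name that governs LC_CTYPE.
# Precedence: LC_ALL, then LC_CTYPE, then LANG (empty values count as unset).
# Assumption: fall back to "C" when none of them is set.
effective_ctype() {
  printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

# LC_ALL wins even though LC_CTYPE is set to something else.
# The subshell keeps the assignments from leaking into the current shell.
(LC_ALL=C; LC_CTYPE=POSIX; effective_ctype)   # prints: C
```

The locale utility reports the same information from the real environment; the function only makes the lookup order explicit.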

However, (effectively) setting LC_CTYPE to C treats strings as if each byte were its own character (no interpretation based on encoding rules is performed), with no regard for the - multibyte-on-demand - UTF-8 encoding that OS X employs by default, where foreign characters have multibyte encodings.

In a nutshell: setting LC_CTYPE to C causes the shell and utilities to only recognize basic English letters as letters (the ones in the 7-bit ASCII range), so that foreign chars. will not be treated as letters, causing, for instance, upper-/lowercase conversions to fail.

Again, this may be fine if you needn't match multibyte-encoded characters such as é, and simply want to pass such characters through.

If this is insufficient and/or you want to understand the cause of the original error (including determining what input bytes caused the problem) and perform encoding conversions on demand, read on below.


The problem is that the input file's encoding does not match the shell's.
More specifically, the input file contains characters encoded in a way that is not valid in UTF-8 (as @Klas Lindbäck stated in a comment) - that's what the sed error message is trying to say by illegal byte sequence.

Most likely, your input file uses a single-byte 8-bit encoding such as ISO-8859-1, frequently used to encode "Western European" languages.

Example:

The accented letter à has Unicode codepoint 0xE0 (224) - the same as in ISO-8859-1. However, due to the nature of UTF-8 encoding, this single codepoint is represented as 2 bytes - 0xC3 0xA0, whereas trying to pass the single byte 0xE0 is invalid under UTF-8.

Here's a demonstration of the problem using the string voilà encoded as ISO-8859-1, with the à represented as one byte (via an ANSI-C-quoted bash string ($'...') that uses \x{e0} to create the byte):

Note that the sed command is effectively a no-op that simply passes the input through, but we need it to provoke the error:

  # -> 'illegal byte sequence': byte 0xE0 is not a valid char.
sed 's/.*/&/' <<<$'voil\x{e0}'

To simply ignore the problem, the above LC_CTYPE=C approach can be used:

  # No error, bytes are passed through ('à' will render as '?', though).
LC_CTYPE=C sed 's/.*/&/' <<<$'voil\x{e0}'

If you want to determine which parts of the input cause the problem, try the following:

  # Convert bytes in the 8-bit range (high bit set) to hex. representation.
  # -> 'voil\x{e0}'
iconv -f ASCII --byte-subst='\x{%02x}' <<<$'voil\x{e0}'

The output will show you all bytes that have the high bit set (bytes that exceed the 7-bit ASCII range) in hexadecimal form. (Note, however, that that also includes correctly encoded UTF-8 multibyte sequences - a more sophisticated approach would be needed to specifically identify invalid-in-UTF-8 bytes.)
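
One way to narrow this down to lines that are actually invalid as UTF-8 is to let iconv validate each line. A minimal sketch, assuming your iconv exits with a non-zero status on invalid input when converting from UTF-8 to UTF-8; find_invalid_utf8_lines is a hypothetical helper name:

```shell
# Hypothetical helper: print the numbers of the lines in file $1
# that iconv rejects as invalid UTF-8.
# The function body is a subshell, so its variables do not leak.
find_invalid_utf8_lines() (
  n=0
  # "|| [ -n "$line" ]" also handles a last line without a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    # Assumption: iconv fails (non-zero exit) on an invalid byte sequence.
    if ! printf '%s\n' "$line" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
      printf '%s\n' "$n"
    fi
  done < "$1"
)
```

Calling find_invalid_utf8_lines Configure would list the offending line numbers; a reported line can then be inspected byte by byte with od -c.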


Performing encoding conversions on demand:

Standard utility iconv can be used to convert to (-t) and/or from (-f) encodings; iconv -l lists all supported ones.

Examples:

Convert FROM ISO-8859-1 to the encoding in effect in the shell (based on LC_CTYPE, which is UTF-8-based by default), building on the above example:

  # Converts to UTF-8; output renders correctly as 'voilà'
sed 's/.*/&/' <<<"$(iconv -f ISO-8859-1 <<<$'voil\x{e0}')"

Note that this conversion allows you to properly match foreign characters:

  # Correctly matches 'à' and replaces it with 'ü': -> 'voilü'
sed 's/à/ü/' <<<"$(iconv -f ISO-8859-1 <<<$'voil\x{e0}')"

To convert the input BACK to ISO-8859-1 after processing, simply pipe the result to another iconv command:

sed 's/à/ü/' <<<"$(iconv -f ISO-8859-1 <<<$'voil\x{e0}')" | iconv -t ISO-8859-1
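
For a file rather than a here-string, the same round trip can be wrapped in a function. This is a sketch under assumptions: sed_via_utf8 is a made-up name, the source encoding has to be known and passed in explicitly, and the result goes to standard output rather than being written in place.

```shell
# Hypothetical helper: run a sed script on a file in a non-UTF-8 encoding
# by converting to UTF-8 first and converting back afterwards.
#   $1 = sed script, $2 = the file's encoding (e.g. ISO-8859-1), $3 = input file
sed_via_utf8() {
  iconv -f "$2" -t UTF-8 "$3" | sed "$1" | iconv -f UTF-8 -t "$2"
}
```

A call such as sed_via_utf8 's/à/ü/' ISO-8859-1 file.txt > file.new would leave file.new in the original encoding; redirecting to a new file avoids overwriting the input while it is still being read.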
mklement0
  • 6
    I'd say this is a much better option. First, I wouldn't want to lose multi-language support in all of Terminal. Second, the accepted answer feels like a global solution to a local problem - something to be avoided. – Alex May 28 '14 at 17:42
  • I had a couple of small tweaks to this. I'd appreciate feedback. http://stackoverflow.com/a/35046218/9636 – Heath Borders Jan 27 '16 at 19:22
  • `LC_CTYPE=C sed 's/.*/&/' <<<$'voil\x{e0}'` prints `sed: RE error: illegal byte sequence` for me on Sierra. `echo $LC_ALL` outputs `en_US.UTF-8` FWIW. – ahcox Feb 08 '18 at 19:35
  • 4
    @ahcox: Yes, because setting `LC_ALL` _overrides_ all other `LC_*` variables, including `LC_CTYPE`, as explained in the answer. – mklement0 Feb 08 '18 at 19:55
  • 3
    @mklement0 Cool, this works: "LC_ALL=C sed 's/.*/&/' <<<$'voil\x{e0}'". Precedence explained here for my fellow inattentive ignoramuses: http://pubs.opengroup.org/onlinepubs/7908799/xbd/envvar.html – ahcox Feb 08 '18 at 20:49
  • Can you tell why `echo "HÂBnc" | sed 's/[A-Z]*/\`&\`/g'` isn't working? It gives illegal byte sequence error. I've tried converting it using iconv but still no progress. – Dhruv Apr 10 '23 at 18:59
  • @Dhruv, I don't see this error; I suggest you ask a new question with a [mcve]. – mklement0 Apr 10 '23 at 19:53
  • I could ask a new question, but the minimum reproducible example is the code in previous comment itself. I get the error `RE error: illegal byte sequence` on Mac terminal. – Dhruv Apr 11 '23 at 05:35
  • @Dhruv, it doesn't reproduce the error for me (my locale is`en_US.UTF-8`). – mklement0 Apr 11 '23 at 07:59
  • Did you copy the string in echo, it has a different character `Â` – Dhruv Apr 12 '23 at 11:36
  • @Dhruv, yes, I copied and pasted. With a UTF-8-based locale, I wouldn't know how to reproduce the error that way (pasting _literal text_). You need a _byte_ source that contains invalid-as-UTF-8 bytes or byte sequences to provoke the error. You can try `echo "HÂBnc" | od -c`, and if you see a _number_ in lieu of an accented character, then you have an invalid byte sequence. With text input, I can only provoke that if I use an ANSI C-quoted string that includes an escape sequence for an invalid-as-UTF-8 byte value, e.g. `printf $'h\xFC' | sed 's/./x/g'` – mklement0 Apr 12 '23 at 12:38
  • I saw that there is no error in my UTF-8 encoding of the character but it produces error only for specific regex patterns. I've converted the string to ANSI C-quoted string and the `E` option in sed is to indicate to use advanced regex (basically enable the use of `+` in regex pattern). The following two regex patterns work fine, `echo $'H\xc3\x82Bnc' | sed -E 's/[A-Z]+/\`&\`/g'` gives `\`H\`Â\`B\`nc` and `echo $'H\xc3\x82Bnc' | sed 's/.*/\`&\`/g'` gives `\`HÂBnc\``. But the sequence `echo $'H\xc3\x82Bnc' | sed -E 's/[A-Z]*/\`&\`/g'` gives error. – Dhruv Apr 13 '23 at 07:49
  • This is the output of my `locale` - `LANG=""`, `LC_CTYPE="UTF-8"`. Rest all have value `C`. However when I set `LANG` to `UTF-8` the command works. `echo $'H\xc3\x82Bnc' | LANG="UTF-8" sed -E 's/[A-Z]*/\`&\`/g'` gives `\`H\`?\`\`?\`B\`n\`\`c\`\``. I think the empty string that matched between the two bytes of the UTF-8 character were modified and `\`` was added in between which produced the illegal byte sequence error. I don't understand what setting the `LANG` did? – Dhruv Apr 13 '23 at 08:48
  • @Dhruv: A `LANG` value provides a _default value_ for all `LC_*` values - _except_ if you set `LC_ALL`, which overrides all other `LC_*` values, as well as `LANG`. `UTF-8` _alone_ is _not_ a valid locale identifier, so `LANG="UTF-8"` makes the `LC_*` values (except `LC_ALL`) _fall back_ to `"C"`. Use `locale -a` to see all valid locale identifiers available on your system. – mklement0 Apr 13 '23 at 23:26
  • Oh ok. The command works when I set `LC_COLLATE` to `en_GB.UTF-8`. `echo $'H\xc3\x82Bnc' | LC_COLLATE="en_GB.UTF-8" sed -E 's/[A-Z]*/\`&\`/g' ` produces `\`HÂB\`n\`\`c\`\``. So can you explain what is LC_COLLATE doing (earlier it was set to C) – Dhruv Apr 14 '23 at 12:03
177

Add the following lines to your ~/.bash_profile or ~/.zshrc file(s).

export LC_CTYPE=C 
export LANG=C
binarytemple_picsolve
  • 34
    it actually works, but could you please explain why? – Hoang Pham Feb 04 '14 at 17:54
  • I tried setting these variables to both be `en_GB.UTF-8` (which is what i export to LANG already in my .bash_profile) and get the same error. What is "C" here? – Max Williams Mar 21 '14 at 13:08
  • Here is the best documentation I was able to find about `LC_CTYPE`:http://www.delorie.com/gnu/docs/gawk/gawk_149.html – Jason Sperske Apr 08 '14 at 18:32
  • 17
    @HoangPham: Setting `LC_CTYPE` to `C` causes each byte in strings to be its own character without applying any encoding rules. Since a violation of (UTF-8) encoding rules caused the original problem, this makes the problem go away. However, the price you pay is that the shell and utilities then only recognize the basic English letters (the ones in the 7-bit ASCII range) as letters. See my answer for more. – mklement0 May 10 '14 at 18:37
  • 10
    Setting this permanently in your shell's startup files will disable many useful behaviors. You want to put this in only for individual commands which absolutely require it. – tripleee Jan 18 '16 at 08:16
  • 8
    Too dangerous; this may cause unexpected consequences. One could use `LC_CTYPE=C sed …`, i.e. only on the sed command. – Yongwei Wu Mar 08 '18 at 06:40
  • 5
    This will completely disable support for Unicode characters in your shell. Goodbye emojis, fancy line drawing characters, letters with accents, .... Much better to just set this for the sed command only, as described in other answers. – asmeurer Apr 03 '18 at 21:30
23

My workaround had been using Perl:

find . -type f -print0 | xargs -0 perl -pi -e 's/was/now/g'
Vitaly Zdanevich
  • 4
    This one works great. And i've had no errors escaping special characters unlike the others. The previous ones gave me issues like "sed: RE error: illegal byte sequence" or sed: 1: "path_to_file": invalid command code . – JMags1632 May 29 '20 at 04:08
  • 2
    Simple and no need for configurations etc. Love it. – Thanos Jan 07 '22 at 14:16
5

You simply have to pipe the output of an iconv command into the sed command. Example with file.txt as input:

iconv -f ISO-8859-1 -t UTF8-MAC file.txt | sed 's/something/àéèêçùû/g' | .....

-f option is the 'from' codeset and -t option is the 'to' codeset conversion.

Mind the case: web pages usually show the charset in lowercase, as in charset=iso-8859-1, whereas iconv lists its codesets in uppercase. You can list the codesets iconv supports on your system with the command iconv -l

UTF8-MAC is the codeset modern Mac OS uses for the conversion.

  • 1
    Also see [iconv and charset names](https://lists.gnu.org/archive/html/bug-gnu-libiconv/2019-05/msg00004.html) on the iconv mailing list. – jww May 10 '19 at 17:53
4

mklement0's answer is great, but I have some small tweaks.

It seems like a good idea to explicitly specify bash's encoding when using iconv. Also, we should prepend a byte-order mark (even though the unicode standard doesn't recommend it) because there can be legitimate confusions between UTF-8 and ASCII without a byte-order mark. Unfortunately, iconv doesn't prepend a byte-order mark when you explicitly specify an endianness (UTF-16BE or UTF-16LE), so we need to use UTF-16, which uses platform-specific endianness, and then use file --mime-encoding to discover the true endianness iconv used.

(I uppercase all my encodings because when you list all of iconv's supported encodings with iconv -l they are all uppercase.)

# Find out MY_FILE's encoding
# We'll convert back to this at the end
FILE_ENCODING="$( file --brief --mime-encoding MY_FILE )"
# Find out bash's encoding, with which we should encode
# MY_FILE so sed doesn't fail with 
# sed: RE error: illegal byte sequence
BASH_ENCODING="$( locale charmap | tr '[:lower:]' '[:upper:]' )"
# Convert to UTF-16 (unknown endianness) so iconv ensures
# we have a byte-order mark
iconv -f "$FILE_ENCODING" -t UTF-16 MY_FILE > MY_FILE.utf16_encoding
# Whether we're using UTF-16BE or UTF-16LE
UTF16_ENCODING="$( file --brief --mime-encoding MY_FILE.utf16_encoding )"
# Now we can use MY_FILE.bash_encoding with sed
iconv -f "$UTF16_ENCODING" -t "$BASH_ENCODING" MY_FILE.utf16_encoding > MY_FILE.bash_encoding
# sed!
sed 's/.*/&/' MY_FILE.bash_encoding > MY_FILE_SEDDED.bash_encoding
# now convert MY_FILE_SEDDED.bash_encoding back to its original encoding
iconv -f "$BASH_ENCODING" -t "$FILE_ENCODING" MY_FILE_SEDDED.bash_encoding > MY_FILE_SEDDED
# Now MY_FILE_SEDDED has been processed by sed, and is in the same encoding as MY_FILE
Heath Borders
  • 1
    ++ for helpful techniques, especially `file -b --mime-encoding` for discovering and reporting a file's encoding. There are some aspects worth addressing, however, which I'll do in separate comments. – mklement0 Jan 28 '16 at 06:07
  • 3
    I think it's safe to say that the Unix world has embraced UTF-8 at this point: the default `LC_CTYPE` value is usually `.UTF-8`, so any file _without_ a BOM (byte-order mark) is therefore interpreted as a UTF-8 file. It is only in the _Windows_ world that the _pseudo-BOM_ `0xef 0xbb 0xbf` is used; by definition, UTF-8 does not _need_ a BOM, and using one is not recommended (as you state); outside the Windows world, this pseudo-BOM causes things to _break_. – mklement0 Jan 28 '16 at 06:08
  • 3
    Re `Unfortunately, iconv doesn't prepend a byte-order mark when you explicitly specify an endianness (UTF-16BE or UTF-16LE)`: that is by design: if you specify the endianness _explicitly_, there's no need to also reflect it via a BOM, so none is added. – mklement0 Jan 28 '16 at 06:09
  • A nit-pick: it's not _Bash's_ encoding; it's the encoding associated with the _current locale_, which is based on _environment variables_, and thus shell-independent; utility `locale` will show you what locale is in effect, expressed in terms of the `LANG` and `LC_*` environment variables (and `locale charmap`, as you demonstrate, will print the character encoding in effect). – mklement0 Jan 28 '16 at 06:09
  • I'm running OSX 10.10.5 and 10.11.3, and for both of them `file` returns ASCII for plain text, but UTF-8 if I put an emoji in the file. However, if I add a BOM, `file` always returns UTF-8. – Heath Borders Jan 28 '16 at 17:04
  • I didn't realize the LC variables were used by other shells. – Heath Borders Jan 28 '16 at 17:04
  • 2
    Re `LC_*` / `LANG` variables: `bash`, `ksh`, and `zsh` (possibly others, but _not_ `dash`) do respect the character encoding; verify in POSIX-like shells with an UTF-8-based locale with `v='ä'; echo "${#v}"`: a UTF-8 aware shell should report `1`; i.e., it should recognize the multi-byte sequence `ä` (`0xc3 0xa4`), as a _single_ character. Perhaps even more importantly, however: the _standard utilities_ (`sed`, `awk`, `cut`, ...) also need to be locale/encoding-aware, and while _most_ of them on modern Unix-like platforms are, there are exceptions, such as `awk` on OSX, and `cut` on Linux. – mklement0 Jan 28 '16 at 19:40
  • 2
    It's commendable that `file` recognizes the UTF-8 pseudo-BOM, but the problem is that most Unix utilities that process files do _not_, and usually break or at least misbehave when faced with one. Without a BOM, `file` correctly identifies an all-7-bit bytes file as ASCII, and one that has valid UTF-8 multi-byte characters as UTF-8. The beauty of UTF-8 is that it is a _superset_ of ASCII: any valid ASCII file is by definition a valid UTF-8 file (but not vice versa); it's perfectly safe to treat an ASCII file as UTF-8 (which it technically is, it just happens to contain no multi-byte chars.) – mklement0 Jan 28 '16 at 19:46
2

Does anyone know how to get sed to print the position of the illegal byte sequence? Or does anyone know what the illegal byte sequence is?

$ uname -a
Darwin Adams-iMac 18.7.0 Darwin Kernel Version 18.7.0: Tue Aug 20 16:57:14 PDT 2019; root:xnu-4903.271.2~2/RELEASE_X86_64 x86_64

I got part of the way to answering the above just by using tr.

I have a .csv file that is a credit card statement and I am trying to import it into Gnucash. I am based in Switzerland so I have to deal with words like Zürich. Suspecting Gnucash does not like " " in numeric fields, I decide to simply replace all

; ;

with

;;

Here goes:

$ head -3 Auswertungen.csv | tail -1 | sed -e 's/; ;/;;/g'
sed: RE error: illegal byte sequence

I used od to shed some light. Note the 374 (an octal byte value) halfway down this od -c output:

$ head -3 Auswertungen.csv | tail -1 | od -c
0000000    1   6   8   7       9   6   1   9       7   1   2   2   ;   5
0000020    4   6   8       8   7   X   X       X   X   X   X       2   6
0000040    6   0   ;   M   Y       N   A   M   E       I   S   X   ;   1
0000060    4   .   0   2   .   2   0   1   9   ;   9   5   5   2       -
0000100        M   i   t   a   r   b   e   i   t   e   r   r   e   s   t
0000120                Z 374   r   i   c   h                            
0000140    C   H   E   ;   R   e   s   t   a   u   r   a   n   t   s   ,
0000160        B   a   r   s   ;   6   .   2   0   ;   C   H   F   ;    
0000200    ;   C   H   F   ;   6   .   2   0   ;       ;   1   5   .   0
0000220    2   .   2   0   1   9  \n                                    
0000227

Then I thought I might try to persuade tr to substitute 374 for whatever the correct byte code is. So first I tried something simple, which didn't work, but had the side effect of showing me where the troublesome byte was:

$ head -3 Auswertungen.csv | tail -1 | tr . .  ; echo
tr: Illegal byte sequence
1687 9619 7122;5468 87XX XXXX 2660;MY NAME ISX;14.02.2019;9552 - Mitarbeiterrest   Z

You can see tr bails at the 374 character.
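
The 374 that od -c prints is an octal byte value. A quick sketch of how to translate it, assuming the file is ISO-8859-1 (a guess based on the Zürich text; both helper names are made up for illustration):

```shell
# Hypothetical helpers for interpreting the octal byte values printed by od -c.

# Convert an octal byte value (e.g. 374) to hexadecimal.
octal_to_hex() {
  printf '%x\n' "0$1"
}

# Emit the byte with the given octal value and convert it from
# ISO-8859-1 (an assumption about the input file) to UTF-8.
latin1_octal_to_utf8() {
  printf "\\$1" | iconv -f ISO-8859-1 -t UTF-8
}

octal_to_hex 374   # prints: fc
```

Byte 0xFC is ü in ISO-8859-1, so latin1_octal_to_utf8 374 should display ü in a UTF-8 terminal, which fits the truncated Z in the tr output above.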

Using perl seems to avoid this problem

$ head -3 Auswertungen.csv | tail -1 | perl -pne 's/; ;/;;/g'
1687 9619 7122;5468 87XX XXXX 2660;ADAM NEALIS;14.02.2019;9552 - Mitarbeiterrest   Z?rich       CHE;Restaurants, Bars;6.20;CHF;;CHF;6.20;;15.02.2019
1

My workaround was to use GNU sed, which worked fine for my purposes.

lu_zero
  • 1
    Indeed, _GNU_ `sed` is an option if you want to _ignore_ invalid bytes in the input stream (no need for the `LC_ALL=C sed ...` workaround), because GNU `sed` simply _passes invalid bytes through_ instead of reporting an error, but note that if you want to properly recognize and process all characters in the input string, there is no way around changing the input's encoding first (typically, with `iconv`). – mklement0 Dec 09 '16 at 04:08
0

For me, this issue was rooted in the command attempting to open/edit .DS_Store files. Removing those resolved it for me.
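
If stray binary files such as .DS_Store are the trigger, one option is to exclude them when collecting files instead of deleting them. A sketch (list_without_ds_store is a made-up name, and the sed command in the comment is only an example of what it might feed):

```shell
# Hypothetical helper: list regular files under directory $1,
# skipping the .DS_Store files that Finder creates.
list_without_ds_store() {
  find "$1" -type f ! -name '.DS_Store' -print
}

# Possible use with BSD sed (not run here):
#   list_without_ds_store . | while IFS= read -r f; do
#     sed -i '' 's/was/now/g' "$f"
#   done
```

This only filters by file name; other binary files in the tree would still need their own exclusion.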

onassar