
What would an awk script (presumably a one-liner) for removing a BOM look like?

Specification:

  • print every line after the first (NR > 1)
  • for the first line: if it starts with `#FE #FF` or `#FF #FE`, remove those bytes and print the rest
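A byte-oriented sketch of that spec, using only common Unix tools (`in.txt`/`out.txt` are placeholder names):

```shell
# Create a sample file that starts with a UTF-16 BE BOM (FE FF).
printf '\376\377hello\n' > in.txt
# Read the first two bytes as hex.
first2=$(head -c 2 in.txt | od -An -tx1 | tr -d ' \n')
case $first2 in
  feff|fffe) tail -c +3 in.txt > out.txt ;;  # BOM found: drop the two BOM bytes
  *)         cat in.txt > out.txt ;;         # no BOM: copy unchanged
esac
```

Real UTF-16 text would of course contain NUL bytes throughout; this only mirrors the byte-level stripping the spec asks for.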
Boldewyn

5 Answers


Using GNU sed (on Linux or Cygwin):

# Removing BOM from all text files in current directory:
sed -i '1 s/^\xef\xbb\xbf//' *.txt

On FreeBSD:

sed -i .bak '1 s/^\xef\xbb\xbf//' *.txt

Advantage of using GNU or FreeBSD sed: the -i parameter means "in place", and will update files without the need for redirections or weird tricks.

On Mac:

The awk solution in another answer works, but the sed command above does not. At least on Mac (Sierra), the sed documentation does not mention support for hexadecimal escapes like \xef.

A similar trick can be achieved with any program by piping to the sponge tool from moreutils:

awk '…' INFILE | sponge INFILE
Peter Lamberg
Denilson Sá Maia
  • I tried the second command precisely on Mac OS X and the result was "success", but the substitution didn't actually occur. – Hakanai Dec 06 '12 at 05:52
  • It is worth noting these commands replace one specific byte sequence, which is [one of the possible byte-order marks](http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding). Maybe your file had a different BOM sequence. (I can't help other than that, as I don't have a Mac.) – Denilson Sá Maia Dec 07 '12 at 17:04
  • When I tried the second command on OS X on a file that used 0xef 0xbb 0xbf as the BOM, it did not actually do the substitution. – John Wiseman Oct 13 '15 at 20:33
  • On OS X, I could only get this to work via perl, as shown here: http://stackoverflow.com/a/9101056/2063546 – Ian Aug 19 '16 at 18:41
  • On OS X El Capitan `10.11.6`, this doesn't work, but the accepted answer http://stackoverflow.com/a/1068700/9636 works fine. – Heath Borders Sep 13 '16 at 15:54

Try this:

awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' INFILE > OUTFILE

On the first record (line), remove the BOM characters. Print every record.

Or slightly shorter, using the knowledge that the default action in awk is to print the record:

awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' INFILE > OUTFILE

1 is the shortest condition that always evaluates to true, so each record is printed.
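The effect of the bare `1` is easy to see in isolation:

```shell
# `1` is a pattern that is always true; with no action given, awk applies
# the default action {print}, so this simply copies its input:
seq 3 | awk '1'
```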

Enjoy!

-- ADDENDUM --

Unicode Byte Order Mark (BOM) FAQ includes the following table listing the exact BOM bytes for each encoding:

Bytes         |  Encoding Form
--------------------------------------
00 00 FE FF   |  UTF-32, big-endian
FF FE 00 00   |  UTF-32, little-endian
FE FF         |  UTF-16, big-endian
FF FE         |  UTF-16, little-endian
EF BB BF      |  UTF-8

Thus, you can see how `\xef\xbb\xbf` corresponds to the `EF BB BF` UTF-8 BOM bytes in the table above.
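As a sketch (not part of the original answer), the table can be turned into a small BOM classifier; the filename and sample file are placeholders:

```shell
f=sample.txt
printf '\357\273\277hi\n' > "$f"     # sample file with a UTF-8 BOM (EF BB BF)
# Read up to the first four bytes as hex.
bom=$(head -c 4 "$f" | od -An -tx1 | tr -d ' \n')
case $bom in
  0000feff*) enc='UTF-32, big-endian' ;;
  fffe0000*) enc='UTF-32, little-endian' ;;  # must match before the FF FE case below
  efbbbf*)   enc='UTF-8' ;;
  feff*)     enc='UTF-16, big-endian' ;;
  fffe*)     enc='UTF-16, little-endian' ;;
  *)         enc='no BOM' ;;
esac
echo "$enc"
```

Note that `FF FE 00 00` is ambiguous: it is also valid UTF-16 little-endian text whose first character is NUL, which is why the UTF-32 pattern is tested first.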

Heath Borders
Bartosz
  • It seems that the dot in the middle of the sub statement is too much (at least, my awk complains about it). Besides this it's exactly what I searched for, thanks! – Boldewyn Jul 01 '09 at 12:21
  • This solution, however, works **only** for UTF-8 encoded files. For others, like UTF-16, see Wikipedia for the corresponding BOM representation: http://en.wikipedia.org/wiki/Byte_order_mark – Boldewyn Jul 01 '09 at 12:36
  • I agree with the earlier comment; the dot does not belong in the middle of this statement and makes this otherwise great little snippet an example of an awk syntax error. – Brandon Rhodes Dec 08 '09 at 14:37
  • So: `awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' INFILE > OUTFILE` and make sure INFILE and OUTFILE are different! – Steve Clay Feb 12 '10 at 20:30
  • If you used `perl -i.orig -pe 's/^\x{FFFE}//' badfile` you could rely on your PERL_UNICODE and/or PERLIO environment variables for the encoding. PERL_UNICODE=SD would work for UTF-8; for the others, you'd need PERLIO. – tchrist Aug 14 '11 at 23:38
  • Maybe a little bit shorter version: `awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1'` – TrueY Jun 06 '13 at 10:02
  • Works great on OS X El Capitan `10.11.6`. – Heath Borders Sep 13 '16 at 15:41
  • /!\ Both commands erased my file, as a "side effect" of changing the encoding... Quite fortunate to have had them backed up first. – OroshiX Jul 20 '18 at 12:48
  • If you're trying to just change the file (not create a new one) and for some reason can't use sed (as per the answer below), make sure to use `-i inplace` and not put the input file as the output file, which will erase the file! – Blayzeing Nov 26 '20 at 12:12

Not awk, but simpler:

tail -c +4 UTF8 > UTF8.nobom

To check for BOM:

hd -n 3 UTF8

If BOM is present you'll see: 00000000 ef bb bf ...
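Note that `tail -c +4` drops the first three bytes whether or not they are a BOM. A guarded variant (a sketch; filenames are placeholders) strips only when the check above actually finds one:

```shell
printf 'no bom here\n' > UTF8        # sample file without a BOM
if [ "$(head -c 3 UTF8 | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]; then
  tail -c +4 UTF8 > UTF8.nobom       # BOM present: strip the three BOM bytes
else
  cat UTF8 > UTF8.nobom              # no BOM: keep the content intact
fi
```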

Steve Clay
  • BOMs are 2 bytes for UTF-16 and 4 bytes for UTF-32, and of course have no business being in UTF-8 in the first place. – tchrist Aug 14 '11 at 23:33
  • @tchrist: from Wikipedia: "The Unicode Standard does permit the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8 so in UTF-8 the BOM serves only to **identify** a text stream or file as UTF-8." – Karoly Horvath Mar 17 '12 at 17:31
  • @KarolyHorvath Yes, precisely. Its use is not recommended. It breaks stuff. The encoding should be specified by a higher-level protocol. – tchrist Mar 17 '12 at 18:28
  • @tchrist: you mean it breaks broken stuff? :) proper apps should be able to handle that BOM. – Karoly Horvath Mar 17 '12 at 18:31
  • @KarolyHorvath I mean it *breaks **lots** of programs*. Isn't that what I said? When you open a stream in the UTF-16 or UTF-32 encodings, the decoder knows not to count the BOM. When you use UTF-8, decoders present the BOM as data. This is a syntax error in innumerable programs. [Even Java's decoder behaves this way, BY DESIGN!](http://bugs.sun.com/view_bug.do?bug_id=4508058) BOMs on UTF-8 files are misplaced and a pain in the butt: **they are an error!** They break many things. Even just `cat file1.utf8 file2.utf8 file3.utf8 > allfiles.utf8` will be broken. Never use a BOM on UTF-8. Period. – tchrist Mar 17 '12 at 18:51
  • @tchrist: that's what you said... and I said something else. BTW Java's decoder isn't broken by *design*, it's a *bug* and they kept it for backward compatibility. – Karoly Horvath Mar 17 '12 at 18:57
  • `hd` is not available on OS X (as of 10.8.2), so to check for a UTF-8 BOM there you can use the following: `head -c 3 file | od -t x1`. – mklement0 Oct 12 '12 at 22:43
  • `if [[ "`file a.txt | grep -o 'with BOM'`" == "BOM" ]];` can also be used – Benoit Duffez Oct 11 '14 at 14:34
  • `hexdump` and `xxd` should work in place of `hd`, if that's not available on your system. – Seldom 'Where's Monica' Needy May 15 '16 at 20:07

In addition to converting CRLF line endings to LF, dos2unix also removes BOMs:

dos2unix *.txt

dos2unix also converts UTF-16 files with a BOM (but not UTF-16 files without a BOM) to UTF-8 without a BOM:

$ printf '\ufeffä\n'|iconv -f utf-8 -t utf-16be>bom-utf16be
$ printf '\ufeffä\n'|iconv -f utf-8 -t utf-16le>bom-utf16le
$ printf '\ufeffä\n'>bom-utf8
$ printf 'ä\n'|iconv -f utf-8 -t utf-16be>utf16be
$ printf 'ä\n'|iconv -f utf-8 -t utf-16le>utf16le
$ printf 'ä\n'>utf8
$ for f in *;do printf '%11s %s\n' $f $(xxd -p $f);done
bom-utf16be feff00e4000a
bom-utf16le fffee4000a00
   bom-utf8 efbbbfc3a40a
    utf16be 00e4000a
    utf16le e4000a00
       utf8 c3a40a
$ dos2unix -q *
$ for f in *;do printf '%11s %s\n' $f $(xxd -p $f);done
bom-utf16be c3a40a
bom-utf16le c3a40a
   bom-utf8 c3a40a
    utf16be 00e4000a
    utf16le e4000a00
       utf8 c3a40a
Lri

I know the question was directed at Unix/Linux, but I thought it would be worth mentioning a good option for the unix-challenged (on Windows, with a UI).
I ran into the same issue on a WordPress project (the BOM was causing problems with the RSS feed and page validation), and I had to look through all the files in a fairly big directory tree to find the ones with a BOM. I found an application called Replace Pioneer, and in it:

Batch Runner -> Search (to find all the files in the subfolders) -> Replace Template -> Binary remove BOM (there is a ready made search and replace template for this).

It was not the most elegant solution, and it did require installing a program, which is a downside. But once I found out what was going on, it worked like a charm (and found 3 files out of about 2300 that had a BOM).

Arnon Zamir
  • I was so happy when I found your solution; however, I don't have the privilege to install software on the company computer. It took lots of time today until I figured out the alternative: using Notepad++ with the PythonScript plugin. http://superuser.com/questions/418515/how-to-find-all-files-in-directory-that-contain-utf-8-bom-byte-order-mark/914116#914116 Thanks anyway! – Hoàng Long May 13 '15 at 07:22