36

I have a file in UTF-8 encoding with BOM and want to remove the BOM. Are there any linux command-line tools to remove the BOM from the file?

$ file test.xml
test.xml:  XML 1.0 document, UTF-8 Unicode (with BOM) text, with very long lines
m13r

6 Answers

51

Using VIM

  1. Open file in VIM:

     vi text.xml
    
  2. Remove BOM encoding:

     :set nobomb
    
  3. Save and quit:

     :wq
    

For a non-interactive solution, try the following command line:

vi -c ":set nobomb" -c ":wq" text.xml

That should remove the BOM, save the file and quit, all from the command line.

Joshua Pinter
  • 2
    Is there a way to make vim do that non-interactively? OP asked for a "command line" solution. – Doktor J Aug 22 '18 at 15:35
  • @DoktorJ If you find out, make sure to post a comment back here for others. – Joshua Pinter Aug 23 '18 at 16:26
  • 1
    It didn't work at all for me, because vim was expecting the wrong encoding. Fix is here: https://stackoverflow.com/questions/16507777/set-encoding-and-fileencoding-to-utf-8-in-vim (Perhaps only the file encoding needs to be set, but I set both, and I did it in my .vimrc) – Mike Maxwell Jan 24 '19 at 18:15
  • 1
    @DoktorJ Try using `-c` flag in `vi`, like: `vi -c ":set nobomb" -c ":wq" text.xml`. That should remove the BOM, save the file and quit all from the command line. – Joshua Pinter Aug 01 '20 at 15:32
38

A BOM is Unicode codepoint U+FEFF; the UTF-8 encoding consists of the three hex values 0xEF, 0xBB, 0xBF.
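
To check whether a file really starts with those three bytes, one option is to dump its first three bytes as hex and compare. This is only a minimal sketch: `has_utf8_bom` is a made-up helper name, and it assumes your `head` supports `-c` (GNU and BSD `head` both do).

```shell
# has_utf8_bom is a hypothetical helper name, not a standard tool.
# Exit status 0 if the file given as $1 starts with the bytes EF BB BF.
has_utf8_bom() {
    # od prints the bytes as hex; tr strips the spaces and the newline
    [ "$(head -c 3 < "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}
```

For example, `has_utf8_bom test.xml && echo "BOM found"` prints the message only when the BOM is present.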

With bash, you can create a UTF-8 BOM with the $'' special quoting form, which implements Unicode escapes: $'\uFEFF'. So with bash, a reliable way of removing a UTF-8 BOM from the beginning of a text file would be:

sed -i $'1s/^\uFEFF//' file.txt

This will leave the file unchanged if it does not start with a UTF-8 BOM, and otherwise remove the BOM.

If you are using some other shell, you might find that "$(printf '\ufeff')" produces the BOM character (that works with zsh, as well as with any shell without a printf builtin, provided that /usr/bin/printf is the GNU version), but if you want a POSIX-compatible version you could use:

sed "$(printf '1s/^\357\273\277//')" file.txt

(The -i in-place edit flag is also a GNU extension; this version writes the possibly modified file to stdout.)
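
If you need the file itself changed without `sed -i`, one approach is to write the output to a temporary file and copy it back. The following is a sketch under assumptions: `strip_bom_posix` is a made-up name, and `mktemp` is not part of POSIX, although it is available on most Linux and BSD systems.

```shell
# strip_bom_posix is a hypothetical helper name used only for this sketch.
# Rewrites the file given as $1 without a leading UTF-8 BOM, if it has one.
strip_bom_posix() {
    tmp=$(mktemp) || return 1
    # printf expands the octal escapes into the three BOM bytes;
    # LC_ALL=C makes sed treat the input as plain bytes.
    if LC_ALL=C sed "$(printf '1s/^\357\273\277//')" "$1" > "$tmp"; then
        cat "$tmp" > "$1"   # overwrite the contents, keeping the file's permissions
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

It would be called as `strip_bom_posix file.txt`. A file without a BOM comes out with the same contents.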

rici
  • creating the BOM with bash on windows didn't seem to work, this did https://stackoverflow.com/a/3622153/1507124 – CervEd Sep 13 '21 at 10:53
  • Funny, my source file came from a Windows environment, so I was scratching my head over why bash read "${line:0:1}" was returning an empty character, until I used "od -c filename.txt" to check whether it has the BOM "357 273 277", and it did. – Nik Mar 08 '23 at 08:26
18

Well, just dealt with this today and my preferred way was dos2unix:

dos2unix will remove the BOM and also take care of other idiosyncrasies from other OSes:

$ sudo apt install dos2unix
$ dos2unix test.xml

It's also possible to remove BOM only (-r, --remove-bom):

$ dos2unix -r test.xml

Note: tested with dos2unix 7.3.4

Reginaldo Santos
  • A SuSE user reports that their version of `dos2unix` does not do this. Not sure what the version number might be or if they are even from the same source. – tripleee Jan 12 '21 at 07:31
  • 1
    (I tried to repro just now with the `opensuse/archive` Docker image from https://hub.docker.com/r/opensuse/archive but that gives me `dos2unix` 7.3.4 which works just as advertised in this answer. Probably the OpenSUSE version I got was not ancient enough. `head -n 2 /etc/os-release` gets me `NAME="openSUSE Leap" VERSION="42.3"`) – tripleee Sep 09 '21 at 05:35
  • @tripleee : well .. I'm glad you made it and also glad I posted a very specific version in my answer :-) – Reginaldo Santos Sep 09 '21 at 15:45
  • 1
    The above answer (using sed) doesn't work on ZSH because the need for inline. This worked on a folder of 760 files in one command thank you. – Michael Brown Aug 11 '23 at 22:34
5

If (and only if) you are certain that a given file starts with a UTF-8 BOM, which is three bytes long, you can remove it with the tail command:

tail --bytes=+4 withBOM.txt > withoutBOM.txt
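
If you are not certain, a guarded variant can check the first three bytes before skipping them. This is a rough sketch rather than a tested tool: `strip_bom_tail` is a made-up name, it assumes `head -c` is available, and it handles only the UTF-8 BOM (EF BB BF).

```shell
# strip_bom_tail is a hypothetical helper name used only for this sketch.
# Writes the file given as $1 to stdout, minus a leading UTF-8 BOM if present.
strip_bom_tail() {
    if [ "$(head -c 3 < "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]; then
        tail -c +4 "$1"     # start output at byte 4, skipping the BOM
    else
        cat "$1"            # no BOM: pass the file through unchanged
    fi
}
```

Usage mirrors the command above: `strip_bom_tail withBOM.txt > withoutBOM.txt`.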
Tim Barrass
  • 11
    This should only be done if you *know* that the file begins with a UTF-8 encoded BOM. It's not a good idea as a general suggestion, because if the file did not begin with a BOM, or if the file was UTF-16 (or any other encoding) this will destroy the first few meaningful characters of the data. – rmunn Nov 07 '17 at 08:50
1

To process many files at once, building on Reginaldo Santos's answer, there is a quick way:

find . -name "*.java" -exec dos2unix {} +
0

Joshua Pinter's answer works correctly on macOS, so I wrote a script that removes the BOM from all files in a given folder, see here.

It can be used as follows:

Remove BOM from all files in current directory: rmbom .

Print all files with a BOM in the current directory: rmbom . -a

Only remove BOM from all files in current directory with extension txt or cs: rmbom . -e txt -e cs

Ludvig W