Elegant way to search for UTF-8 files with BOM?

Question

For debugging purposes, I need to recursively search a directory for all files which start with a UTF-8 byte order mark (BOM). My current solution is a simple shell script:

find -type f |
while read file
do
    if [ "`head -c 3 -- "$file"`" == $'\xef\xbb\xbf' ]
    then
        echo "found BOM in: $file"
    fi
done

Or, if you prefer short, unreadable one-liners:

find -type f|while read file;do [ "`head -c3 -- "$file"`" == $'\xef\xbb\xbf' ] && echo "found BOM in: $file";done

It doesn't work with filenames that contain a line break, but such files are not to be expected anyway.

Is there any shorter or more elegant solution?

Are there any interesting text editors or macros for text editors?

score 192 · Accepted Answer · edited Jan 08 '20 at 17:51

What about this simple command, which not only finds but also removes the nasty BOM? :)

find . -type f -exec sed '1s/^\xEF\xBB\xBF//' -i {} \;

I love "find" :)

Warning: the above also modifies any binary file that starts with those three bytes.
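A more conservative removal can check the first three bytes of each file before touching it. The following is only a sketch under assumptions: `strip_bom_tree` is a made-up name, and it relies on `head -c`, `tail -c +4`, `od` and `mktemp` being available, which is the case on common GNU and BSD systems.

```shell
# Sketch: strip a leading UTF-8 BOM from regular files below a directory.
# strip_bom_tree is a hypothetical name used only for this example.
strip_bom_tree() {
    find "${1:-.}" -type f -exec sh -c '
        for file do
            # Compare the first three bytes, rendered as hex, with EF BB BF.
            [ "$(head -c 3 -- "$file" | od -An -tx1 | tr -d " \n")" = "efbbbf" ] || continue
            tmp=$(mktemp) || exit 1
            # tail -c +4 copies everything from the fourth byte onward;
            # writing back with cat keeps the original permissions.
            tail -c +4 -- "$file" > "$tmp" && cat -- "$tmp" > "$file"
            rm -f -- "$tmp"
        done' sh {} +
}
```

Calling `strip_bom_tree .` would process the current directory. Because the check looks at bytes rather than lines, files that merely contain EF BB BF somewhere later are left alone.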

If you want just to show BOM files, use this one:

grep -rl $'\xEF\xBB\xBF' .
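Since that grep matches the three bytes anywhere in a file, a stricter listing might compare only the first three bytes and skip version-control directories. This is a rough sketch, not a drop-in replacement: `list_bom_files` is a hypothetical name, and pruning `.git` is an assumption about what you want to exclude.

```shell
# Sketch: list files whose first three bytes are EF BB BF, skipping .git.
# list_bom_files is a hypothetical name used only for this example.
list_bom_files() {
    # -prune stops find from descending into any directory named .git;
    # passing names via -exec ... {} + copes with spaces and line breaks.
    find "${1:-.}" -name .git -prune -o -type f -exec sh -c '
        for file do
            if [ "$(head -c 3 -- "$file" | od -An -tx1 | tr -d " \n")" = "efbbbf" ]; then
                printf "%s\n" "$file"
            fi
        done' sh {} +
}
```

Running `list_bom_files .` prints one path per line. It reads only three bytes per file, so large binary files cost almost nothing.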

edited Jan 08 '20 at 17:51

Agostino

answered May 18 '10 at 15:37

Denis

Incorrectly detects PDFs that contain a BOM marker; that's because it searches the whole document, not just the first line – Olivier Refalo Sep 23 '11 at 14:38
Or with ack: "ack '\xEF\xBB\xBF'" – Smar Mar 17 '12 at 01:46
Change the sed command to add a 1 before the leading 's' so it only applies to the first line – Ben Combee Jun 06 '12 at 04:07
The grep is finding many binary files as well; fix it using something like `egrep -rl $'^\xEF\xBB\xBF'`, though even this matches at the start of any line, not only the first line. – Evgeny Sep 12 '12 at 11:45
Use `grep -rlI $'\xEF\xBB\xBF' .` to ignore binary files. – dbernard Nov 05 '12 at 20:07
Detects and modifies JPG and other binary files, as already said. – Jehy Jan 28 '14 at 10:38
What's the point of having sed create a ".bak" file, just to remove it in the next "-exec"? I edited it to make sed simply fix the files in-place. See: https://stackoverflow.com/posts/2858757/revisions#revb6aca367-99db-4a9b-9b96-de857b9ad86a – vog Jun 17 '17 at 06:22
Is there a way to have it ignore the `.git` folder? – Aaron Franke Dec 13 '19 at 07:22

score 44 · Answer 2 · edited Mar 27 '17 at 12:17

The best and easiest way to do this on Windows:

Total Commander → go to project's root dir → find files (Alt + F7) → file types *.* → Find text "EF BB BF" → check 'Hex' checkbox → search

And you get the list :)

edited Mar 27 '17 at 12:17

doppelgreener

answered Sep 19 '11 at 23:06

Jan Przybylo

Nice, especially the use of my long-time favorite Total Commander, but unfortunately this suffers from the same issue as many others: it searches all bytes in a file, so many images etc. are reported. This can be slightly improved by using RegEx instead of Hex and searching for "^\xEF\xBB\xBF", which eliminates many images but still reports files that have the BOM halfway through the file (although there should be few) and of course any binary files that happen to have an ASCII newline character just before the BOM. Still, all images were gone in my test search. – Legolas Sep 08 '15 at 13:26

score 16 · Answer 3 · answered May 21 '10 at 19:22

find . -type f -print0 | xargs -0r awk '
    /^\xEF\xBB\xBF/ {print FILENAME}
    {nextfile}'

Most of the solutions given above test more than the first line of the file, even if some (such as Marcus's solution) then filter the results. This solution only tests the first line of each file so it should be a bit quicker.

answered May 21 '10 at 19:22

Aron Griffis

Got it working with the following on Linux (RHEL6) - `find . -type f -print0 | xargs -0 awk '/^\xEF\xBB\xBF/ {print FILENAME} {nextfile}'` – Olivier Refalo Sep 23 '11 at 14:37
How do I have to modify your code to fix these files after they were found? – Black Aug 05 '19 at 09:55

score 9 · Answer 4 · edited Oct 19 '08 at 03:52

If you accept some false positives (in case there are non-text files, or in the unlikely case there is a ZWNBSP in the middle of a file), you can use grep:

fgrep -rl `echo -ne '\xef\xbb\xbf'` .

edited Oct 19 '08 at 03:52

answered Oct 17 '08 at 11:55

CesarB

score 6 · Answer 5 · answered Jul 12 '13 at 21:16

You can use grep to find them and Perl to strip them out like so:

grep -rl $'\xEF\xBB\xBF' . | xargs perl -i -pe 's{\xEF\xBB\xBF}{}'

answered Jul 12 '13 at 21:16

theory

This one worked for me, the accepted answer didn't (I'm on a Mac) – mjsarfatti Mar 31 '16 at 15:18

score 5 · Answer 6 · answered Oct 17 '08 at 14:12

I would use something like:

grep -orHbm1 "^`echo -ne '\xef\xbb\xbf'`" . | sed '/:0:/!d;s/:0:.*//'

This ensures that the BOM starts at the first byte of the file.

answered Oct 17 '08 at 14:12

Marcus Griep

score 4 · Answer 7 · edited May 06 '15 at 19:51

For a Windows user, see this (good PHP script for finding the BOM in your project).

edited May 06 '15 at 19:51

Peter Mortensen

answered Nov 03 '11 at 09:34

julien

The linked website shows: "Website Offline, No Cached Version Available". – vog Jan 09 '12 at 12:57
same script is also available in github: http://github.com/emrahgunduz/BomCleaner – emrahgunduz Apr 16 '13 at 15:26
Thanks buddy, Your answer saved my day. – Krunal Panchal Sep 21 '15 at 08:09
And a BOM Finder: https://github.com/svn2github/wikia/blob/master/extensions/FCKeditor/fckeditor/_dev/php_bom_finder.php (in case someone doesn't like the 'automatic' cleaning, or just wants to find the files with BOM) – meloniq May 18 '16 at 16:43

score 3 · Answer 8 · answered Dec 21 '11 at 01:55

An overkill solution to this is phptags (not the vi tool with the same name), which specifically looks for PHP scripts:

phptags --warn ./

Will output something like:

./invalid.php: TRAILING whitespace ("?>\n")
./invalid.php: UTF-8 BOM alone ("\xEF\xBB\xBF")

And the --whitespace mode will automatically fix such issues (recursively, but asserts that it only rewrites .php scripts.)

score 3 · Answer 9 · edited May 06 '15 at 19:52

I used this to correct only JavaScript files:

find . -iname '*.js' -type f -exec sed '1s/^\xEF\xBB\xBF//' -i.bak {} \; -exec rm {}.bak \;

edited May 06 '15 at 19:52

Peter Mortensen

answered Apr 03 '12 at 09:05

LLub

score 2 · Answer 10 · answered Oct 17 '08 at 13:51

find -type f -print0 | xargs -0 grep -l `printf '^\xef\xbb\xbf'` | sed 's/^/found BOM in: /'

find -print0 puts a null \0 between each file name instead of using new lines
xargs -0 expects null separated arguments instead of line separated
grep -l lists the files which match the regex
The regex ^\xef\xbb\xbf isn't entirely correct, as it will also match non-BOM UTF-8 files that have a zero-width no-break space (U+FEFF) at the start of any line

answered Oct 17 '08 at 13:51

Jonathan Wright

You still need a "head 1" in the pipe before the grep – MSalters Oct 17 '08 at 14:08

score 1 · Answer 11 · answered Oct 16 '14 at 14:28

If you are looking for UTF files, the file command works. It will tell you what the encoding of the file is. If there are any non ASCII characters in there it will come up with UTF.

file *.php | grep UTF

That won't work recursively though. You can probably rig up some fancy command to make it recursive, but I just searched each level individually like the following, until I ran out of levels.

file */*.php | grep UTF
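One way to make this recursive is to let find hand the files to file. The sketch below rests on assumptions: `list_bom_with_file` is a made-up name, and filtering on "with BOM" (rather than "UTF") assumes the wording that common versions of file print for BOM-prefixed UTF-8 text, so check the output on your system first.

```shell
# Sketch: run "file" on every .php file below a directory and keep the BOM ones.
# list_bom_with_file is a hypothetical name used only for this example.
# The "with BOM" pattern is an assumption about the text file(1) prints.
list_bom_with_file() {
    find "${1:-.}" -type f -name '*.php' -exec file {} + | grep 'with BOM'
}
```

Called as `list_bom_with_file .`, it prints the matching lines of file's output. A path that itself contains the text "with BOM" would also match, so treat the result as a hint rather than proof.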
