
I want to do this:

 findstr /s /c:some-symbol *

or the grep equivalent

 grep -R some-symbol *

but I need the utility to autodetect files encoded in UTF-16 (and friends) and search them appropriately. My files even have the byte-order mark FFFE in them, so I'm not even looking for heroic autodetection.

Any suggestions?


I'm referring to Windows Vista and XP.

kenorb
David Martin

7 Answers


A workaround is to convert your UTF-16 file to ASCII or ANSI:

TYPE UTF-16.txt > ASCII.txt

This works because TYPE recognizes the UTF-16 byte order mark and writes the redirected output in the console's current code page. Then you can use FINDSTR:

FINDSTR object ASCII.txt
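On Unix-like systems, iconv performs the equivalent conversion. A minimal sketch of the same trick (the file names and search term here are illustrative; the first line just fabricates a sample UTF-16 input):

```shell
# Create a sample UTF-16 file (stand-in for a real one), then convert it
# to UTF-8 so ordinary text tools can search it.
printf 'an object here\n' | iconv -f UTF-8 -t UTF-16 > UTF-16.txt
iconv -f UTF-16 -t UTF-8 UTF-16.txt > ASCII.txt
grep object ASCII.txt
```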
PollusB

Thanks for the suggestions. I was referring to Windows Vista and XP.

I also discovered this workaround, using the free Sysinternals strings.exe:

 C:\> strings -s -b dir_tree_to_search | grep regexp

Strings.exe extracts all of the strings it finds (it is intended for binaries, but it works fine with text files too) and prepends each result with the file name and a colon, so take that into account in the regexp (or use cut or another step in the pipeline). The -s switch makes the extraction recursive, and -b suppresses the banner message.
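For instance, the filename-and-colon prefix can be stripped with a cut step like this (the echoed line stands in for real strings.exe output; note that the space after the colon survives):

```shell
# strings.exe output looks like "file.txt: matched text";
# keep everything after the first colon.
echo 'file.txt: some-symbol here' | cut -d: -f2-
```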

Ultimately I'm still somewhat surprised that the flagship search utilities GNU grep and findstr don't handle Unicode character encodings natively.

phuclv
David Martin
  • On their home unix environments, UTF-16 is much less common, and files are generally in UTF-8, which they handle just fine. – bdonlan May 17 '09 at 21:13
  • Maybe not so great for extracting the whole line, but perfect for trying to find all files containing a string (which I'm trying to do). Thanks. – Kevin Shea Oct 27 '15 at 12:01
findstr /s /c:some-symbol *

can be replaced with the following encoding-aware command:

for /r %f in (*) do @find /i /n "some-symbol" "%f"

(Inside a batch file, double the percent signs: %%f.)
Shameer
    If you add Venkateshwar's answer below, you get: for /r %f in (*) do @find /i /n "some-symbol" "%f" | findstr /i "some-symbol" which will filter out the file names. I found this useful when searching a set of files for "Fail". I didn't care which file it appeared in, I just wanted to see if any file had "Fail" in it. – Eli Nov 04 '13 at 15:36

On Windows, you can also use find.exe.

find /i /n "YourSearchString" *.*

The only problem is that this prints file names followed by the matches. You can filter those out by piping to findstr:

find /i /n "YourSearchString" *.* | findstr /i "YourSearchString"
vent
  • Unfortunately, the find command doesn't support matching patterns (wildcards / regular expressions) the way findstr does. – Mister_Tom May 10 '16 at 17:55

According to this blog article by Damon Cortesi, grep doesn't work with UTF-16 files, as you found out. However, it presents this workaround:

for f in `find . -type f | xargs -I {} file {} | grep UTF-16 | cut -f1 -d:`
do
        iconv -f UTF-16 -t UTF-8 "$f" | grep -iH --label="$f" "${GREP_FOR}"
done

This is obviously for Unix; I'm not sure what the equivalent on Windows would be. The author of that article also provides a shell script to do the above, available on GitHub.

This only greps files that are UTF-16. You'd also grep your ASCII files the normal way.
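A self-contained sketch of the same detect-then-convert idea (the sample file and search string are illustrative; it assumes `file` labels the encoding as "UTF-16", which it does for BOM-marked files):

```shell
# Set up a sample UTF-16 file to search (illustrative only).
printf 'hello some-symbol world\n' | iconv -f UTF-8 -t UTF-16 > sample.txt

# If `file` reports UTF-16, convert on the fly and grep the result.
if file sample.txt | grep -q 'UTF-16'; then
    iconv -f UTF-16 -t UTF-8 sample.txt | grep -H --label=sample.txt 'some-symbol'
fi
```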

Mark A. Nicolosi

In newer versions of Windows, UTF-16 is supported out of the box. If not, try changing the active code page with the chcp command.

In my case, findstr alone was failing for UTF-16 files, but it worked when combined with type:

type *.* | findstr /s /c:some-symbol

(The /s switch has no effect here, since findstr is reading piped input rather than searching files itself.)
kenorb

You didn't say which platform you want to do this on.

On Windows, you could use PowerGREP, which automatically detects Unicode files that start with a byte order mark. (There's also an option to auto-detect files without a BOM. The auto-detection is very reliable for UTF-8, but limited for UTF-16.)

Jan Goyvaerts