I needed to do this recursively, and here's what I came up with:
find -type f | while read l; do iconv -s -f utf-16le -t utf-8 "$l" | nl -s "$l: " | cut -c7- | grep 'somestring'; done
This is absolutely horrible and very slow; I'm certain there's a better way and I hope someone can improve on it -- but I was in a hurry :P
What the pieces do:
find -type f
gives a recursive list of filenames, with paths relative to the current directory
while read l; do ... done
Bash loop; for each line of the list of file paths, put the path into $l
and do the thing in the loop. (Why I used a shell loop instead of xargs, which would've been much faster: I need to prefix each line of the output with the name of the current file. I couldn't think of a way to do that while feeding multiple files at once to iconv, and since I'm doing one file at a time anyway, a shell loop has simpler syntax and escaping.)
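A note on robustness: while read l trips over filenames with leading whitespace or backslashes, and breaks outright on names containing newlines. If that matters, a sturdier skeleton (just a sketch, assuming GNU or BSD find and bash) is:

find . -type f -print0 | while IFS= read -r -d '' f; do
  # -print0 and read -d '' pass names NUL-delimited, so any filename survives
  iconv -s -f utf-16le -t utf-8 "$f" | nl -s "$f: " | cut -c7- | grep 'somestring'
done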
iconv -s -f utf-16le -t utf-8 "$l"
Convert the file named in $l: assume the input file is UTF-16 little-endian and convert it to UTF-8. The -s makes iconv shut up about any conversion errors (there will be a lot, because some files in this directory structure are not UTF-16). The output from this conversion goes to stdout.
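If you want to see what this conversion step does on its own, run it against a single file (the filename here is just a placeholder):

iconv -s -f utf-16le -t utf-8 some-utf16-file.txt | head

If the file really is UTF-16LE you get readable text on stdout; if it isn't, you mostly get garbage, and -s keeps iconv from complaining about it.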
nl -s "$l: " | cut -c7-
This is a hack: nl inserts line numbers, but it happens to have a "use this arbitrary string to separate the number from the line" parameter, so I put the filename (followed by a colon and a space) in that. Then I use cut to strip off the line number, leaving just the filename prefix. (Why I didn't use sed: escaping is much easier this way. With a sed expression I'd have to worry about regular-expression metacharacters in the filenames, of which there were a lot in my case. nl is much dumber than sed, and just takes the -s parameter entirely literally, and the shell handles the escaping for me.)
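You can watch the prefixing trick work on a throwaway input (the filename here is made up, purely for illustration):

printf 'hello\nworld\n' | nl -s 'file.txt: ' | cut -c7-

which prints:

file.txt: hello
file.txt: world

nl pads the line number to six characters by default, which is why cut -c7- removes exactly the number and leaves the separator string behind.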
So, by the end of this pipeline, I've converted a bunch of files into lines of UTF-8, each prefixed with the filename, which I then grep. If there are matches, I can tell which file they're in from the prefix.
Caveats
- This is much, much slower than grep -R, because I'm spawning a new copy of iconv, nl, cut, and grep for every single file. It's horrible. (A slightly leaner variant is sketched below, after these caveats.)
- Everything that isn't utf-16le input will come out as complete garbage, so if there's a normal ASCII file that contains 'somestring', this command won't report it -- you need to do a normal grep -R as well as this command (and if you have multiple Unicode encodings, like some big-endian and some little-endian files, you need to adjust this command and run it again for each encoding).
- Files whose name happens to contain 'somestring' will show up in the output, even if their contents have no matches.
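For what it's worth, a slightly leaner variant (a sketch only; it still spawns processes per file and still only handles utf-16le) is to let grep label stdin with the filename, which drops the nl/cut hack entirely:

find -type f | while read l; do
  # --label names standard input in the output; -H forces the name to be printed
  iconv -s -f utf-16le -t utf-8 "$l" | grep -H --label="$l" 'somestring'
done

Both --label and -H are GNU grep options, so check your grep if you're on something else. As a side benefit, the filename is no longer part of the text being searched, so the last caveat above goes away.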