grep and tail -f for a UTF-16 binary file - trying to use simple awk

Question

How can I achieve the equivalent of:

tail -f file.txt | grep 'regexp'

to only output the buffered lines that match a regular expression such as 'Result' from the file type:

$ file file.txt
file.txt: Little-endian UTF-16 Unicode text, with CRLF line terminators

Example of the tail -f stream content below converted to utf-8:

Package end.

Total warnings: 40
Total errors: 0
Elapsed time: 24.4267192 secs.
...Package Executed.

Result: Success

Awk?

The problems in piping to grep led me to awk as a one-stop-shop solution for stripping the offending characters and also producing matched lines from a regex.

awk seems to be giving the most promising results, however, I am finding that it returns the whole stream rather than individual matching lines:

tail -f file.txt | awk '{sub("/[^\x20-\x7F]/", "");/Result/;print}'
Package end.

Total warnings: 40
Total errors: 0
Elapsed time: 24.4267192 secs.
...Package Executed.

Result: Success

What I have tried

converting the stream and piping to grep

tail -f file.txt | iconv -t UTF-8 | grep 'regexp'

using luit to change terminal encoding as per this post

luit -encoding UTF-8 -- tail -f file.txt | grep 'regexp'

delete non ASCII characters, described here, then piping to grep

tail -f file.txt | tr -d '[^\x20-\x7F]' | grep 'regexp'
tail -f file.txt | sed 's/[^\x00-\x7F]//' | grep 'regexp'

various combinations of the above using grep flags --line-buffered, -a as well as sed -u
using luit -encoding UTF-8 -- pre-pended to the above
using a file with the same encoding containing the regular expression for grep -f

Why they failed

In most attempts, nothing is printed to the screen because grep searches for 'regexp' when the text is actually something like '\x00r\x00e\x00g\x00e\x00x\x00p'. For example, 'R' will return the line 'Result: Success' but 'Result' won't
If a full regular expression gets a match, such as in the case of using grep -f, it will return the whole stream and doesn't seem to just return the matched lines
piping through sed or tr or iconv seems to break the pipe to grep and grep seems to still only be able to match individual characters
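The first point can be reproduced on a small sample. The sketch below assumes a POSIX shell with `iconv` available; the temporary file and its single line are made up for illustration and stand in for file.txt.

```shell
#!/bin/sh
# Hypothetical sample: one line of the log, encoded as UTF-16LE with CRLF.
sample=$(mktemp)
printf 'Result: Success\r\n' | iconv -f UTF-8 -t UTF-16LE > "$sample"

# Every ASCII character is followed by a NUL byte in this encoding.
od -c "$sample"

# The contiguous byte sequence "Result" never occurs, so grep counts 0 lines.
# (grep exits non-zero when nothing matches, hence the "|| true".)
raw_matches=$(grep -a -c 'Result' "$sample" || true)
echo "raw matches: $raw_matches"

# Once the bytes are decoded to UTF-8, the same pattern matches.
decoded=$(iconv -f UTF-16LE -t UTF-8 "$sample" | tr -d '\r' | grep 'Result')
echo "decoded: $decoded"

rm -f "$sample"
```

This only shows the encoding mismatch on a static file; it says nothing about how the same tools buffer when fed from `tail -f`.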

Edit

I looked at the raw file in its UTF-16 format using xxd with the aim of using a regex to match the encoding, which gave the following output:

$ tail file.txt | xxd
00000000: 0050 0061 0063 006b 0061 0067 0065 0020  .P.a.c.k.a.g.e.
00000010: 0065 006e 0064 002e 000d 000a 000d 000a  .e.n.d..........
00000020: 0054 006f 0074 0061 006c 0020 0077 0061  .T.o.t.a.l. .w.a
00000030: 0072 006e 0069 006e 0067 0073 003a 0020  .r.n.i.n.g.s.:.
00000040: 0034 0030 000d 000a 0054 006f 0074 0061  .4.0.....T.o.t.a
00000050: 006c 0020 0065 0072 0072 006f 0072 0073  .l. .e.r.r.o.r.s
00000060: 003a 0020 0030 000d 000a 0045 006c 0061  .:. .0.....E.l.a
00000070: 0070 0073 0065 0064 0020 0074 0069 006d  .p.s.e.d. .t.i.m
00000080: 0065 003a 0020 0032 0034 002e 0034 0032  .e.:. .2.4...4.2
00000090: 0036 0037 0031 0039 0032 0020 0073 0065  .6.7.1.9.2. .s.e
000000a0: 0063 0073 002e 000d 000a 002e 002e 002e  .c.s............
000000b0: 0050 0061 0063 006b 0061 0067 0065 0020  .P.a.c.k.a.g.e.
000000c0: 0045 0078 0065 0063 0075 0074 0065 0064  .E.x.e.c.u.t.e.d
000000d0: 002e 000d 000a 000d 000a 0052 0065 0073  ...........R.e.s
000000e0: 0075 006c 0074 003a 0020 0053 0075 0063  .u.l.t.:. .S.u.c
000000f0: 0063 0065 0073 0073 000d 000a 000d 000a  .c.e.s.s........
00000100: 00

Please consider trying your pieces individually to make sure they do what you think they do before putting them together. Many of these are not related to UTF-16 at all, but basic tool usage. For example, `printf "foo\nbar\n" | awk '{/foo/;print}'` will show that you're not using `awk` right, and `printf '\x40\x00' | iconv -t UTF-8 | od -t x1` for `iconv`, and `echo hello | tr -d '[^\x20-\x7F]'` for `tr` – that other guy, Jun 23 '15 at 22:52
That is true, but from my reading I am aware that [some](http://unix.stackexchange.com/a/35695) of these tools can't be used properly for piping `tail -f` in `utf-16`. I just wanted to include them to show the lengths I have gone to when investigating this issue. If I am indeed using them incorrectly and they are appropriate, an advanced user of these tools should be able to give the desired output with a simple tweak, as I have done all the groundwork already – Alexander McFarlane, Jun 23 '15 at 22:56
I believe I may have overlooked the simplest solution, which is a regex that matches anything between the specified characters, e.g. `'R.e.s.u.l.t'` or `tail -f file.txt | grep -a 'R.e.s.u.l.t'` – Alexander McFarlane, Jun 23 '15 at 23:24
related: http://stackoverflow.com/questions/3752913/grepping-binary-files-and-utf16 – golimar, May 07 '17 at 09:09

score 1 · Answer 1 · answered Jun 23 '15 at 23:15

The sloppiest solution that should work on Cygwin is fixing your awk statement:

tail -f file.txt | \
    LC_CTYPE=C awk '{ gsub("[^[:print:]]", ""); if($0 ~ /Result/) print; }'

This has a few bugs that cancel each other out, like tail cutting a UTF-16LE file in awkward places but awk stripping what we hope is garbage.
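The awk part of the fix can be checked without any UTF-16 input. In the question's program, `/Result/;` is an expression statement whose result is discarded, so the following `print` runs for every line. The sketch below uses made-up two-line input to contrast that with the two usual ways of filtering. Separately, the question's `sub()` call quotes the regex together with its slashes, which makes the slashes part of the pattern, and `sub` replaces only the first match where `gsub` replaces all of them.

```shell
#!/bin/sh
# A bare /regex/ inside an action is evaluated and thrown away,
# so print still runs for every input line.
unfiltered=$(printf 'foo\nbar\n' | awk '{ /foo/; print }')
echo "unfiltered:"
echo "$unfiltered"

# Using the regex as the pattern restricts the action to matching lines.
by_pattern=$(printf 'foo\nbar\n' | awk '/foo/ { print }')
echo "by pattern: $by_pattern"

# Equivalent form used in the answer: test $0 (the whole line) explicitly.
by_if=$(printf 'foo\nbar\n' | awk '{ if ($0 ~ /foo/) print }')
echo "by if: $by_if"
```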

A robust solution might be:

tail -c +1 -f file.txt | \
    script -qc 'iconv -f UTF-16LE -t UTF-8' /dev/null | grep Result

but it reads the entire file and I don't know how well Cygwin works with using script to convince iconv not to buffer (it would work on GNU/Linux).
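The reason for `tail -c +1` can be seen on a static sample. Line-based `tail` splits on the single byte 0x0a, which in UTF-16LE is only the first half of the newline character, so its output starts with a stray 0x00 (the same leading `00` visible in the xxd dump in the question). The sketch below assumes `iconv` and `od` are available, uses a made-up two-line file in place of file.txt, and omits `-f` so that it terminates. It does not address buffering.

```shell
#!/bin/sh
# Hypothetical two-line sample standing in for file.txt (UTF-16LE, CRLF).
sample=$(mktemp)
printf 'Package end.\r\nResult: Success\r\n' | iconv -f UTF-8 -t UTF-16LE > "$sample"

# Line-based tail cuts right after a 0x0a byte, so the output begins with
# the 0x00 that belongs to the previous line's newline character.
first_byte=$(tail -n 2 "$sample" | od -An -tx1 | awk '{ print $1; exit }')
echo "first byte after tail -n 2: $first_byte"

# Byte-based tail starting at byte 1 keeps the 2-byte units aligned, so
# iconv can decode the stream and grep sees ordinary text.
aligned=$(tail -c +1 "$sample" | iconv -f UTF-16LE -t UTF-8 | tr -d '\r' | grep 'Result')
echo "aligned and decoded: $aligned"

rm -f "$sample"
```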

that other guy

Works well in this specific instance. However, wouldn't `if($0 ~ /Result/) print` fail if I were to match something that didn't fall into the first column created by `awk`? Hence, it wouldn't work as a generic search of the full line – Alexander McFarlane Jun 23 '15 at 23:22
@alexmcf `$0` is the full line – that other guy Jun 23 '15 at 23:50

score 0 · Accepted Answer · edited May 23 '17 at 10:31

I realised a simple regex to ignore any characters between letters in the search string might work...

This matches 'Result' whilst allowing any one character between each letter...

$ tail -f file.txt | grep -a 'R.e.s.u.l.t'
Result: Success

$ tail -f file.txt | awk '/R.e.s.u.l.t./'
Result: Success

Or, as per this answer, to avoid typing all the tedious dots:

search="Result"
tail -f file.txt | grep -a -e "$(echo "$search" | sed 's/./&./g')"
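To see what that substitution produces, the pattern can be built on its own before handing it to grep. A minimal sketch, assuming the search string contains no regex metacharacters (they are not escaped here):

```shell
#!/bin/sh
search="Result"

# sed appends a "." after every character, so each letter may be
# followed by any single character (the interleaved NUL byte in UTF-16LE).
pattern=$(printf '%s\n' "$search" | sed 's/./&./g')
echo "$pattern"
```

The trailing dot means the pattern also requires one more character after the final letter, as in the awk variant above.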

answered Jun 23 '15 at 23:28

Alexander McFarlane

If your file is large, mostly ascii and you just want to visually inspect the output in a VT110-ish terminal, this might be just as good and more readable. You'd have your original problem again if you try to further pipe or redirect it though. – that other guy Jun 24 '15 at 00:02
yeah I'm just outputting to an additional screen with a beep on a new event – Alexander McFarlane Jun 24 '15 at 00:09

score 0 · Answer 3 · answered Oct 11 '22 at 08:13

You can use ripgrep instead, which handles UTF-16 nicely without having to convert your input:

tail -f file.txt | rg regexp

Cyril Chaboisseau