
How can I achieve the equivalent of:

tail -f file.txt | grep 'regexp'

to only output the buffered lines that match a regular expression such as 'Result' from the file type:

$ file file.txt
file.txt: Little-endian UTF-16 Unicode text, with CRLF line terminators

Example of the tail -f stream content below converted to utf-8:

Package end.

Total warnings: 40
Total errors: 0
Elapsed time: 24.4267192 secs.
...Package Executed.

Result: Success

Awk?

The problems with piping to grep led me to awk as a one-stop-shop solution for both stripping the offending characters and printing the lines that match a regex.

awk seems to be giving the most promising results, however, I am finding that it returns the whole stream rather than individual matching lines:

tail -f file.txt | awk '{sub("/[^\x20-\x7F]/", "");/Result/;print}'
Package end.

Total warnings: 40
Total errors: 0
Elapsed time: 24.4267192 secs.
...Package Executed.

Result: Success

What I have tried

  • converting the stream and piping to grep

    tail -f file.txt | iconv -t UTF-8 | grep 'regexp'
    
  • using luit to change terminal encoding as per this post

    luit -encoding UTF-8 -- tail -f file.txt | grep 'regexp'
    
  • delete non ASCII characters, described here, then piping to grep

    tail -f file.txt | tr -d '[^\x20-\x7F]' | grep 'regexp'
    tail -f file.txt | sed 's/[^\x00-\x7F]//' | grep 'regexp'
    
  • various combinations of the above using grep flags --line-buffered, -a as well as sed -u

  • using luit -encoding UTF-8 -- pre-pended to the above
  • using a file with the same encoding containing the regular expression for grep -f

Why they failed

  • In most attempts nothing is printed to the screen at all, because grep searches for 'regexp' when the text is actually something like '\x00r\x00e\x00g\x00e\x00x\x00p'. For example, 'R' returns the line 'Result: Success' but 'Result' does not
  • If a full regular expression does get a match, as in the case of grep -f, the whole stream is returned rather than just the matched lines
  • Piping through sed, tr or iconv seems to break the pipe to grep, and grep still only seems able to match individual characters
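
The first point can be reproduced on a finite sample, without tail -f. This is only a minimal sketch: the `sample` function and its two lines of text are made up for illustration, and it assumes an iconv that accepts the encoding names UTF-8 and UTF-16LE.

```shell
# Hypothetical sample: two CRLF-terminated lines encoded as UTF-16LE.
sample() {
    printf 'Total errors: 0\r\nResult: Success\r\n' | iconv -f UTF-8 -t UTF-16LE
}

# Raw stream: 'Result' is stored as R\0e\0s\0u\0l\0t\0, so no line matches.
# grep exits non-zero when nothing matches, hence the "|| true".
raw_count=$(sample | grep -a -c 'Result' || true)

# Decoded stream: the same pattern now matches one line.
decoded_count=$(sample | iconv -f UTF-16LE -t UTF-8 | grep -c 'Result' || true)

echo "raw=$raw_count decoded=$decoded_count"
```

Note that the decoding step names the source encoding with -f. Without it, iconv assumes the input is in the current locale's encoding.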

Edit

I looked at the raw file in its utf-16 format using xxd, with the aim of using a regex to match the encoding, which gave the following output:

$ tail file.txt | xxd
00000000: 0050 0061 0063 006b 0061 0067 0065 0020  .P.a.c.k.a.g.e.
00000010: 0065 006e 0064 002e 000d 000a 000d 000a  .e.n.d..........
00000020: 0054 006f 0074 0061 006c 0020 0077 0061  .T.o.t.a.l. .w.a
00000030: 0072 006e 0069 006e 0067 0073 003a 0020  .r.n.i.n.g.s.:.
00000040: 0034 0030 000d 000a 0054 006f 0074 0061  .4.0.....T.o.t.a
00000050: 006c 0020 0065 0072 0072 006f 0072 0073  .l. .e.r.r.o.r.s
00000060: 003a 0020 0030 000d 000a 0045 006c 0061  .:. .0.....E.l.a
00000070: 0070 0073 0065 0064 0020 0074 0069 006d  .p.s.e.d. .t.i.m
00000080: 0065 003a 0020 0032 0034 002e 0034 0032  .e.:. .2.4...4.2
00000090: 0036 0037 0031 0039 0032 0020 0073 0065  .6.7.1.9.2. .s.e
000000a0: 0063 0073 002e 000d 000a 002e 002e 002e  .c.s............
000000b0: 0050 0061 0063 006b 0061 0067 0065 0020  .P.a.c.k.a.g.e.
000000c0: 0045 0078 0065 0063 0075 0074 0065 0064  .E.x.e.c.u.t.e.d
000000d0: 002e 000d 000a 000d 000a 0052 0065 0073  ...........R.e.s
000000e0: 0075 006c 0074 003a 0020 0053 0075 0063  .u.l.t.:. .S.u.c
000000f0: 0063 0065 0073 0073 000d 000a 000d 000a  .c.e.s.s........
00000100: 00
Alexander McFarlane
  • Please consider trying your pieces individually to make sure they do what you think they do before putting them together. Many of these are not related to UTF-16 at all, but basic tool usage. For example, `printf "foo\nbar\n" | awk '{/foo/;print}'` will show that you're not using `awk` right, and `printf '\x40\x00' | iconv -t UTF-8 | od -t x1` for `iconv`, and `echo hello | tr -d '[^\x20-\x7F]'` for `tr` – that other guy Jun 23 '15 at 22:52
  • That is true, but from my reading I am aware that [some](http://unix.stackexchange.com/a/35695) of these tools can't be used properly for piping `tail -f` in `utf-16`. I just wanted to include them to show the lengths I have gone to when investigating this issue. If I am indeed using them incorrectly and they are appropriate, an advanced user of these tools would be able to give the desired output with a simple tweak, as I have done all the groundwork already – Alexander McFarlane Jun 23 '15 at 22:56
  • All of them can be used for piping `tail -f` – that other guy Jun 23 '15 at 22:57
  • Which OS are you trying to do this on? – that other guy Jun 23 '15 at 23:02
  • I believe I may have overlooked the simplest solution, which is a regex that matches anything between the specified characters... e.g. `'R.e.s.u.l.t'` or `tail -f file.txt | grep -a 'R.e.s.u.l.t'` – Alexander McFarlane Jun 23 '15 at 23:24
  • related: http://stackoverflow.com/questions/3752913/grepping-binary-files-and-utf16 – golimar May 07 '17 at 09:09

3 Answers

1

The sloppiest solution that should work on Cygwin is fixing your awk statement:

tail -f file.txt | \
    LC_CTYPE=C awk '{ gsub("[^[:print:]]", ""); if($0 ~ /Result/) print; }'

This has a few bugs that cancel each other out, like tail cutting a UTF-16LE file in awkward places but awk stripping what we hope is garbage.
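
The reason the original awk command printed the whole stream has nothing to do with UTF-16 and can be seen on plain ASCII input. The sketch below uses two hypothetical helper functions, `broken` and `fixed`, purely for illustration.

```shell
# Inside an action block, a bare /foo/ is just an expression whose value
# is thrown away, so the following print runs for every input line.
broken() { awk '{ /foo/; print }'; }

# Used as a pattern in front of the block, /foo/ selects which lines
# the action applies to, so only matching lines are printed.
fixed() { awk '/foo/ { print }'; }

printf 'foo\nbar\n' | broken   # prints both lines
printf 'foo\nbar\n' | fixed    # prints only the line containing foo
```

The `if($0 ~ /Result/) print` form used in this answer is equivalent to the pattern form; it just keeps the test inside the block, after gsub has cleaned the line.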

A robust solution might be:

tail -c +1 -f file.txt | \
    script -qc 'iconv -f UTF-16LE -t UTF-8' /dev/null | grep Result

but it reads the entire file and I don't know how well Cygwin works with using script to convince iconv not to buffer (it would work on GNU/Linux).
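
Why the starting offset matters can be sketched on a finite sample. UTF-16 uses two bytes per code unit, so a stream that starts one byte off decodes into unrelated characters. The `sample` function and its text are assumptions made up for this illustration; buffering behaviour with a live tail -f is not covered here.

```shell
# Hypothetical sample: two CRLF-terminated lines encoded as UTF-16LE.
sample() {
    printf 'Total errors: 0\r\nResult: Success\r\n' | iconv -f UTF-8 -t UTF-16LE
}

# Starting at byte 1 keeps every two-byte code unit intact.
aligned=$(sample | tail -c +1 | iconv -f UTF-16LE -t UTF-8 | grep -c 'Result' || true)

# Starting at byte 2 pairs each high byte with the next character's low byte,
# so the decoded text no longer contains 'Result'. iconv may also complain
# about the odd trailing byte, hence the stderr redirect.
shifted=$(sample | tail -c +2 | iconv -f UTF-16LE -t UTF-8 2>/dev/null | grep -c 'Result' || true)

echo "aligned=$aligned shifted=$shifted"
```

A plain tail -f starts at a position chosen by counting newline bytes, which is not guaranteed to fall on a code unit boundary; that is the "awkward places" problem mentioned above.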

that other guy
  • Works well in this specific instance. However, wouldn't `if($0 ~ /Result/) print` fail if I were to match something that didn't fall into the first column created by `awk`? If so, it wouldn't work as a generic search of the full line – Alexander McFarlane Jun 23 '15 at 23:22
    @alexmcf `$0` is the full line – that other guy Jun 23 '15 at 23:50
0

I realised a simple regex to ignore any characters between letters in the search string might work...

This matches 'Result' whilst allowing any one character between each letter...

$ tail -f file.txt | grep -a 'R.e.s.u.l.t'
Result: Success

$ tail -f file.txt | awk '/R.e.s.u.l.t./'
Result: Success

or as per this answer: to avoid typing all the tedious dots...

search="Result"
tail -f file.txt | grep -a -e "$(echo "$search" | sed 's/./&./g')"
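
What the sed substitution produces can be checked on its own. This is a small sketch; `dotted_pattern` is a hypothetical helper name, not part of any tool.

```shell
# Append a "." after every character of the search string, so each
# letter may be followed by one arbitrary byte (the UTF-16 high byte).
dotted_pattern() {
    printf '%s\n' "$1" | sed 's/./&./g'
}

pattern=$(dotted_pattern "Result")
echo "$pattern"
```

The generated pattern ends with a trailing dot, like the awk variant above. Regex metacharacters in the search string are not escaped by this helper, so it is only suitable for plain words.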
Alexander McFarlane
  • If your file is large, mostly ascii and you just want to visually inspect the output in a VT110-ish terminal, this might be just as good and more readable. You'd have your original problem again if you try to further pipe or redirect it though. – that other guy Jun 24 '15 at 00:02
  • Yeah, I'm just outputting to an additional screen with a beep on a new event – Alexander McFarlane Jun 24 '15 at 00:09
0

You can use ripgrep instead, which handles UTF-16 nicely without you having to convert your input:

tail -f file.txt | rg regexp