
Given the following two files created by the following commands:

$ printf "foo\nbar\nbaz\n" | iconv -t UTF-8 > utf-8.txt
$ printf "foo\nbar\nbaz\n" | iconv -t UTF-16 > utf-16.txt
$ file utf-8.txt utf-16.txt
utf-8.txt:  ASCII text
utf-16.txt: Little-endian UTF-16 Unicode text

I'd like to find a matching pattern in the UTF-16-formatted file the same way as in the UTF-8 one, using Ruby.

Here is a working example for the UTF-8 file:

$ ruby -e 'puts File.open("utf-8.txt").readlines.grep(/foo/)'
foo

However, it doesn't work for the UTF-16LE-formatted file:

$ ruby -e 'puts File.open("utf-16.txt").readlines.grep(/foo/)'
Traceback (most recent call last):
    3: from -e:1:in `<main>'
    2: from -e:1:in `grep'
    1: from -e:1:in `each'
-e:1:in `===': invalid byte sequence in US-ASCII (ArgumentError)

I've tried to convert the file, based on this post, with:

$ ruby -e 'puts File.open("utf-16.txt", "r").read.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)' 
ÿþfoo
bar
baz

but it prints some invalid characters (ÿþ) before foo. Secondly, I don't know how to use the grep method after the conversion (it reports an undefined method).

How can I use the readlines.grep method on a UTF-16 file? Or is there some other simple way to print the lines matching a specific regex pattern?

Ideally in one line, so the command can be used in CI tests.

Here is some real world scenario:

ruby -e 'File.readlines("utf-16.log").grep(/[1-9] error/) { exit 1 }'

but the command doesn't work because of the UTF-16 encoding of the log file.

kenorb

2 Answers


While the answer by Viktor is technically correct, recoding the whole file from UTF-16LE into UTF-8 is unnecessary and might hurt performance. All you actually need is to build the regexp in the same encoding:

puts File.open(
  "utf-16.txt", mode: "rb:BOM|UTF-16LE"
).readlines.grep(
  Regexp.new "foo".encode(Encoding::UTF_16LE)
)
#⇒ foo
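Applied outside the REPL, the same-encoding approach can be sketched end to end (the file name and its contents below are made up for demonstration):

```ruby
# Create a sample UTF-16LE file with a BOM (demo data, not from the question).
File.binwrite("utf16_demo.txt",
              "\uFEFFfoo\nbar\nbaz\n".encode(Encoding::UTF_16LE))

# Build the pattern in UTF-16LE so it is encoding-compatible with the lines;
# the lines stay tagged UTF-16LE, no transcoding of the file contents is needed.
pattern = Regexp.new("foo".encode(Encoding::UTF_16LE))
lines = File.open("utf16_demo.txt", mode: "rb:BOM|UTF-16LE").readlines.grep(pattern)

# Transcode only the matches for printing.
puts lines.map { |l| l.encode(Encoding::UTF_8) }
```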
Aleksei Matiushkin
  • I was thinking about the performance hit for large log files, too. I edited the answer with a slightly different way that I found: adding `mode: "rb:BOM|UTF-16LE:UTF-8"`, which according to the docs does the following: strings read will be tagged with UTF-16LE when reading, and strings output will be converted to UTF-8 when writing. Not sure if "tagged" means the same thing as invoking `encode`. Anyway, I didn't know that you can convert a Regexp string to a different encoding, so I like your solution. – Viktor Nonov Feb 17 '19 at 05:42
  • @ViktorNonov well, a regexp is just a set of bytes to be matched directly, plus several control entities (all within 7-bit ASCII). The FSM behind it matches them byte by byte; there is no magic :) – Aleksei Matiushkin Feb 17 '19 at 05:45

Short answer:

You almost have it; you just need to say which characters you want to replace (I would guess the invalid and the undefined ones):

$ ruby -e 'puts File.open("utf-16.txt", "r").read.encode("UTF-8", invalid: :replace, undef: :replace, replace: "")'
foo
bar
baz

Also, I don't think you need force_encoding.

If you want to ignore the BOM, convert on open, and use readlines, you can use:

$ ruby -e 'puts File.open("utf-16.txt", mode: "rb:BOM|UTF-16LE:UTF-8").readlines.grep(/foo/)'
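For the CI scenario from the question, this open mode lets a plain UTF-8 regexp do the matching; here is a minimal sketch (the log name and its contents are invented for the demonstration):

```ruby
# Write a sample UTF-16LE log with a BOM (invented contents for the demo).
File.binwrite("ci.log", "\uFEFFbuild started\n3 errors\n".encode(Encoding::UTF_16LE))

# External encoding UTF-16LE (BOM stripped), internal UTF-8: the lines arrive
# as UTF-8 strings, so an ordinary regexp literal works in grep.
errors = File.open("ci.log", mode: "rb:BOM|UTF-16LE:UTF-8").readlines.grep(/[1-9] error/)
status = errors.empty? ? 0 : 1  # a non-zero status would fail the CI step
```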

More details:

The reason why you get invalid characters when you do:

$ ruby -e 'puts File.open("utf-16.txt", "r").read.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)'
ÿþfoo
bar
baz

is that at the beginning of a Unicode file you can have the Byte Order Mark (BOM), which indicates the byte order and the encoding form. In your case it is FF FE (meaning little-endian UTF-16), which is not a valid UTF-8 sequence.
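You can see the BOM bytes directly by reading the first two bytes of the file; a small sketch (the file name here is made up, but any UTF-16LE file with a BOM behaves the same):

```ruby
# Write a UTF-16LE file with a BOM and inspect its first two bytes.
File.binwrite("bom_demo.txt", "\uFEFFfoo\n".encode(Encoding::UTF_16LE))

bom = File.binread("bom_demo.txt", 2).unpack1("H*")
puts bom  # prints "fffe" (FF FE, the little-endian BOM)
```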

You can verify that by invoking encode without force_encoding:

$ ruby -e 'puts File.open("utf-16.txt", "r").read.encode("utf-8")'
��foo
bar
baz

The question marks in black boxes (�) replace unknown, unrecognized, or unrepresentable characters.

You can read more about the BOM here.

Viktor Nonov