Using binary data (strings in utf-8) from external file

Question

I have a problem using UTF-8 strings, e.g. "\u0161\u010D\u0159\u017E\u00FD". When such a string is defined as a variable in my program, it works fine. But when I read the same string from an external file, I get the wrong output (not what I expect). I am probably missing some necessary encoding step...

My code:

file  = "c:\\...\\vlmList_unicode.txt" #\u306b\u3064\u3044\u3066
data = File.open(file, 'rb') { |io| io.read.split(/\t/) }
puts data
data_var = "\u306b\u3064\u3044\u3066"
puts data_var

Output:

\u306b\u3064\u3044\u3066 # what I don't want
について # what I want

I'm trying to read the file in binary mode by specifying 'rb', but obviously there is some other problem... I run my code in NetBeans 7.3.1 with the built-in JRuby 1.7.3 (I also tried Ruby 2.0.0, with the same result).

Since I'm new to the Ruby world, any ideas are welcome...

score 1 · Accepted Answer · edited May 23 '17 at 11:59

If your file contains the literal escaped string:

\u306b\u3064\u3044\u3066

Then you will need to unescape it after reading. Ruby does this for you with string literals, which is why the second case worked for you. Taken from the answer to "Is this the best way to unescape unicode escape sequences in Ruby?", you can use this:

file  = "c:\\...\\vlmList_unicode.txt" #\u306b\u3064\u3044\u3066
data = File.open(file, 'rb') { |io| 
  contents = io.read.gsub(/\\u([\da-fA-F]{4})/) { |m| 
    [$1].pack("H*").unpack("n*").pack("U*")
  }
  contents.split(/\t/)
}
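To see why the unescape step is needed, here is a small sketch that compares the two strings and walks through the pack/unpack chain one step at a time. The variable names are mine, chosen for illustration, and the single-quoted literal stands in for what `io.read` would return from a file holding the escaped text.

```ruby
# Single quotes keep the backslashes, so this matches the raw file content:
# each escape is six characters (backslash, "u", four hex digits).
from_file = '\u306b\u3064\u3044\u3066'

# Double quotes let Ruby decode the escapes when it parses the source.
literal = "\u306b\u3064\u3044\u3066"

puts from_file.length  # 24
puts literal.length    # 4

# The conversion applied to one captured group of hex digits:
hex         = "306b"
bytes       = [hex].pack("H*")    # hex digits -> two raw bytes (0x30, 0x6B)
code_points = bytes.unpack("n*")  # big-endian 16-bit integers -> [12395]
char        = code_points.pack("U*")  # code point -> UTF-8 character

p code_points  # [12395]
puts char
```

So the file never contained the Japanese characters at all, only ASCII text that describes them, and the `gsub` block converts each description into the character it names.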

Alternatively, if you would like to make it more readable, extract the substitution into a new method, and add it to the String class:

class String
  def unescape_unicode
    self.gsub(/\\u([\da-fA-F]{4})/) { |m| 
      [$1].pack("H*").unpack("n*").pack("U*")
    }
  end
end

Then you can call:

file  = "c:\\...\\vlmList_unicode.txt" #\u306b\u3064\u3044\u3066
data = File.open(file, 'rb') { |io| 
  io.read.unescape_unicode.split(/\t/)
}
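If you want to try the whole flow without a real file path, something like the following should work. It is only a sketch: the standalone `unescape_unicode(text)` helper, the sample line, and the temp file name are made up for illustration, and `[$1.hex].pack("U")` is a shorter conversion I swapped in for the pack/unpack chain above.

```ruby
require 'tempfile'

# Hypothetical standalone helper, used here instead of reopening String.
# $1.hex turns "306b" into 12395; pack("U") encodes that code point as UTF-8.
def unescape_unicode(text)
  text.gsub(/\\u([\da-fA-F]{4})/) { [$1.hex].pack("U") }
end

# Made-up sample: tab-separated fields, with the escapes stored as literal text.
sample = "key1\t" + '\u306b\u3064\u3044\u3066' + "\tuser1"

fields = Tempfile.create('unescape_demo') do |f|
  f.write(sample)
  f.flush
  # Read the file back as text, unescape it, then split on tabs.
  unescape_unicode(File.read(f.path, encoding: 'UTF-8')).split(/\t/)
end

p fields
```

Note that the pattern only matches the four-digit `\uXXXX` form. Escapes written as `\u{...}` or as surrogate pairs are not handled by this sketch.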

Thanks, this helps. Also thank you for comments which clarify the problem. — jivanko, Jul 24 '13 at 11:41
Glad to help. Would you consider upvoting since you found it useful? Thanks — Jon Cairns, Jul 24 '13 at 14:42
Sure, but I cannot upvote yet because of my reputation. As soon as I can I definitely will. Thanks again. — jivanko, Jul 24 '13 at 15:29

score 0 · Answer 2 · answered Jul 24 '13 at 15:48

Just as an FYI:

data = File.open(file, 'rb') { |io| io.read.split(/\t/) }

Can be written more simply as one of these:

data = File.read(file, mode: 'rb').split(/\t/)
data = File.readlines(file, "\t", mode: 'rb')

(Remember that File inherits from IO, which is where these methods are defined, so look in IO for documentation on them.)

readlines takes a "separator" parameter, which in the example above is "\t". Ruby uses it in place of the default line separator ($/, normally "\n"), so records are retrieved using the tab delimiter.
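One difference from `split` is worth seeing in a small example: each piece returned by `readlines` keeps its trailing separator. This is a sketch using a temp file with made-up content; the names are mine.

```ruby
require 'tempfile'

pieces = Tempfile.create('readlines_demo') do |f|
  f.write("a\tb\tc")
  f.flush
  # Tab is the record separator here instead of the default newline.
  File.readlines(f.path, "\t", mode: 'rb')
end

p pieces  # each piece except the last still ends with "\t"

# Strip the separator if you want the same result as split(/\t/).
trimmed = pieces.map { |piece| piece.chomp("\t") }
p trimmed
```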

This makes me wonder a bit why you'd want to do that, though. I've never seen tabs as record delimiters, only as column/field delimiters in "TSV" (Tab-Separated Values) files. So that leads me to think you should probably be using Ruby's CSV class, with "\t" as the column separator. But without samples of the actual file you're reading, I can't say for sure.
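As a rough idea of what that could look like, here is a sketch that assumes one record per line with tab-separated columns. The sample data and the `unescape` helper are invented for illustration, since I haven't seen the real file.

```ruby
require 'csv'

# Hypothetical helper: decodes literal \uXXXX escapes inside one cell.
def unescape(cell)
  cell.to_s.gsub(/\\u([\da-fA-F]{4})/) { [$1.hex].pack("U") }
end

# Made-up sample: two records, three tab-separated columns each,
# with the escapes stored as literal text.
tsv = "key1\t" + '\u306b_1' + "\tuser1\n" \
      "key2\t" + '\u3064_2' + "\tuser2\n"

# col_sep switches the column separator from "," to a tab.
rows = CSV.parse(tsv, col_sep: "\t").map do |row|
  row.map { |cell| unescape(cell) }
end

p rows  # a 2D array: one inner array per line of input
```

For a file on disk, `CSV.read(path, col_sep: "\t")` returns the same kind of 2D array. One caveat: CSV treats double quotes as quoting characters by default, so data containing stray `"` characters may need extra options.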

Actually, in my file tabs are column delimiters; lines look like this: `key1 \u306b_1 user1`. Without literal escaped strings I used `File.readlines` and got a 2D array. With literal escaped strings I was not able to process the file correctly. Following this page [link](http://blog.leosoto.com/2008/03/reading-binary-file-on-ruby.html), I used `File.open(file, 'rb') { |io| io.read.split(/\t/) }`. After adapting the answer from @joonty I get a 2D array again. I'm probably not doing it the idiomatic Ruby way, but it works for me. What I need is to read such a file and create a 2D array. Thanks. — jivanko, Jul 25 '13 at 07:01
