
I'm trying to copy the first 5 lines from each of a set of large (>500 MB) CSV files into small header files so that I can inspect the content more easily.

I'm using Ruby code to do this but am getting each line padded out with extra Chinese characters, like this:

 week_num   type    ID  location    total_qty   A_qty   B_qty   count਍㌀㐀ऀ猀漀爀琀愀戀氀攀ऀ㄀㤀㜀ऀ䐀䔀开伀渀氀礀ऀ㔀㐀㜀㈀ ㌀ऀ㔀㐀㜀㈀ ㌀ऀ ऀ㤀㄀㈀㔀㌀ഀ
 44 small   14  A   907859  907859  0   550360਍㐀㄀ऀ猀漀爀琀愀戀氀攀ऀ㐀㈀㄀ऀ䐀䔀开伀渀氀礀ऀ㌀ ㈀㄀㜀㐀ऀ㌀ ㈀㄀

The first few lines of input file are like so:

 week_num   type    ID  location    total_qty   A_qty   B_qty   count
 34 small   197 A   547203  547203  0   91253
 44 small   14  A   907859  907859  0   550360
 41 small   421 A   302174  302174  0   18198

The strange characters appear to be Line 1 and Line 3 of the data.

Here's my Ruby code:

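# ARGV values are strings, so convert the line count to an integer.
# file_in and file_out are set elsewhere (not shown here).
num_lines = ARGV[0].to_i
fh = File.open(file_in, "r")
fw = File.open(file_out, "w")
until (line = fh.gets).nil? or num_lines == 0
    fw.puts line
    num_lines = num_lines - 1
end
fh.close
fw.close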

Any idea what's going on and what I can do to simply stop at the line end character?


Looking at input/output files in hex (useful suggestion by @user1934428)

Input file - each character looks to be two bytes.


Output file - notice the NULL (00) between each single byte character...

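
If no hex editor is at hand, the same check can be sketched in Ruby itself. The helper name `hex_head` and its default of 32 bytes are hypothetical choices for illustration; the idea is only to read the first bytes in binary mode and print them as hex, so that a 00 after every ASCII character (as in UTF-16LE) becomes visible.

```ruby
# Hypothetical helper: return the first `count` bytes of a file as a hex string.
def hex_head(path, count = 32)
  # Binary mode ("rb") so that no encoding or newline conversion is applied.
  bytes = File.open(path, "rb") { |f| f.read(count) } || ""
  # Unpack into unsigned byte values and format each as two hex digits.
  bytes.unpack("C*").map { |b| format("%02x", b) }.join(" ")
end
```

For a UTF-16LE file starting with the text `we`, the first four bytes come out as `77 00 65 00`.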

Ruby version 1.9.1

Assad Ebrahim
  • Can you include the first few lines of your input file? – Jared Beck Nov 10 '15 at 04:24
  • Some of those characters aren't Chinese. Are you getting some sort of binary data that's just getting displayed as if it were UTF-8? – davejagoda Nov 10 '15 at 05:07
  • @JaredBeck: Added input file. Looks like Lines 1 and 3 are being mangled, while Line 2 is passing through to the output as normal. – Assad Ebrahim Nov 10 '15 at 06:09
  • You should look at your input file in hex format, not as text. My guess is that the files are in a different encoding than your Ruby program expects, so the first step would be to determine the encoding of the file. In addition, we need to know which Ruby version you are using, because the way Ruby deals with the encoding of external files has changed over time. – user1934428 Nov 10 '15 at 06:53
  • @user1934428 - Great idea! Hex views of input & output added. Does look like an encoding mismatch: input seems to be 16-bit (2 byte) encoding. Output has NULL (00) byte between every 8-bit (1 byte) character. I'd guess readline is interpreting the input as an 8-bit format, hence spitting out NULL characters. Ruby version 1.9.1 – Assad Ebrahim Nov 10 '15 at 10:18
  • @user1934428 - Problem solved. Your hint was excellent, and led me to the answer. Thanks :) – Assad Ebrahim Nov 10 '15 at 11:58

1 Answer

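The problem is an encoding mismatch: the code never specifies an encoding when it opens the files, so Ruby handles the UTF-16LE input as single-byte text. In UTF-16LE every ASCII character is stored as two bytes (the character followed by 00), which is presumably why byte-oriented line handling leaves alternate lines one byte out of step. The fix is to open the input CSV in binary mode ("rb") with the utf-16le encoding, and to write the output the same way ("wb").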

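# ARGV values are strings, so convert the line count to an integer.
# file_in and file_out are assumed to be defined earlier (not shown here).
num_lines = ARGV[0].to_i

# ****** Specifying the right encodings  <<<< this is the key
fh = File.open(file_in, "rb:utf-16le")
fw = File.open(file_out, "wb:utf-16le")

until (line = fh.gets).nil? or num_lines == 0
    fw.puts line
    num_lines = num_lines - 1
end

fh.close
fw.close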
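
As a variation on the same idea, the lines can be transcoded on the way in so that the small header file ends up as UTF-8, which most viewers display without any special handling. This is only a sketch: the helper name `head_utf16le_to_utf8` and its parameters are hypothetical, and it assumes the input really is UTF-16LE. If the input starts with a byte order mark, that character is carried through as the first character of the output.

```ruby
# Hypothetical helper (not from the original post): copy the first `count`
# lines of a UTF-16LE file into a UTF-8 file.
def head_utf16le_to_utf8(path_in, path_out, count)
  # "rb:UTF-16LE:UTF-8" = binary mode, external encoding UTF-16LE,
  # internal encoding UTF-8, so each line arrives as a UTF-8 string.
  File.open(path_in, "rb:UTF-16LE:UTF-8") do |fin|
    # Binary mode on output writes the UTF-8 bytes unchanged, so the
    # original line endings (e.g. CRLF) are preserved.
    File.open(path_out, "wb") do |fout|
      # each_line without a block is lazy; first(count) stops reading
      # after `count` lines instead of scanning the whole file.
      fin.each_line.first(count).each { |line| fout.write(line) }
    end
  end
end
```

Because the block form of `File.open` is used, both files are closed automatically, even if an error is raised part-way through.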

Assad Ebrahim