I want to process files line by line. However, these files have different line separators: "\r", "\n" or "\r\n". I don't know which one they use or which kind of OS they come from.

I have two solutions:

  1. Use shell commands to translate these separators to "\n":

    sed $'s/\r$//' file |
    tr '\r' '\n' |
    ruby process.rb
    
  2. Read the whole file and gsub these separators:

    text = File.read('xxx.txt')
    text.gsub!(/\r\n?/, "\n")
    text.each_line do |line|
      # do something with line
    end
    

But the second solution does not work well when the file is huge, because it reads the entire file into memory. See reference. Is there a more idiomatic and efficient Ruby solution?

ryan

1 Answer


I suggest you first determine the line separator. I assume this can be done by reading characters until the first "\n" or "\r" is encountered (if the end of the file is reached first, "\n" is taken as the separator). If "\n" is found first, it is the separator. If "\r" is found, I try to read the next character: if there is one and it is "\n", the separator is "\r\n"; if "\r" is the last character in the file or is followed by anything other than "\n", the separator is "\r".

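def separator(fname)
  # The block form of File.open closes the file, even on an early return.
  File.open(fname) do |f|
    # Scan characters until the first "\r" or "\n".
    f.each_char do |c|
      return "\n" if c == "\n"
      next unless c == "\r"
      # "\r" followed by "\n" is "\r\n"; otherwise it is a bare "\r".
      # getc returns nil at the end of the file, which also yields "\r".
      return f.getc == "\n" ? "\r\n" : "\r"
    end
  end
  # No separator found (this includes an empty file): default to "\n".
  "\n"
end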

Then process the file line by line:

def process(fname)
  sep = separator(fname)
  IO.foreach(fname, sep) { |line| puts line }
end
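
Note that each line yielded this way still ends with its separator. If that is unwanted, `String#chomp` can remove it. Here is a minimal sketch, assuming the separator is already known and passed in; `each_clean_line` is a hypothetical helper name, not part of the code above:

```ruby
require "tempfile"

# Hypothetical helper: yields each line of fname with the trailing
# separator sep removed. sep is assumed to be known already.
def each_clean_line(fname, sep)
  File.foreach(fname, sep) { |line| yield line.chomp(sep) }
end

# Demo on a temporary file that uses "\r" as its line separator.
Tempfile.create("demo") do |f|
  f.binmode
  f.write("first\rsecond\r")
  f.flush
  each_clean_line(f.path, "\r") { |line| puts line }
  # prints "first" and then "second", each on its own line
end
```

Passing the separator to `chomp` explicitly keeps the behaviour the same for "\n", "\r" and "\r\n".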

I haven't converted "\r" or "\r\n" to "\n", but that is easy to do: open a file for writing and, in process, write each line that is read to the output file with the default line separator.
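
The conversion could look roughly like this. It is only a sketch: `normalize_newlines` is a hypothetical name, the separator is assumed to have been detected beforehand, and a final line without a terminator gains a trailing "\n".

```ruby
require "tempfile"

# Hypothetical helper: copies src to dst one line at a time, replacing the
# separator sep (assumed already detected) with "\n".
def normalize_newlines(src, dst, sep)
  # "wb" keeps "\n" from being translated on platforms that would do so.
  File.open(dst, "wb") do |out|
    File.foreach(src, sep) do |line|
      out << line.chomp(sep) << "\n"
    end
  end
end

# Demo: convert a "\r\n" file and inspect the result.
Tempfile.create("src") do |src|
  src.binmode
  src.write("one\r\ntwo\r\n")
  src.flush
  Tempfile.create("dst") do |dst|
    normalize_newlines(src.path, dst.path, "\r\n")
    p File.binread(dst.path)  # prints "one\ntwo\n"
  end
end
```

Only one line is held in memory at a time, so this should also work for huge files.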

Let's try it (for clarity I show the value returned by separator):

fname = "temp"

IO.write(fname, "slash n line 1\nslash n line 2\n")
  #=> 30 
separator(fname)                                    
  #=> "\n" 
process(fname)
  # slash n line 1
  # slash n line 2

IO.write(fname, "slash r line 1\rslash r line 2\r")
  #=> 30 
separator(fname)
  #=> "\r" 
process(fname)
  # slash r line 1
  # slash r line 2

IO.write(fname, "slash r slash n line 1\r\nslash r slash n line 2\r\n")
  #=> 48 
separator(fname)
  #=> "\r\n" 
process(fname)
  # slash r slash n line 1
  # slash r slash n line 2
Cary Swoveland
  • I think you need to catch an error if you reach the end of the file before hitting the `\n` or `\r` ? But thank you for the work you did! – pedz Mar 30 '20 at 00:38