57

Is there a good way to read, edit, and write files in place in Ruby?

In my online search I've found stuff suggesting to read it all into an array, modify said array, then write everything out. I feel like there should be a better solution, especially if I'm dealing with a very big file.

Something like:

myfile = File.open("path/to/file.txt", "r+")

myfile.each do |line|
    myfile.replace_puts('blah') if line =~ /myregex/
end

myfile.close

Where replace_puts would write over the current line, rather than (over)writing the next line as it currently does because the pointer is at the end of the line (after the separator).

So then every line that matches /myregex/ will be replaced with 'blah'. Obviously what I have in mind is a bit more involved than that, as far as processing, and would be done in one line, but the idea is the same - I want to read a file line by line, and edit certain lines, and write out when I'm done.

Maybe there's a way to just say "rewind back to just after the last separator"? Or some way of using each_with_index and write via a line index number? I couldn't find anything of the sort, though.

The best solution I have so far is to read things line-wise, write them out to a new (temp) file line-wise (possibly edited), then overwrite the old file with the new temp file and delete. Again, I feel like there should be a better way - I don't think I should have to create a new 1 GB file just to edit some lines in an existing 1 GB file.

the Tin Man
Hsiu
  • 1
    Consider the results if your code to read then overwrite were to fail partway through the process: You run the risk of destroying the file. – the Tin Man Dec 10 '10 at 00:29
  • Alright, as a follow-up question: from the command line, you can do this: ruby -pe "gsub(/blah/,'newstuff')" whatev.txt. That does what I want to do, but I don't want to do it on the command line like that, I want to put it inside something larger. Can anyone tell me, internally, what that command is doing that gives the illusion of editing a file, line by line? Is it writing to a temp file, or using arrays? Because it seems to work on quite large files fairly quickly, more so than the suggestions offered here so far. – Hsiu Dec 10 '10 at 08:42
  • That's a great question. Could you please make it into a new question? That makes it much easier for others to see it and answer it. Also, if this question was answered to your satisfaction, can you please accept that answer? Thanks! – Wayne Conrad Dec 16 '10 at 23:20
  • While it might seem inefficient to read a file line-by-line, and write to a new file, in reality, [the speed is equal-to or better-than trying to read a huge file into memory](http://stackoverflow.com/a/25189286/128421), modify it and write it back. It's an accepted programming practice to do it this way, and, no, there really isn't a better solution once you factor in the speed, memory requirements, and data safety. – the Tin Man Nov 26 '14 at 18:18

4 Answers

77

In general, there's no way to make arbitrary edits in the middle of a file. It's not a deficiency of Ruby. It's a limitation of the file system: Most file systems make it easy and efficient to grow or shrink the file at the end, but not at the beginning or in the middle. So you won't be able to rewrite a line in place unless its size stays the same.
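
To make the overwrite behaviour concrete, here is a minimal sketch. `overwrite_demo` is a hypothetical helper (not part of Ruby or of this answer) that writes into the middle of a scratch file and returns the resulting content. It uses only the standard `tempfile` library and switches to binary mode so that byte offsets mean the same thing on every platform.

```ruby
require 'tempfile'

# Hypothetical demo helper: fills a scratch file with `content`, writes `text`
# at byte `offset`, and returns what the file contains afterwards.
def overwrite_demo(content, offset, text)
  Tempfile.create('overwrite_demo') do |f|
    f.binmode          # no newline translation, so offsets are plain bytes
    f.write(content)
    f.seek(offset)     # jump to the start of the second line
    f.write(text)      # overwrites bytes in place; nothing is shifted
    f.rewind
    f.read
  end
end

# Same length: only the second line changes.
p overwrite_demo("aaa\nbbb\nccc\n", 4, "BBB")    # => "aaa\nBBB\nccc\n"

# Longer replacement: it spills over the separator and into the next line.
p overwrite_demo("aaa\nbbb\nccc\n", 4, "XXXXX")  # => "aaa\nXXXXXcc\n"
```

The second call shows why a line can only be rewritten in place when its size stays the same: the file never grows in the middle, so the extra bytes replace whatever followed.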

There are two general models for modifying a bunch of lines. If the file is not too large, just read it all into memory, modify it, and write it back out. For example, adding "Kilroy was here" to the beginning of every line of a file:

path = '/tmp/foo'
lines = IO.readlines(path).map do |line|
  'Kilroy was here ' + line
end
File.open(path, 'w') do |file|
  file.puts lines
end

Although simple, this technique has a danger: If the program is interrupted while writing the file, you'll lose part or all of it. It also needs to use memory to hold the entire file. If either of these is a concern, then you may prefer the next technique.

You can, as you note, write to a temporary file. When done, rename the temporary file so that it replaces the input file:

require 'tempfile'
require 'fileutils'

path = '/tmp/foo'
temp_file = Tempfile.new('foo')
begin
  File.open(path, 'r') do |file|
    file.each_line do |line|
      temp_file.puts 'Kilroy was here ' + line
    end
  end
  temp_file.close
  FileUtils.mv(temp_file.path, path)
ensure
  temp_file.close
  temp_file.unlink
end

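When the temporary file and the target are on the same file system, the rename (FileUtils.mv) is atomic, so the rewritten input file will pop into existence all at once. If the program is interrupted, either the file will have been rewritten, or it will not. There's no possibility of it being partially rewritten. Across file systems, FileUtils.mv falls back to copying and then deleting, which is not atomic. Note that Tempfile.new('foo') creates its file in the system temp directory, which may be on a different file system from the target.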

The ensure clause is not strictly necessary: The file will be deleted when the Tempfile instance is garbage collected. However, that could take a while. The ensure block makes sure that the tempfile gets cleaned up right away, without having to wait for it to be garbage collected.
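
The same idea can be wrapped in a small helper. The following is a hedged sketch rather than the code above: `rewrite_lines` is a hypothetical name, and two choices in it are assumptions on my part. It creates the temp file in the target's own directory, so the final rename stays within one file system, and it copies the original permission bits, because Tempfile creates its files with mode 0600.

```ruby
require 'tempfile'

# Hypothetical helper (not part of Ruby's standard library): rewrites `path`
# line by line, replacing each line with the block's return value.
def rewrite_lines(path)
  # Assumption: the target's directory is writable, so the temp file can
  # live next to the file it will replace.
  temp_file = Tempfile.new(File.basename(path), File.dirname(path))
  begin
    File.foreach(path) do |line|
      temp_file.write(yield(line))
    end
    temp_file.close
    # Carry over the original permission bits before swapping the files.
    File.chmod(File.stat(path).mode & 0o7777, temp_file.path)
    File.rename(temp_file.path, path)
  ensure
    temp_file.close
    temp_file.unlink   # harmless if the file has already been renamed away
  end
end
```

With this in place, the Kilroy example becomes `rewrite_lines('/tmp/foo') { |line| 'Kilroy was here ' + line }`. Memory use stays at roughly one line at a time, at the cost of temporarily needing disk space for a second copy of the file.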

the Tin Man
Wayne Conrad
  • 1
    +1 It's always better to be conservative when modifying files, especially big ones. – the Tin Man Dec 10 '10 at 00:31
  • you are about to close the temp_file, why rewind it? – hihell Oct 25 '13 at 08:43
  • @hihell, BookOfGreg's edit added the rewind; his remark was: "FileUtils.mv will write a blank file unless the temporary file is rewound. Also best practice is to make sure temp file is closed and unlinked after usage." – Wayne Conrad Oct 25 '13 at 13:39
  • 1
    What happens in the second scenario to the file's created date? Will `FileUtils.mv` cause us to end up with a file that looks as if it had been created just now? If so, that's a very big difference between the two scenarios (as the first one leaves the file created date alone). – matt May 11 '14 at 05:47
  • 1
    @Matt I've never thought about this technique's effect upon the creation date, but it seems obvious that you are correct. – Wayne Conrad May 11 '14 at 06:52
  • it seems like this functionality suggests improved API for rewriting file contents to the same file line by line. Perhaps the temp file should be internal implementation of a certain method, rather than being imperative. – ahnbizcad Aug 31 '16 at 22:05
  • Thank you for also showing the simple one, and describing why it’s bad. I have a case where content will be discarded if the write is interrupted, so the simple case is all good here. – Smar Nov 08 '20 at 05:12
9

If you want to overwrite a file line by line, you'll have to ensure the new line has the same length as the original line. If the new line is longer, part of it will be written over the next line. If the new line is shorter, the remainder of the old line just stays where it is. The tempfile solution is really much safer. But if you're willing to take a risk:

File.open('test.txt', 'r+') do |f|   
    old_pos = 0
    f.each do |line|
        f.pos = old_pos   # this is the 'rewind'
        f.print line.gsub('2010', '2011')
        old_pos = f.pos
    end
end
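
A slightly more defensive variant of the loop above, as a sketch only: `overwrite_same_length` is a hypothetical helper name. It opens the file in binary mode, compares byte sizes rather than character counts, and raises instead of writing when a replacement would change a line's length. It assumes the search and replacement strings are plain ASCII.

```ruby
# Hypothetical helper: replaces `from` with `to` in place, line by line,
# and refuses to write any line whose byte length would change.
def overwrite_same_length(path, from, to)
  File.open(path, 'r+b') do |f|
    line_start = 0
    f.each_line do |line|
      new_line = line.gsub(from, to)
      if new_line != line
        unless new_line.bytesize == line.bytesize
          raise ArgumentError, "replacement changes line length at byte #{line_start}"
        end
        f.pos = line_start   # jump back to the start of the line just read
        f.write(new_line)    # same size, so the next line is left untouched
      end
      line_start = f.pos     # the next line starts where this one ended
    end
  end
end
```

For example, `overwrite_same_length('test.txt', '2010', '2011')` performs the same edit as the block above. Lines written before an error is raised stay modified, so this guards against corrupting the following line, not against a partially edited file.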

If the line size does change, this is a possibility:

File.open('test.txt', 'r+') do |f|   
    out = ""
    f.each do |line|
        out << line.gsub(/myregex/, 'blah') 
    end
    f.pos = 0                     
    f.print out
    f.truncate(f.pos)             
end
steenslag
  • Is the 2nd solution apt for large files containing millions of lines ? Won't it take space in memory for that operation ? – Y M Sep 05 '15 at 03:17
4

Just in case you are using Rails or Facets, or you otherwise depend on Rails' ActiveSupport, you can use the atomic_write extension to File:

File.atomic_write('path/file') do |file|
  file.write('your content')
end

Behind the scenes, this will create a temporary file which it will later move to the desired path, taking care of closing the file for you.

It further clones the file permissions of the existing file or, if there isn't one, of the current directory.

Kostas Rousis
2

You can write in the middle of a file, but you have to be careful to keep the length of the string you overwrite the same; otherwise you overwrite some of the following text. Here is an example using File#seek. With IO::SEEK_CUR the offset is relative to the current position of the file pointer, which is at the end of the line that was just read. The +1 is for the CR character at the end of the line, which assumes CRLF line endings as on Windows.

look_for     = "bbb"
replace_with = "xxxxx"

File.open(DATA, 'r+') do |file|
  file.each_line do |line|
    if (line[look_for])
      file.seek(-(line.length + 1), IO::SEEK_CUR)
      file.write line.gsub(look_for, replace_with)
    end
  end
end
__END__
aaabbb
bbbcccddd
dddeee
eee

After execution, the end of the script contains the following, which is probably not what you had in mind.

aaaxxxxx
bcccddd
dddeee
eee

Taking that into consideration, this technique is much faster than the classic 'read and write to a new file' method. See these benchmarks on a 1.7 GB file of music data. For the classic approach I used Wayne's technique. The benchmark is done with the .bmbm method so that caching of the file doesn't play a big role. Tests were done with MRI Ruby 2.3.0 on Windows 7. I checked that both methods actually replaced the strings.

require 'benchmark'
require 'tempfile'
require 'fileutils'

look_for      = "Melissa Etheridge"
replace_with  = "Malissa Etheridge"
very_big_file = 'D:\Documents\muziekinfo\all.txt'.gsub('\\','/')

def replace_with file_path, look_for, replace_with
  File.open(file_path, 'r+') do |file|
    file.each_line do |line|
      if (line[look_for])
        file.seek(-(line.length + 1), IO::SEEK_CUR)
        file.write line.gsub(look_for, replace_with)
      end
    end
  end
end

def replace_with_classic path, look_for, replace_with
  temp_file = Tempfile.new('foo')
  File.foreach(path) do |line|
    if (line[look_for])
      temp_file.write line.gsub(look_for, replace_with)
    else
      temp_file.write line
    end
  end
  temp_file.close
  FileUtils.mv(temp_file.path, path)
ensure
  temp_file.close
  temp_file.unlink
end

Benchmark.bmbm do |x| 
  x.report("adapt          ") { 1.times {replace_with very_big_file, look_for, replace_with}}
  x.report("restore        ") { 1.times {replace_with very_big_file, replace_with, look_for}}
  x.report("classic adapt  ") { 1.times {replace_with_classic very_big_file, look_for, replace_with}}
  x.report("classic restore") { 1.times {replace_with_classic very_big_file, replace_with, look_for}}
end 

Which gave

Rehearsal ---------------------------------------------------
adapt             6.989000   0.811000   7.800000 (  7.800598)
restore           7.192000   0.562000   7.754000 (  7.774481)
classic adapt    14.320000   9.438000  23.758000 ( 32.507433)
classic restore  14.259000   9.469000  23.728000 ( 34.128093)
----------------------------------------- total: 63.040000sec

                      user     system      total        real
adapt             7.114000   0.718000   7.832000 (  8.639864)
restore           6.942000   0.858000   7.800000 (  8.117839)
classic adapt    14.430000   9.485000  23.915000 ( 32.195298)
classic restore  14.695000   9.360000  24.055000 ( 33.709054)

So the in-place replacement was about 4 times faster.

peter