I am doing some simple parsing of large files: I am trying to select blocks from a large text file and write them to a new text file. My current method is very slow because the files contain more than 3 million lines. For example, the file to parse:

1test
1111
2222
3333
4444
1test
5555
6666
2test
5555
4444
3test
0000
4test
9999
0000
5test
3333
3333
8test
2222
9test
6666
11test
1111

I want to get the following data in the new file:

1test
1111
2222
3333
4444
1test
5555
6666
2test
5555
4444
3test
0000
4test
9999
0000
5test
3333
3333

In short, I'm trying to select specific blocks from the source file.

My Code:

arr = []

data = File.read("/path/to/file")

blocks = ['1test','2test','3test','4test','5test']
blocks.each do |block|

want = data.match(/#{block}(.*)#{block}/m)[0]
want.each_line do |line|
  arr << line
  File.open("/path/to/result/file", 'w') { |file| file.write(arr.join) }
end

end

I think that my problem is that I read the data "want" many times. Is there a way to write to the result file in one pass of the "want" data?

Roman Kiselenko
Misha1991

1 Answer

Code

require 'set'

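def save_blocks(fname_in, fname_out, *blocks)
  sblocks = blocks.to_set   # a set gives fast membership tests
  save = false              # true while inside a wanted block
  File.open(fname_out, 'w') do |f|
    # File.foreach reads one line at a time, so the input is never loaded whole
    File.foreach(fname_in) do |line|
      lc = line.chomp
      # a header line such as "3test" turns saving on or off
      save = sblocks.include?(lc) if lc =~ /\A\d+test\z/
      f.write(line) if save
    end
  end
end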

Example

Let's first create a test file where str is the example string given in the question.

FNameIn  = "test.in"
FNameOut = "test.out"
File.write(FNameIn, str)
  #=> 135

We can check that.

puts File.read(FNameIn)
1test
1111
2222
...
3test
0000
4test
...
11test
1111

Now we execute the method.

save_blocks(FNameIn, FNameOut, "1test", "3test", "5test")

We can confirm the output file was written correctly.

puts File.read(FNameOut)
1test
1111
2222
3333
4444
1test
5555
6666
3test
0000
5test
3333
3333

I converted blocks to a set merely to speed the operation of include?. There is no need to explicitly close either file, as they are closed when their respective blocks return.
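
To check the same one-pass idea against the five block names from the question without touching the disk, here is a minimal sketch. The helper name copy_blocks and the use of StringIO are my own assumptions for illustration; they are not part of the method above, which works on file names.

```ruby
require 'set'
require 'stringio'

# Hypothetical helper (assumption, not the save_blocks method above):
# the same single-pass filter, but reading from and writing to IO-like objects.
def copy_blocks(input, output, wanted_labels)
  wanted = wanted_labels.to_set
  keep = false
  input.each_line do |line|
    label = line.chomp
    # A header line such as "3test" switches copying on or off.
    keep = wanted.include?(label) if label.match?(/\A\d+test\z/)
    output.write(line) if keep
  end
  output
end

# The sample data from the question, one item per line.
sample = %w[1test 1111 2222 3333 4444 1test 5555 6666 2test 5555 4444
            3test 0000 4test 9999 0000 5test 3333 3333 8test 2222
            9test 6666 11test 1111].join("\n") + "\n"

result = copy_blocks(StringIO.new(sample), StringIO.new,
                     %w[1test 2test 3test 4test 5test]).string
puts result
```

With these five labels the printed text should be the first 19 lines of the sample, ending with the two 3333 lines after 5test, which is the output the question asks for. Each input line is read once and each kept line is written once, so the work grows linearly with the file size.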

Cary Swoveland