0

I have a 2 GiB file, and I want to read only its first line. I can call the File#readlines method, which returns an array, and then use the [0] bracket syntax, at(0), slice(0), or first.

But there's a problem. My PC has 3.7 GiB of RAM, and memory usage climbs from 1.1 GiB all the way up to 3.7 GiB, because readlines loads every line of the file into memory. All I want is the first line. Is there an efficient way to get it?

pjs
15 Volts
  • Does this answer your question? [How to get a particular line from a file](https://stackoverflow.com/questions/4014352/how-to-get-a-particular-line-from-a-file) – Nakilon Apr 01 '21 at 23:52
  • Uh no, imagine an exaggerated situation of having a hundred GB file. When you run `tail 100_GB_file`, tail will read only the last 10 lines or given lines. You don't essentially need to run billions of iterations and call `.next()` on `IO.foreach(file, splitter)`, or you can't read entire file in puny 8 GB RAM. I don't know if that's possible in Ruby. But I have solved this problem with Ruby C Extension, especially reading the file in C. This truly solves my problem: https://www.geeksforgeeks.org/implement-your-own-tail-read-last-n-lines-of-a-huge-file/ . But it isn't a true ruby solution... – 15 Volts Apr 02 '21 at 11:10
  • Good to know about your solution but your Question isn't about tail. – Nakilon Apr 02 '21 at 12:22

5 Answers

1

Taken from https://www.rosettacode.org/wiki/Read_a_specific_line_from_a_file#Ruby

 seventh_line = open("/etc/passwd").each_line.take(7).last
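
A possible variation on the one-liner above, as a minimal sketch rather than the Rosetta Code version: `take(7)` collects seven lines into an array, and `open` without a block leaves the file handle open until it is garbage collected. The helper name `nth_line_lazy` is hypothetical. It uses `Enumerator::Lazy#drop` to skip lines without storing them, and the block form of `File.open` to close the file.

```ruby
# Hypothetical helper (not from Rosetta Code): return line n (1-based),
# or nil when the file has fewer than n lines.
def nth_line_lazy(path, n)
  raise ArgumentError, 'n must be >= 1' if n < 1

  # The block form of File.open closes the file when the block returns.
  File.open(path) do |f|
    # lazy.drop skips n - 1 lines without collecting them into an array.
    f.each_line.lazy.drop(n - 1).first
  end
end
```

One behavioural difference: for a file shorter than seven lines, `take(7).last` returns the last line that exists, while this sketch returns nil.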
TorvaldsDB
0

Have you tried readline instead of readlines?

File.open('file-name') { |f| f.readline }
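
If the first line is itself very long (for example, a file with no newline at all), `readline` still has to hold that whole line in memory. Whether that applies to the asker's file is an assumption; the thread does not say. A minimal sketch that caps the read with the `limit` argument of `IO#gets` follows. The helper name `first_line_bounded` and the 64 KiB default are illustrative only.

```ruby
# Hypothetical helper: read the first line, but stop after roughly
# max_bytes bytes even if no newline has been seen yet.
def first_line_bounded(path, max_bytes = 64 * 1024)
  File.open(path) do |f|
    # gets returns nil on an empty file, where readline would raise EOFError.
    f.gets(max_bytes)
  end
end
```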
Yoav Epstein
  • Yes, it consumes a good amount of memory! Only `readpartial` doesn't eat up that much... – 15 Volts Aug 04 '19 at 23:34
  • Because the file contains ASCII text, I can do this: `ch = ''.tap { |a| File.open('hello.txt') { |x| loop until a.concat(x.readpartial(1))[-1] == ?\n } }` for now, without causing any memory issues... This will read the first line. But if the first line contains a newline character, it will take that empty line. `strip` can be used to strip off the extra leading spaces or new lines. The answer still causes memory problem, but thanks for attempting to answer. – 15 Volts Aug 04 '19 at 23:42
0

What about IO.foreach?

IO.foreach('filename') { |line| p line; break }

That should read the first line, print it, and then stop. It does not read the entire file; it reads one line at a time.
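
The same idea can be wrapped in a small method. This is a sketch under stated assumptions: the name `first_line` is hypothetical, and it calls `File.foreach` (the same method, inherited from IO) with the `chomp: true` keyword available in Ruby 2.4 and later.

```ruby
# Hypothetical helper: first line without its trailing newline,
# or nil for an empty file.
def first_line(path)
  # Without a block, File.foreach returns an Enumerator; #first stops
  # after one line, and the file is closed when the iteration ends.
  File.foreach(path, chomp: true).first
end
```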

anothermh
  • Thanks, `IO.foreach('hello.txt').first` works flawlessly! Or `IO.foreach('hello.txt').take(2).to_a[1]` for getting the second line... – 15 Volts Aug 04 '19 at 23:50
0

I would use the command line. For example:

exec("cat #{filename} | head -#{nth_line} | tail -1")

I hope it's useful for you.
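
Two caveats apply to the `exec` call: it replaces the running Ruby process, so nothing after it executes, and interpolating `filename` into a shell string misbehaves for names containing spaces or shell metacharacters. A minimal sketch that keeps the same `head | tail` pipeline but avoids both issues is shown here. It uses `Open3.pipeline_r` from the standard library; the helper name is hypothetical, and it assumes `head` and `tail` are on the PATH.

```ruby
require 'open3'

# Hypothetical helper: run `head -n N FILE | tail -n 1` without a shell
# and return the pipeline's output as a string.
# Assumes the head and tail binaries are available on the PATH.
def nth_line_via_shell_tools(path, n)
  # Each command is an argv array, so path is never parsed by a shell.
  Open3.pipeline_r(['head', '-n', n.to_s, path], ['tail', '-n', '1']) do |out, _threads|
    out.read
  end
end
```

Like the original command, this returns the last available line when the file has fewer than n lines.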

Sandra
  • Thanks for answering. But it's terrible choice to use shell inside Ruby. I always try to avoid that. You are calling a separate binary. Also, one problem is that it's not a Ruby way. Your system with Ruby will have the IO and the File class. But your system may be missing cat! The other thing is that calling binaries is slow. I have benchmarked the `clear` method and `print "\e[2J\e[H\e[3J"`. Both do the same job, but the ANSI one is a 100k times faster. I would use these things only for MRuby but my question is for general Ruby or MRI. Sorry but -1 for that... – 15 Volts Aug 05 '19 at 13:36
  • @S.Goswami I would remove the downvote. [Use your downvotes whenever you encounter an egregiously sloppy, no-effort-expended post, or an answer that is clearly and perhaps dangerously incorrect.](https://stackoverflow.com/help/privileges/vote-down) This answer is functionally correct, even if it isn't optimal or perfect for your use-case, and would in many circumstances work exactly as someone would expect. – anothermh Aug 09 '19 at 21:03
  • @anothermh, I got you, but your program in such case depends on cat and head. You don't need to do that because Ruby has everything built in for you. It's helpful for those using MRuby. For example, `IO.foreach` is available on Linux, Windows, and Mac, and Android, and whatnot, but if you follow the answerer, you are left with Linux / Unix... And also `exec(...)` will cause your program to exit after the commands are executed... Yes, it's another possibility to call shell, but if you do a benchmark of reading a gig file and read the first line 100K times, you will surely know the difference! – 15 Volts Aug 10 '19 at 04:12
  • @anothermh, this is correct answer, but it's not Ruby right? You can use Perl / Python / Lua etc. inside of `Kernel#\`\`` / `exec` / `Kernel#system` / `IO#popen` etc. instead of the BASH script, which will be slower but will work. That's why I think there's not much effort given to write the answer. It's simply not thinking in the Ruby way... – 15 Volts Aug 10 '19 at 04:23
  • We can choose to disagree. But I’d remind you that there’s a difference between inefficient and incorrect, and between “correct for some platforms but not others” and incorrect. – anothermh Aug 10 '19 at 04:33
0

So I have come up with some code that does the job quite efficiently.

Firstly, we can use the IO#each_line method. Say we need the line at 3,000,000:

#!/usr/bin/ruby -w

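final = nil
read_upto = 3_000_000 - 1

# The block form of File.open closes the file automatically
File.open(File.join(__dir__, 'hello.txt')) do |file|
  file.each_line.with_index do |l, i|
    if i == read_upto
      final = l
      break    # stop reading as soon as the wanted line is found
    end
  end
end

p final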

Running with the time shell builtin:

[hello.txt is a big file in which every line reads "#!/usr/bin/ruby -w #lineno", where lineno is that line's number.]

$ time ruby p.rb
"#!/usr/bin/ruby -w #3000000\n"

real    0m1.298s
user    0m1.240s
sys     0m0.043s

We can also get the 1st line very easily! You got it...
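
The same loop can be packaged as a reusable method. This is a sketch, not the answer's exact code: the name `nth_line` is hypothetical, and it uses `IO#lineno`, which counts the lines read from the handle so far, in place of `with_index`.

```ruby
# Hypothetical helper: return line n (1-based), or nil past end of file.
def nth_line(path, n)
  File.open(path) do |file|
    file.each_line do |line|
      # IO#lineno is the number of lines read so far from this handle.
      return line if file.lineno == n
    end
  end
  nil
end
```

For the first line, `nth_line(path, 1)` returns after a single read.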

Secondly, extending anothermh's answer:

#!/usr/bin/ruby -w

enum = IO.foreach(File.join(__dir__, 'hello.txt'))

# Getting the first line
p enum.first

# Getting the 100th line
# This still builds an array of the first 100 lines,
# so memory use grows with the line number requested
p enum.take(100)[-1]

# The time-consuming but memory-efficient way:
# step to the 3,000,000th line one line at a time
# with Enumerator#next in a while loop

index, i = 3_000_000 - 1, 0
enum.next && i += 1 while i < index
p enum.next    # reading the 3,000,000th line

Running with time:

$ time ruby p.rb
"#!/usr/bin/ruby -w #1\n"
"#!/usr/bin/ruby -w #100\n"
"#!/usr/bin/ruby -w #3000000\n"

real    0m2.341s
user    0m2.274s
sys     0m0.050s
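
To compare the two approaches inside one process instead of with the shell's `time`, a sketch along these lines could be used. All three method names are hypothetical, `Benchmark.realtime` comes from the standard library, and no particular timings are implied.

```ruby
require 'benchmark'

# Internal iteration: the block is called once per line until we return.
def nth_by_each_line(path, n)
  File.foreach(path).with_index(1) { |line, i| return line if i == n }
  nil
end

# External iteration: step the Enumerator by hand with #next.
# Note: the underlying file stays open until the Enumerator is collected.
def nth_by_next(path, n)
  enum = File.foreach(path)
  (n - 1).times { enum.next }
  enum.next
rescue StopIteration
  nil
end

# Hypothetical driver: wall-clock seconds taken by each approach.
def compare(path, n)
  {
    each_line: Benchmark.realtime { nth_by_each_line(path, n) },
    enum_next: Benchmark.realtime { nth_by_next(path, n) }
  }
end
```

Calling `compare(File.join(__dir__, 'hello.txt'), 3_000_000)` would return a hash of two elapsed times in seconds.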

There are other ways, such as IO#readpartial and IO#sysread. But IO.foreach and IO#each_line are the easiest to use and are quite fast.
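
For the IO#readpartial route, a minimal sketch might read fixed-size chunks until a newline shows up. The helper name `first_line_chunked` and the 4096-byte default are assumptions made for illustration; the file is opened in binary mode, so the result is a binary-encoded string.

```ruby
# Hypothetical helper: read the first line in fixed-size chunks.
# Returns the line including its newline, the whole content if the file
# has no newline, or nil for an empty file.
def first_line_chunked(path, chunk_size = 4096)
  buffer = String.new
  File.open(path, 'rb') do |f|
    begin
      # Keep appending chunks until the buffer contains a newline.
      until (newline_at = buffer.index("\n"))
        buffer << f.readpartial(chunk_size)
      end
      return buffer[0..newline_at]
    rescue EOFError
      # Reached end of file without seeing a newline.
    end
  end
  buffer.empty? ? nil : buffer
end
```

Memory use is bounded by the length of the first line plus one chunk, regardless of the file's total size.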

Hope this helps!

15 Volts