I have a file that contains chunks of text separated by blank lines, like this:
block 1
some text
some text
block 2
some text
some text
How can I read it into an array?
This is asked often enough that I thought it'd be useful to explain what to do, but first I need to say this:
Don't try to read a file in one big gulp. That's called "slurping", and it's a bad idea unless you can guarantee you'll ALWAYS get files significantly smaller than 1MB. See "Why is 'slurping' a file not a good practice?" for more information.
If I have a file that looks like:
block 1
some text
some text
block 2
some text
some text
and I read it normally, I'd get something like:
File.read('foo.txt')
#=> "block 1\nsome text\nsome text\n\nblock 2\nsome text\nsome text\n"
which would leave me having to split it into separate lines, find the blank lines, and then break it into chunks. And, invariably, the naive solution would be to use a regular expression, which kind of works but isn't optimal.
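A first pass at that usually looks something like this hypothetical sketch, which slurps the whole file and still leaves a stray trailing line-end to clean up:
File.read('foo.txt').split(/\n{2,}/)
#=> ["block 1\nsome text\nsome text", "block 2\nsome text\nsome text\n"]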
Or we could try:
File.readlines('foo.txt')
#=> ["block 1\n", "some text\n", "some text\n", "\n", "block 2\n", "some text\n", "some text\n"]
and then still have to find the blank lines and turn the array into sub-arrays.
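Just to show how much busywork that is, here's a sketch of one way to do that grouping by hand with Enumerable#chunk; this is my own illustration, assuming the same foo.txt:
File.readlines('foo.txt')
    .map(&:chomp)               # drop the trailing line-ends
    .chunk(&:empty?)            # group runs of blank vs. non-blank lines
    .reject { |is_blank, _| is_blank }
    .map { |_, lines| lines }
#=> [["block 1", "some text", "some text"], ["block 2", "some text", "some text"]]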
Instead, there are two easy ways to load the file.
Keeping in mind the previous warning about slurping files, if it's a small file we can use:
File.readlines('foo.txt', "\n\n")
#=> ["block 1\nsome text\nsome text\n\n", "block 2\nsome text\nsome text\n"]
Notice the use of "\n\n" as the second parameter. That's the "separator" argument, i.e. the record separator. Normally a record is a single line, ended by "\n" on *nix-type OSes and "\r\n" on Windows, and Ruby tracks its default separator in the global $/, known affectionately as $RS or $INPUT_RECORD_SEPARATOR; those aliases come from, and are documented in, the English module. A record separator is simply the string used to split a text file into records: individual lines normally, or, for our purposes, groups of lines delimited by two consecutive line-ends, in other words, paragraphs.
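For the curious, a quick irb session shows those globals and their default values (a minimal sketch; requiring the English module is what enables the long names):
require 'English'

$/                       #=> "\n"
$RS                      #=> "\n"
$INPUT_RECORD_SEPARATOR  #=> "\n"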
Once read, it's easy to clean up the contents to remove the trailing line-ends:
File.readlines('foo.txt', "\n\n").map(&:rstrip)
#=> ["block 1\nsome text\nsome text", "block 2\nsome text\nsome text"]
Or break them into sub-arrays:
File.readlines('foo.txt', "\n\n").map{ |s| s.rstrip.split("\n") }
#=> [["block 1", "some text", "some text"], ["block 2", "some text", "some text"]]
All the examples could be used in a pattern similar to:
File.readlines('foo.txt', "\n\n").map(&:rstrip).each do |paragraph|
  # do something with the paragraph
end
or:
File.readlines('foo.txt', "\n\n").map{ |s| s.rstrip.split("\n") }.each do |paragraph|
  # do something with the sub-array `paragraph`
end
If it's a big file, we can use Ruby's line-by-line IO via foreach if the file isn't already open, or each_line if it's an already opened file. And, since you read the link above, you already know why we'd want to use line-by-line IO.
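For completeness, here's a sketch of what the each_line flavor looks like on a file handle we've opened ourselves (the rest of the examples stick with foreach):
File.open('foo.txt') do |io|
  io.each_line("\n\n") do |paragraph|
    # do something with the paragraph, e.g. paragraph.rstrip.split("\n")
  end
end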
File.foreach('foo.txt', "\n\n") #=> #<Enumerator: File:foreach("foo.txt", "\n\n")>
foreach returns an Enumerator, so we need to tack on to_a to force it to read the file into an array so we can see the results; normally we wouldn't have to do that:
File.foreach('foo.txt', "\n\n").to_a
#=> ["block 1\nsome text\nsome text\n\n", "block 2\nsome text\nsome text\n"]
It's easy to use foreach just like readlines above:
File.foreach('foo.txt', "\n\n").map(&:rstrip)
#=> ["block 1\nsome text\nsome text", "block 2\nsome text\nsome text"]
File.foreach('foo.txt', "\n\n").map(&:rstrip).map{ |s| s.rstrip.split("\n") }
#=> [["block 1", "some text", "some text"], ["block 2", "some text", "some text"]]
Note: I strongly suspect using map like that causes a problem similar to slurping the file, since Ruby will buffer the entire output of foreach before passing it to map. Instead, we should do the manipulation of each paragraph inside the do block:
File.foreach('foo.txt', "\n\n") do |ary|
ary.rstrip.split("\n").each do |line|
# do something with the individual line
end
end
Doing that is a small hit in performance but, because the goal is to process by paragraphs or blocks, it's acceptable.
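If you'd rather keep the map style, a lazy enumerator gets a similar effect without buffering every paragraph first; this is my own aside, not part of the pattern above:
File.foreach('foo.txt', "\n\n").lazy.map{ |s| s.rstrip.split("\n") }.each do |paragraph|
  # each paragraph is an array of lines, handled as soon as it's read
end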
Also note, this is a community Wiki, so edit and contribute appropriately.