Here is a Ruby solution that reads the file line-by-line. At the end I show how much simpler the solution could be if the file could be gulped into a string.
Let's first create an input file to work with.
str =<<~_
apple|pear
apple|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson
cherry|ruddy
cherry|cerise
_
file_name_in = 'file_in'
File.write(file_name_in, str)
#=> 112
Solution when file is read line-by-line
We can produce the desired output file with the following method.
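def doit(file_name_in, file_name_out)
  fin = File.new(file_name_in, "r")
  fout = File.new(file_name_out, "w")
  str = ''                      # output line being accumulated
  until fin.eof?
    s = fin.gets.strip
    k,v = s.split(/(?=\|)/)     # e.g. "apple|pear" -> ["apple", "|pear"]
    if str.empty?               # first line of the file
      str = s
      key = k
    elsif k == key              # same key: append "|value"
      str << v
    else                        # new key: flush the previous line
      fout.puts(str)
      str = s
      key = k
    end
  end
  fout.puts(str)                # flush the last accumulated line
  fin.close
  fout.close
end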
Let's try it.
file_name_out = 'file_out'
doit(file_name_in, file_name_out)
puts File.read(file_name_out)
prints the following.
apple|pear|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy|cerise
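As a variation, the same line-by-line idea could be sketched with the block forms of File.open and File.foreach, which close the files automatically even if an exception is raised part-way through. The method name merge_runs is hypothetical (it is not part of the solution above), and the sketch assumes each non-blank line contains a '|'.

```ruby
# Hypothetical variant of the line-by-line approach; merge_runs is an
# illustrative name, not the method defined above.
def merge_runs(file_name_in, file_name_out)
  File.open(file_name_out, 'w') do |fout|
    key = nil      # key of the run currently being accumulated
    values = []    # values collected for that key
    File.foreach(file_name_in) do |line|
      k, v = line.strip.split('|', 2)
      next if k.nil?                     # skip blank lines
      if k == key
        values << v                      # same key: keep collecting
      else
        fout.puts([key, *values].join('|')) unless key.nil?
        key = k                          # new key: start a new run
        values = [v]
      end
    end
    # Flush the final run, if the file had any content.
    fout.puts([key, *values].join('|')) unless key.nil?
  end
end
```

Only one line of input and the current run are held in memory at any time, so this keeps the streaming property of doit.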
Note that
"apple|pear".split(/(?=\|)/)
#=> ["apple", "|pear"]
The regular expression contains the positive lookahead (?=\|), which matches the zero-width location between 'e' and '|'.
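To see why the lookahead is convenient here, it can be compared with a plain split on '|'. This is only a small illustration; the variable names are mine.

```ruby
line = "apple|pear"

# Splitting at the zero-width lookahead keeps the pipe attached to the value,
# so the value can be appended to the accumulated string unchanged.
with_pipe = line.split(/(?=\|)/)

# Splitting on the pipe itself discards the separator.
without_pipe = line.split('|')

p with_pipe     #=> ["apple", "|pear"]
p without_pipe  #=> ["apple", "pear"]
```

With the second form, the method would have to re-insert '|' before appending each value.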
Solution when file is gulped into a string
The OP does not want to gulp the file into a string (hence my solution above) but I would like to show how much simpler the problem is if one could do so. Here is one of many ways of doing that.
def gulp_it(file_name_in, file_name_out)
  File.write(file_name_out,
    File.read(file_name_in).gsub(/^(.+)\|.*[^ ]\K *\r?\n\1/, ''))
end
gulp_it(file_name_in, file_name_out)
#=> 98
puts File.read(file_name_out)
prints
apple|pear|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy
cherry|cerise
Thinking about what the regex engine will be doing, this may be acceptably fast, depending on file size, of course.
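In the output above, 'cherry|cerise' is left on its own line. gsub resumes scanning after each match, so in a run of three or more lines sharing a key, the line that was just joined to its predecessor is never itself tried as the start of a match. If whole runs must be merged while still reading the file in one go, one possible sketch groups consecutive lines by key with Enumerable#chunk_while. The method name gulp_and_group is hypothetical, and the sketch assumes every non-blank line contains a '|'.

```ruby
# Hypothetical alternative to gulp_it; gulp_and_group is an illustrative name.
# Assumes every non-blank line contains a '|'.
def gulp_and_group(file_name_in, file_name_out)
  merged = File.read(file_name_in)
               .each_line
               .map { |line| line.strip.split('|', 2) }    # [key, value] pairs
               .reject(&:empty?)                           # drop blank lines
               .chunk_while { |a, b| a.first == b.first }  # runs of equal keys
               .map { |run| [run.first.first, *run.map(&:last)].join('|') }
  File.write(file_name_out, merged.join("\n") + "\n")
end
```

This trades the single regular expression for a few enumerator steps, but it merges runs of any length.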
Regex demo
While the link uses the PCRE engine, the result would be the same using Ruby's regex engine (Onigmo). We can make the regular expression self-documenting by writing it in free-spacing mode.
/
^          # match the beginning of a line
(.+)       # match one or more characters, saved to capture group 1
\|.*[^ ]   # match '|', then zero or more chars, then a non-space
\K         # reset the starting point of the match and discard
           # any previously-matched characters
[ ]*       # match zero or more spaces
\r?\n      # match the line terminator(s)
\1         # match the content of capture group 1
/x         # invoke free-spacing mode
(.+) matches 'apple', 'banana' and 'cherry' because those words are at the beginning of lines. One could alternatively write ([^|]*).
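As a quick check, the free-spacing pattern can be compared with the compact one on a small string. In free-spacing mode unescaped whitespace in the pattern is ignored, which is why the literal space is written as [ ] there. The constant names and the sample string below are mine.

```ruby
COMPACT = /^(.+)\|.*[^ ]\K *\r?\n\1/

FREE_SPACING = /
  ^          # beginning of a line
  (.+)       # capture group 1: the key
  \|.*[^ ]   # '|', then any characters, ending in a non-space
  \K         # drop everything matched so far from the reported match
  [ ]*       # zero or more trailing spaces
  \r?\n      # line terminator
  \1         # the same key at the start of the next line
/x

# Illustrative input: the first line has two trailing spaces.
sample = "apple|pear  \napple|quince\nbanana|plantain\n"

compact_result = sample.gsub(COMPACT, '')
free_spacing_result = sample.gsub(FREE_SPACING, '')

puts compact_result
puts compact_result == free_spacing_result
```

Both substitutions remove the trailing spaces, the line break and the repeated key, leaving "apple|pear|quince" on one line.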