Let's first construct your file.
str = <<~_
#
# Report
#---------------
Date header1 header2 header3 header4
20200 jdk;df 4543 $8333 4387
20200 jdk 5004 $945876 67
_
fin_name = 'in'
File.write(fin_name, str)
#=> 223
Two problems must be addressed to read this file using the method SmarterCSV::process. The first is that comments (lines beginning with an octothorpe, '#') and blank lines must be skipped. The second is that the field separator is not a fixed-length string.
The first of these problems can be dealt with by setting the value of process' :comment_regexp option key to a regular expression:
:comment_regexp => /\A#|\A\s*\z/
which reads, "match an octothorpe at the beginning of the string (\A being the beginning-of-string anchor) or (|) match a string consisting of zero or more whitespace characters and nothing else (\s being a whitespace character and \z being the end-of-string anchor)".
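To see what that pattern does and does not skip, here is a small sketch that applies it to a few lines. The sample lines are invented for illustration and are not taken from the file above.

```ruby
# The pattern is the one from the text; the sample lines are invented.
COMMENT_OR_BLANK = /\A#|\A\s*\z/

samples = [
  "# Report\n",     # comment
  "\n",             # empty line
  " \t \n",         # whitespace only
  "Date header1\n", # data
  "20200 jdk #x\n"  # '#' is not at the start, so this is still data
]

# partition returns [matching lines, non-matching lines]
skipped, kept = samples.partition { |line| line.match?(COMMENT_OR_BLANK) }

puts "skipped: #{skipped.size}, kept: #{kept.size}"
kept.each { |line| puts line }
```

Note that only a '#' in the first column marks a comment; a '#' later in a line is treated as data.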
Unfortunately, SmarterCSV is not capable of dealing with variable-length field separators. It does have an option :col_sep, but its value must be a string, not a regular expression.
We must therefore pre-process the file before using SmarterCSV, though that is not difficult. While we are at it, we may as well remove the dollar signs and use commas for field separators.¹
fout_name = 'out.csv'
fout = File.new(fout_name, 'w')
File.foreach(fin_name) do |line|
  fout.puts(line.strip.gsub(/\s+\$?/, ',')) unless
    line.match?(/\A#|\A\s*\z/)
end
fout.close
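The per-line transformation in the loop above can be tried on its own. This is a minimal sketch: the helper name to_csv_line is hypothetical, and the input line is an invented example with irregular spacing.

```ruby
# Same expression as in the loop above, wrapped in a hypothetical helper.
def to_csv_line(line)
  # strip removes leading/trailing whitespace (including the newline);
  # each run of whitespace, plus an optional following '$', becomes a comma.
  line.strip.gsub(/\s+\$?/, ',')
end

row = to_csv_line("20200   jdk;df \t 4543  $8333    4387\n")
puts row
```

One consequence of this regular expression is that a dollar sign is removed only when it directly follows whitespace, so a '$' at the very start of a line would be left in place.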
Let's look at the file produced.
puts File.read(fout_name)
displays
Date,header1,header2,header3,header4
20200,jdk;df,4543,8333,4387
20200,jdk,5004,945876,67
Now that's what a CSV file should look like! We may now use SmarterCSV on this file with no options specified:
SmarterCSV.process(fout_name)
#=> [{:date=>20200, :header1=>"jdk;df", :header2=>4543,
# :header3=>8333, :header4=>4387},
# {:date=>20200, :header1=>"jdk", :header2=>5004,
# :header3=>945876, :header4=>67}]
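If SmarterCSV is not at hand, the cleaned-up text can be cross-checked with Ruby's standard CSV library instead. This is a sketch using a different library from the one in the answer, with the file contents inlined as a string; the options shown are chosen to produce symbol keys and numeric values similar to the result above.

```ruby
require 'csv'

# The cleaned-up content shown earlier, inlined so the sketch is self-contained.
csv_text = <<~ROWS
  Date,header1,header2,header3,header4
  20200,jdk;df,4543,8333,4387
  20200,jdk,5004,945876,67
ROWS

rows = CSV.parse(
  csv_text,
  headers: true,              # first line holds the column names
  header_converters: :symbol, # "Date" becomes :date
  converters: :numeric        # "4543" becomes 4543
).map(&:to_h)

p rows
```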
1. I used IO::foreach to read the file line-by-line and then write each manipulated line that is neither a comment nor a blank line to the output file. If the file is not huge we could instead gulp it into a string, modify the string and then write the resulting string to the output file:
File.write(fout_name, File.read(fin_name).gsub(/^#.*?\n|^[ \t]*\n|^[ \t]+|[ \t]+$|\$/, '').gsub(/[ \t]+/, ','))
The first regular expression reads, "match lines beginning with an octothorpe or lines containing only spaces and tabs or spaces and tabs at the beginning of a line or spaces and tabs at the end of a line or a dollar sign". The second gsub merely converts each remaining run of tabs and spaces to a comma.
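The two gsub calls in the footnote can be exercised on a string without touching the file system. In this sketch the method name clean and the sample text are made up for illustration; the regular expressions are the ones given in the footnote.

```ruby
# Hypothetical helper wrapping the two gsub calls from the footnote.
def clean(text)
  text
    .gsub(/^#.*?\n|^[ \t]*\n|^[ \t]+|[ \t]+$|\$/, '') # drop comments, blank lines, edge whitespace, '$'
    .gsub(/[ \t]+/, ',')                              # remaining runs of spaces/tabs become commas
end

# Invented sample with comments, a blank line, tabs, and leading/trailing spaces.
raw = [
  "#",
  "# Report",
  "",
  "Date   header1\theader2",
  "  20200  jdk;df   $8333  "
].join("\n") + "\n"

cleaned = clean(raw)
puts cleaned
```

With files, the same helper could be used as File.write(fout_name, clean(File.read(fin_name))).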