I am using SmarterCSV and have encountered a CSV file that contains blank lines. Is there any way to ignore them? SmarterCSV is taking the blank line as a header and not processing the file correctly. Is there any way I can bastardize the comment_regexp?

mail.attachments.each do |attachment|
  filename = attachment.filename
  #filedata = attachment.decoded
  puts filename
  begin
    tmp = Tempfile.new(filename)
    tmp.write attachment.decoded
    tmp.close
    puts tmp.path
    f = File.open(tmp.path, "r:bom|utf-8")
    options = {
      :comment_regexp => /^#/
    }
    data = SmarterCSV.process(f, options)
    f.close
    puts data
  end
end

Sample file: test.csv (image not reproduced)

Output: (image not reproduced)

i cant code
  • Did you check the documentation? Maybe this can help you: https://github.com/tilo/smarter_csv/blob/c0b804bfc5c4623ec9a1d75adb2ecefc9e96eee1/lib/smarter_csv/smarter_csv.rb. There is a sample there that handles the header line, which may guide you. – Nezir Jan 10 '20 at 13:48
  • could you post a sample of your CSV data? – microspino Jan 10 '20 at 14:16
  • Obviously, you could pre-process the file to remove the blank lines: `f = File.open(file_out, 'w'); IO.foreach(file_in) { |line| f.puts(line) unless line.strip.empty? }; f.close`. – Cary Swoveland Jan 10 '20 at 19:46
  • Please do not post pictures of data (or code). Anyone wanting to run code with your data should be able to simply cut-and-paste. You are forcing every reader who wants to do that to manually convert your picture of data to text. Moreover, your picture does not tell us whether tabs are present. Please replace the pictures with text. Before displaying your data you referred to a “csv”, the “c” meaning “comma”, the default field separator. It now appears that fields are separated by one or more spaces and/or tabs. That needs to be stated. – Cary Swoveland Jan 13 '20 at 15:55
  • OK, understood. Sorry, I didn't know. – i cant code Jan 14 '20 at 11:43

1 Answer

Let's first construct your file.

str = <<~_
#
# Report
#---------------
Date              header1           header2  header3      header4
        20200 jdk;df           4543 $8333              4387       

        20200 jdk              5004 $945876              67

_

fin_name = 'in'
File.write(fin_name, str)
  #=> 223

Two problems must be addressed to read this file using the method SmarterCSV::process. The first is that comments--lines beginning with an octothorpe ('#')--and blank lines must be skipped. The second is that the field separator is not a fixed-length string.

The first of these problems can be dealt with by setting the value of process' :comment_regexp option key to a regular expression:

:comment_regexp => /\A#|\A\s*\z/

which reads, "match an octothorpe at the beginning of the string (\A being the beginning-of-string anchor) or (|) match a string containing zero or more whitespace characters (\s being a whitespace character and \z being the end-of-string anchor)".
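To see what that pattern does and does not match, here is a minimal sketch in plain Ruby. The constant name SKIP_LINE and the sample lines are made up for illustration; only the regular expression comes from the answer.

```ruby
# Hypothetical constant name; the pattern itself is the one given above.
SKIP_LINE = /\A#|\A\s*\z/

samples = [
  "# Report\n",        # comment line
  "#---------------\n", # comment line
  "\n",                # empty line
  "   \t \n",          # whitespace-only line
  "Date  header1\n",   # header line
  "  20200 jdk\n"      # indented data line
]

samples.each do |line|
  verdict = line.match?(SKIP_LINE) ? "skip" : "keep"
  puts "#{verdict}: #{line.inspect}"
end
```

Note that a comment preceded by spaces (for example "  # note") would not be skipped, because \A# requires the octothorpe to be the first character of the line.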

Unfortunately, SmarterCSV is not capable of dealing with variable-length field separators. It does have an option :col_sep, but its value must be a string, not a regular expression.
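By contrast, plain Ruby's String#split does accept a regular expression, so a single data row from the sample can be broken on runs of whitespace. This is only a sketch of the idea of a variable-length separator; it does not involve SmarterCSV at all, and the variable names are illustrative.

```ruby
# One data row from the sample file, with its irregular spacing preserved.
row = "        20200 jdk;df           4543 $8333              4387       \n"

# Strip the surrounding whitespace, then split on one or more whitespace characters.
fields = row.strip.split(/\s+/)
p fields  #=> ["20200", "jdk;df", "4543", "$8333", "4387"]
```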

We must therefore pre-process the file before using SmarterCSV, though that is not difficult. While we are at it, we may as well remove the dollar signs and use commas for field separators (see note 1).

fout_name = 'out.csv'

fout = File.new(fout_name, 'w')
File.foreach(fin_name) do |line|
  fout.puts(line.strip.gsub(/\s+\$?/, ',')) unless 
    line.match?(/\A#|\A\s*\z/)
end
fout.close
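As a hedged sketch, the same filtering and substitution can also be written as a method that works on an array of lines, which makes it easy to check without touching the disk. The method name clean_lines is hypothetical and is not part of SmarterCSV; the two regular expressions are the ones used above.

```ruby
# Hypothetical helper (not part of SmarterCSV): drops comment and blank lines,
# then replaces each run of whitespace (plus an optional "$" after it) with a comma.
def clean_lines(lines)
  lines
    .reject { |line| line.match?(/\A#|\A\s*\z/) }
    .map    { |line| line.strip.gsub(/\s+\$?/, ',') }
end

raw = [
  "#\n",
  "# Report\n",
  "#---------------\n",
  "Date              header1           header2  header3      header4\n",
  "        20200 jdk;df           4543 $8333              4387       \n",
  "\n",
  "        20200 jdk              5004 $945876              67\n",
  "\n"
]

puts clean_lines(raw)
```

Keeping the transformation separate from the file handling is a design choice, not a requirement; the File.foreach loop above does the same work in a single pass.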

Let's look at the file produced.

puts File.read(fout_name)

displays

Date,header1,header2,header3,header4
20200,jdk;df,4543,8333,4387
20200,jdk,5004,945876,67

Now that's what a CSV file should look like! We may now use SmarterCSV on this file with no options specified:

SmarterCSV.process(fout_name)
  #=> [{:date=>20200, :header1=>"jdk;df", :header2=>4543,
  #     :header3=>8333, :header4=>4387},
  #    {:date=>20200, :header1=>"jdk", :header2=>5004,
  #     :header3=>945876, :header4=>67}]

1. I used IO::foreach to read the file line-by-line and then write each manipulated line that is neither a comment nor a blank line to the output file. If the file is not huge we could instead gulp it into a string, modify the string and then write the resulting string to the output file: File.write(fout_name, File.read(fin_name).gsub(/^#.*?\n|^[ \t]*\n|^[ \t]+|[ \t]+$|\$/, '').gsub(/[ \t]+/, ',')). The first regular expression reads, "match lines beginning with an octothorpe or lines containing only spaces and tabs or spaces and tabs at the beginning of a line or spaces and tabs at the end of a line or a dollar sign". The second gsub merely converts each remaining run of tabs and spaces to a comma.
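The two-step gsub in this note can be tried on an in-memory string. The sketch below rebuilds the sample text from an array of lines (the variable names are illustrative) and applies the same two regular expressions, without reading or writing any file.

```ruby
# Sample text rebuilt line by line so the irregular spacing is explicit.
text = [
  '#',
  '# Report',
  '#---------------',
  'Date              header1           header2  header3      header4',
  '        20200 jdk;df           4543 $8333              4387       ',
  '',
  '        20200 jdk              5004 $945876              67',
  ''
].join("\n") + "\n"

# Step 1: delete comment lines, blank lines, leading and trailing
#         spaces/tabs, and dollar signs.
# Step 2: turn each remaining run of spaces/tabs into a comma.
cleaned = text
  .gsub(/^#.*?\n|^[ \t]*\n|^[ \t]+|[ \t]+$|\$/, '')
  .gsub(/[ \t]+/, ',')

puts cleaned
```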


Cary Swoveland
  • Much appreciated, I'll try this tomorrow and will let you know. Thanks for your help. – i cant code Jan 12 '20 at 21:51
  • I've updated the original post with sample data from the csv file. I used IRB to show what's happening. From what I can tell it's skipping the # lines fine but then just processing column A. I've tried the following but get the same result: `t = File.open('test.csv', "r:bom|utf-8")` and execute with `test = SmarterCSV.process("test.csv", {:comment_regexp => /^#/} )` – i cant code Jan 13 '20 at 12:29