
The incoming data file(s) contain malformed CSV data such as unescaped quotes, as well as (valid) CSV data such as fields containing newlines. If a CSV format error is detected, I would like to use an alternative routine on that data.

With the following sample code (abbreviated for simplicity)

require 'fastercsv'

FasterCSV.open( file ){|csv|
  row = true
  while row
    begin
      row = csv.shift
      break unless row
      # Do things with the good rows here...

    rescue FasterCSV::MalformedCSVError => e
      # Do things with the bad rows here...
      next
    end
  end
}

The MalformedCSVError is raised inside the csv.shift method. How can I access the data that caused the error from the rescue clause?

s01ipsist

3 Answers

require 'csv' # CSV in Ruby 1.9.2 is identical to FasterCSV

# File.open('test.txt','r').each do |line|
DATA.each do |line|
  begin
    CSV.parse(line) do |row|
      p row # handle row
    end
  rescue CSV::MalformedCSVError => er
    puts er.message
    puts "This one: #{line}"
    # and continue
  end
end

# Output:

# Unclosed quoted field on line 1.
# This one: 1,"aaa
# Illegal quoting on line 1.
# This one: aaa",valid
# Unclosed quoted field on line 1.
# This one: 2,"bbb
# ["bbb", "invalid"]
# ["3", "ccc", "valid"]   

__END__
1,"aaa
aaa",valid
2,"bbb
bbb,invalid
3,ccc,valid

Just feed the file line by line to FasterCSV and rescue the error.

steenslag
    This was my first instinct as well, but as noted in my answer I'm afraid this will equate to a pretty big performance hit, as `FasterCSV.parse` instantiates a new FasterCSV object every time. FasterCSV's constructor isn't especially hefty but if you have to do it for every line in, say, a 10-million-line file it adds up fast. – Jordan Running Oct 06 '11 at 08:10
  • How would this handle the valid CSV rows that contain new lines? – s01ipsist Oct 06 '11 at 08:55
  • That's two lines, @steenslag, and will be split up by `File#each`. – Jordan Running Oct 06 '11 at 09:24
  • @Jordan Are we talking about the same thing? ( added sample data and output to code ). – steenslag Oct 06 '11 at 09:46
  • @steenslag Your `DATA` section has literal \ and n characters, not newline control characters (you're not in Ruby anymore, remember--after `__END__` a backslash is just a backslash). I've updated your code (hope you don't mind, feel free to revert it) to show the actual use case and output. The two lines immediately following `__END__` constitute a single (valid) "row" in CSV. – Jordan Running Oct 06 '11 at 09:58
  • I ended up developing on this concept to "fix" the malformed CSV data into a clean CSV file. Certain common errors (to this data set) such as unquoted "s (as long as there aren't quoted ,s too) can be detected. It's not pretty, or fast, but it does the job... – s01ipsist Oct 18 '11 at 05:09
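For reference, a rough sketch of the clean-up idea from the last comment above (hypothetical, with placeholder filenames): re-quote every field and double any stray quotes. This is safe only when quoted fields never contain commas, and, like the per-line approaches here, it ignores quoted newlines.

# Hypothetical clean-up pass: rewrite the file as valid CSV by re-quoting
# every field and doubling embedded quotes. Only safe when no field
# contains a quoted comma, because the split on ',' is naive.
File.open('clean.csv', 'w') do |out|
  File.foreach('dirty.csv') do |line|
    fields = line.chomp.split(',', -1)  # -1 keeps trailing empty fields
    fixed  = fields.map { |f| '"' + f.gsub('"', '""') + '"' }
    out.puts fixed.join(',')
  end
end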

This is going to be really difficult. Some things that make FasterCSV, well, faster, make this particularly hard. Here's my best suggestion: FasterCSV can wrap an IO object. What you could do, then, is to make your own subclass of File (itself a subclass of IO) that "holds onto" the result of the last gets. Then when FasterCSV raises an exception you can ask your special File object for the last line. Something like this:

class MyFile < File
  attr_accessor :last_gets

  def gets(*args)
    line = super
    # IO#gets keeps the line separator, so successive chunks rejoin cleanly
    (@last_gets ||= '') << line unless line.nil?
    line
  end
end

# then...

file  = MyFile.open(filename, 'r')
csv   = FasterCSV.new file

row = true
while row
  begin
    break unless row = csv.shift

    # do things with the good row here...

  rescue FasterCSV::MalformedCSVError => e
    bad_row = file.last_gets

    # do something with bad_row here...

    next
  ensure
    file.last_gets = '' # nuke the @last_gets "buffer"
  end
end

Kinda neat, right? BUT! There are caveats, of course:

  1. I'm not sure how much of a performance hit you take when you add an extra step to every gets call. It might be an issue if you need to parse multi-million-line files in a timely fashion.

  2. This might or might not fail if your CSV file contains newline characters inside quoted fields. The reason for this is described in the source--basically, if a quoted value contains a newline then shift has to do additional gets calls to get the entire line. There could be a clever way around this limitation but it's not coming to me right now. If you're sure your file doesn't have any newline characters within quoted fields then this shouldn't be a worry for you, though.

Your other option would be to read the file using File#gets and pass each line in turn to FasterCSV.parse_line, but I'm pretty sure in so doing you'd squander any performance advantage gained from using FasterCSV.
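A minimal sketch of that line-by-line alternative (assuming the fastercsv gem; 'data.csv' is a placeholder filename). Note that caveat 2 above applies here as well, since each line is parsed in isolation:

require 'fastercsv'

File.foreach('data.csv') do |line|
  begin
    row = FasterCSV.parse_line(line)
    # do things with the good row here...
  rescue FasterCSV::MalformedCSVError => e
    # the raw offending line is available directly
    puts "Bad row: #{line}"
  end
end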

Jordan Running
  • Indeed, as mentioned, these files may contain quoted newline characters. I can avoid the unescaped quotes errors by using a FasterCSV parameter :quote_char => '|', but this chokes on the newlines... – s01ipsist Oct 06 '11 at 08:58
  • Right you are. I did come up with a partial solution, which I've changed my code to reflect. Instead of having a new `@last_gets` on every `File#gets`, append the last chunk to it every time, then reset it only after a row is successfully retrieved *or* the exception is thrown. No, it's still not a complete solution, but I hope it gives you an idea or two. – Jordan Running Oct 06 '11 at 09:39

I used Jordan's file subclassing approach to fix the problem with my input data before CSV ever tries to parse it. In my case, I had a file that used \" to escape quotes, instead of the "" that CSV expects. Hence,

class MyFile < File
  def gets(*args)
    line = super
    # fix the \" that would otherwise cause a parse error
    line.gsub!('\\"', '""') unless line.nil?
    line
  end
end

infile = MyFile.open(filename)
incsv = CSV.new(infile)

while row = incsv.shift
  # process each row here
end

This allowed me to parse the non-standard CSV file. Ruby's CSV implementation is very strict and often has trouble with the many variants of the CSV format.

Jonathan Stray