
I have a CSV file that contains line breaks inside quoted fields. I would like to get rid of them (replacing each one with a space) so that I can run CSV.parse line by line.

My original is a string containing:

"a","b",c,"d
e",f,g,"h
i",j
k,"l","m
n","o"

and I would like to end up parsing a string containing:

"a","b",c,"d e",f,g,"h i",j
k,"l","m n","o" 

How can I do that in Ruby?

An effective, down-to-earth solution, thanks to user @sln:

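fichier = File.open("baz.csv")

# Placeholder: replace the body with whatever should happen to each rebuilt line.
def doWhatYouWill(string)
  puts string
end

# Matches a line whose quotation marks are balanced (regex by user @sln).
matchesBalancedLinesFromUser_sln = /^[^"]*(?:"[^"]*"[^"]*)*$/

mem = ""
fichier.each_line do |ligne|
  mem += ligne.chomp # as long as the quotation marks are not balanced,
                     # we keep concatenating the lines
  if mem =~ matchesBalancedLinesFromUser_sln
    ligneReplaced = mem + "\n"
    doWhatYouWill(ligneReplaced)
    mem = ""
  else
    mem += " " # the line break was inside quotes: replace it with a space
  end
end

fichier.rewind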
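To see what the regex decides at each step, here is a minimal sketch that applies it to the partial lines built from the sample above. The helper name `balanced?` is only an illustration, not part of the original solution.

```ruby
# Regex from the solution above: it matches only when every opening
# quotation mark has a matching closing one.
BALANCED = /^[^"]*(?:"[^"]*"[^"]*)*$/

# Hypothetical helper, named here for illustration only.
def balanced?(text)
  !!(text =~ BALANCED)
end

puts balanced?('"a","b",c,"d')                  # false: "d is still open
puts balanced?('"a","b",c,"d e",f,g,"h')        # false: "h is still open
puts balanced?('"a","b",c,"d e",f,g,"h i",j')   # true: complete record
```

Note that `^` and `$` are line anchors in Ruby, so this check is only meaningful on a string that contains no line breaks, which is the case for `mem` in the loop.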

Another way to do it without a regex, just by counting the quotation marks:

fichier = File.open("baz.csv")

# Placeholder: replace the body with whatever should happen to each rebuilt line.
def doWhatYouWill(string)
  puts string
end

mem = ""
fichier.each_line do |ligne|
  mem += ligne.chomp # as long as the quotation marks are not balanced,
                     # we keep concatenating the lines
  if mem.count('"').even? # an even number of quotation marks: record complete
    ligneReplaced = mem + "\n"
    doWhatYouWill(ligneReplaced)
    mem = ""
  else
    mem += " " # the line break was inside quotes: replace it with a space
  end
end

fichier.rewind
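As a self-contained check, the quote-counting idea can be sketched on the sample string from the question and the rebuilt lines handed to `CSV.parse_line`. The method name `join_quoted_lines` and the constant `SAMPLE` are made up for this example.

```ruby
require "csv"

# Hypothetical helper: joins physical lines until the number of quotation
# marks is even, replacing each embedded line break with a space.
def join_quoted_lines(text)
  logical_lines = []
  buffer = ""
  text.each_line do |line|
    buffer += line.chomp
    if buffer.count('"').even?
      logical_lines << buffer
      buffer = ""
    else
      buffer += " " # the line break was inside a quoted field
    end
  end
  logical_lines
end

# The sample data from the question.
SAMPLE = <<~CSV_TEXT
  "a","b",c,"d
  e",f,g,"h
  i",j
  k,"l","m
  n","o"
CSV_TEXT

join_quoted_lines(SAMPLE).each do |line|
  p CSV.parse_line(line)
end
# ["a", "b", "c", "d e", "f", "g", "h i", "j"]
# ["k", "l", "m n", "o"]
```

A quotation mark escaped as `""` inside a field adds two to the count, so it does not disturb the parity. If the input ends while a quote is still open, this sketch silently drops the leftover buffer.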

Note: this solution assumes that the quotation marks in the CSV file are balanced. If that is not the case, see the comments by user @sln.
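One simple way to test that assumption before joining any lines is to count the quotation marks in the whole text. This is only a sketch: the helper name `quotes_balanced?` is hypothetical, and an even count is a necessary condition for valid quoting, not a proof that the file is well formed.

```ruby
# Hypothetical helper: true when the text holds an even number of
# quotation marks. An odd count means some quote is never closed.
def quotes_balanced?(text)
  text.count('"').even?
end

good = %Q("a","b\nc",d\n)  # the quoted field spans two lines but is closed
bad  = %Q("a","b\nc,d\n)   # the quote opened before b is never closed

puts quotes_balanced?(good) # true
puts quotes_balanced?(bad)  # false
```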

marsupilam
  • Usually linebreaks represent end-of-record. Anyway, can you post some sample text. And, if you use a regex to parse that, why try to fix it for another csv parser, just use the data directly. –  Jul 22 '16 at 18:48
  • @sln Are you suggesting I should just write my own CSV parser ? I am not at that point in terms of technical ability yet... – marsupilam Jul 22 '16 at 18:53
  • Line breaks in quoted fields are entirely valid in CSV and Ruby's CSV library handles them correctly (it defaults to double-quotes but you can specify other characters with the `:quote_char` option). Please edit your question to include a minimal example of the data that's causing the error as well as the code you're using. – Jordan Running Jul 22 '16 at 18:55
  • You should test the output string with this `^[^"]*(?:"[^"]*"[^"]*)*$`. It sounds like an uneven/unbalanced quotes problem, not a newline in quoted fields. –  Jul 22 '16 at 19:04
  • But, yes using a regex for csv parsing is fairly easy, especially since most general use modules can't do the more intricate stuff. –  Jul 22 '16 at 19:06
  • @sln I am not attempting or claiming to do anything intricate. My interest in this question is as a Ruby-mere-user summer vacation project. – marsupilam Jul 22 '16 at 19:10
  • Please don't use "edit" or "update" tags in your question (or an answer). Instead, insert the changes where you would have initially. We can see when/where things changed if we want to. Also, don't use salutations, valedictions or signatures. SO isn't a discussion board, it's a reference book. Also, sticking to the details is better than being overly chatty; Again, SO isn't a discussion. – the Tin Man Jul 22 '16 at 19:24
  • It's better to read the file line-by-line than to `read` (AKA "slurp") it. Using a regular `foreach` would allow you to look for lines that were malformed, fix them, then submit the line to CSV to process it. The alternative is to pre-process the file to fix the problems, then reread it using CSV. Also, please read "[ask]" including the linked pages, and "[mcve]". You need to show the minimum code necessary to demonstrate the problem. – the Tin Man Jul 22 '16 at 19:28
  • Well, I was in the middle of posting a regex for ya, but got shut down, thanks a lot. Use this regex `\G((?:^|,)\s*)(?:("[^"\\]*(?:\\[\S\s][^"\\]*)*")|([^,]*))(?:(\s*)(?:(?=,)|$))` in a replace all with a _callback_. In the callback, a simple replacement is `$1 + substituteNewline( $2 ) + $3 + $4` and you're done.. And, always validate even quotes ahead of time, use the regex I posted above. –  Jul 22 '16 at 19:35
  • As an example, running your sample text using my regex with a simple replace of `$1$3$4` with no newline processing of `$2` yields `,,c,,f,g,,j \n k,,,` Or, you can run to the field of dreams in _How do I robustly parse malformed CSV?_ –  Jul 22 '16 at 19:48
  • This works, as I cannot answer... `fichier = File.open ("foo.csv") mem = "" fichier.each_line do |ligne| mem += ligne.delete("\n") # as long as we don't have balance for "", we cat the lines unless mem=~ /^([^"]*("[^"]*")*)*"[^"]*$/ #matches unbalanced lines puts mem + "\n" mem = "" end end fichier.rewind` – marsupilam Jul 22 '16 at 20:01
  • @marsupilam - Yeah, that would work, but there is no guarantee the whole file is balanced. Also, nobody will know what you're doing. And, it's slow as toothpaste. I'd slurp the whole thing into a string, validate balanced quotes, do the replace with callback, then write it out. It is the fastest way. Also, use Perl, not some substitute language. –  Jul 22 '16 at 20:26
  • @sln Yeah, that remark about Perl and Ruby is funny. But I really am a casual user, and Perl just makes my head ache. Thanks for your useful help, and sorry if you felt dissed (English is a foreign language to me)! – marsupilam Jul 22 '16 at 20:59
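
Following the comment above that Ruby's CSV library accepts line breaks inside quoted fields, here is a minimal sketch that parses the sample text in one go and replaces the embedded line breaks afterwards, field by field. The method name `parse_and_flatten` and the constant `SAMPLE_TEXT` are hypothetical, and the `&.` operator assumes Ruby 2.3 or later.

```ruby
require "csv"

# The sample data from the question.
SAMPLE_TEXT = <<~CSV_TEXT
  "a","b",c,"d
  e",f,g,"h
  i",j
  k,"l","m
  n","o"
CSV_TEXT

# Hypothetical helper: parses the whole text, then replaces any line break
# left inside a field with a space. Empty unquoted fields come back as nil,
# hence the safe-navigation operator.
def parse_and_flatten(text)
  CSV.parse(text).map do |row|
    row.map { |field| field&.gsub(/\r?\n/, " ") }
  end
end

parse_and_flatten(SAMPLE_TEXT).each { |row| p row }
# ["a", "b", "c", "d e", "f", "g", "h i", "j"]
# ["k", "l", "m n", "o"]
```

This reads the whole text into memory, so it trades the line-by-line processing asked for in the question for not having to rebuild the records by hand.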
