
The regular expression I am looking for has to handle several different patterns.

Those are the 3 different patterns.

"10.1234/altetric55,Awesome Steel Chair,1011-2513"
"\"Sporer, Kihn and Turner\",2885-6503"
"Bartell-Collins,1167-8230"

I will have to pass this regular expression to a ruby split method.

line.split(/regular_expression/)

The idea is to split the text on every comma, except (as in the second string) when the comma is part of a quoted piece of text.

thanks

Gerard Morera
  • See [Regex to pick commas outside of quotes](http://stackoverflow.com/questions/632475/regex-to-pick-commas-outside-of-quotes). It should solve your issue. – Wiktor Stribiżew Nov 02 '15 at 21:09
  • Please show your desired output for each of the three strings. – Cary Swoveland Nov 02 '15 at 21:17
  • What is wrong with CSV parser? See [this IDEONE demo](http://ideone.com/GLc8cq) or [this one](http://ideone.com/uEnyYb). – Wiktor Stribiżew Nov 02 '15 at 21:33
  • @stribizhev the expected output is ["10.1234/altetric55", "Awesome Steel Chair", "1011-2513"] ["Sporer, Kihn and Turner", "2885-6503"] ["Bartell-Collins", "1167-8230"] – Gerard Morera Nov 02 '15 at 21:34
  • Using Ruby's built-in [CSV](http://ruby-doc.org/stdlib-2.2.3/libdoc/csv/rdoc/index.html) class is my recommendation. It's designed to handle the sort of comma-separated-values you show, including those with embedded commas inside quotes. Don't try to do it with a regex, instead rely on the pre-written, well-tested code. – the Tin Man Nov 02 '15 at 21:57
  • I found another possible duplicate original: [Ruby on Rails - Import Data from a CSV file](http://stackoverflow.com/questions/4410794/ruby-on-rails-import-data-from-a-csv-file). – Wiktor Stribiżew Nov 02 '15 at 22:01
  • The same question is [here](http://stackoverflow.com/questions/32322875/how-could-i-split-commas-excepts-its-in-double-quotes). However, none of the answers there employ a (correct) regex, whereas @Casimir has offered one here, so I would advise against closing this question as a dup of the above-mentioned one. The selected answer there employs the `CSV` module (which makes sense), but I welcomed the opportunity of using Ruby's somewhat obscure `flip-flop` operator. – Cary Swoveland Nov 02 '15 at 23:02

2 Answers


In this case, don't try to split on each comma that is not enclosed in quotes. Instead, match everything that is either not a comma or is content between quotes, using `scan` with this pattern:

"10.1234/altetric55,Awesome Steel Chair,1011-2513".scan(/[^,"]*(?:"[^"\\]*(?:\\.[^"\\]*)*"[^,"]*)*/)

or to avoid empty items:

"10.1234/altetric55,Awesome Steel Chair,1011-2513".scan(/[^,"]+(?:"[^"\\]*(?:\\.[^"\\]*)*"[^,"]*)*|(?:"[^"\\]*(?:\\.[^"\\]*)*")+/)

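As a rough illustration (not part of the original answer), the second pattern can be run over all three strings from the question. The constant name `FIELD` and the helper `unquote` are made up for this sketch. Note that `scan` keeps the surrounding double quotes, so they are stripped in a separate step here; that step assumes a field is either fully quoted or not quoted at all.

```ruby
# The second pattern from above, stored under a hypothetical constant name.
FIELD = /[^,"]+(?:"[^"\\]*(?:\\.[^"\\]*)*"[^,"]*)*|(?:"[^"\\]*(?:\\.[^"\\]*)*")+/

# Hypothetical helper: drop one leading and one trailing double quote, if present.
# Requires Ruby 2.5+ for delete_prefix / delete_suffix.
def unquote(field)
  field.delete_prefix('"').delete_suffix('"')
end

lines = [
  "10.1234/altetric55,Awesome Steel Chair,1011-2513",
  "\"Sporer, Kihn and Turner\",2885-6503",
  "Bartell-Collins,1167-8230"
]

# The pattern has no capture groups, so scan returns a flat array of matches.
results = lines.map { |line| line.scan(FIELD).map { |f| unquote(f) } }
results.each { |fields| p fields }
```

Because this variant of the pattern skips empty items, a line such as `a,,b` would yield two fields rather than three, which may or may not be what is wanted.
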
But you can avoid these complex patterns altogether by using the CSV class:

require 'csv'
CSV.parse("\"Sporer, Kihn and Turner\",2885-6503")
=> [["Sporer, Kihn and Turner", "2885-6503"]] 
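
For single lines like the ones in the question, `CSV.parse_line` may be a closer fit than `CSV.parse`, since it returns one flat array of fields instead of an array of rows. A minimal sketch over the three sample strings (the variable names are only illustrative):

```ruby
require 'csv'

lines = [
  "10.1234/altetric55,Awesome Steel Chair,1011-2513",
  "\"Sporer, Kihn and Turner\",2885-6503",
  "Bartell-Collins,1167-8230"
]

# parse_line handles the quoted field and removes its surrounding quotes.
rows = lines.map { |line| CSV.parse_line(line) }
rows.each { |row| p row }
```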
Casimir et Hippolyte
  • Did you reopen the question? Why? – Wiktor Stribiżew Nov 02 '15 at 21:07
  • @stribizhev: Yes I do, because all solutions in the answers of the linked question are `,(?=stupid pattern to know if I am not between quotes until the end of the string)` (that stops to work if the string is a bit long). – Casimir et Hippolyte Nov 02 '15 at 21:10
  • You have answered almost the same question a day or two ago, why not link to YOUR answer? The questions like this are annoying, there must be a good duplicate original. – Wiktor Stribiżew Nov 02 '15 at 21:11
  • @CasimiretHippolyte `"10.1234/altetric55,Awesome Steel Chair,1011-2513".split(/[^,"]*(?:"[^"\\]*(?:\\.[^"\\]*)*"[^,"]*)*/)` returns `["", ",", ","]`. Am I missing something? – Gerard Morera Nov 02 '15 at 21:45
  • Using CSV parser is my suggestion. – Wiktor Stribiżew Nov 02 '15 at 21:47
  • @stribizhev: sorry, but I suggested this way independently to your comments. But to comfort you, I will upvote one of your answers. – Casimir et Hippolyte Nov 02 '15 at 21:51
  • @stribizhev Yes, CSV parser is fantastic! I changed the title of the post to make it more accessible to others. Thank you very much – Gerard Morera Nov 02 '15 at 21:55
  • Using CSV is the appropriate way of parsing the strings. A pattern only increases the maintenance problem, whereas CSV is well tested and handles all sorts of weird cases, plus provides a lot of added flexibility. – the Tin Man Nov 02 '15 at 21:59
  • @GerardMorera: yes, you miss something. As I said, you didn't use the good method. About the regex way, instead of using `split`, you must use `scan`. The csv class seems to be the best option for your case, however I don't always follow the religion of "well tested/already-written code/library/gem/module" that doesn't always fit the requirements or that is not always as flexible as we hope when requirements change a little. – Casimir et Hippolyte Nov 02 '15 at 22:21
  • If `str` is the second string in the question and your regex is `r`, I get `str.split(r) #=> ["", ","]`. Did you intend it to be used differently? – Cary Swoveland Nov 02 '15 at 23:07

Here's another way, using recursion:

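def split_it(str)
  outside_quotes = true
  # Index of the first comma that lies outside double quotes, or nil if none.
  pos = str.size.times.find do |i|
    case str[i]
    when '"'
      outside_quotes = !outside_quotes  # a quote toggles the state
      false
    when ','
      outside_quotes                    # split only on an unquoted comma
    else false
    end
  end
  # Keep the text before that comma, then recurse on the remainder.
  pos ? [str[0, pos], *split_it(str[pos + 1..-1])] : [str]
end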

["10.1234/altetric55,Awesome Steel Chair,1011-2513",
"\"Sporer, Kihn and Turner\",2885-6503\",,,3\"",
"Bartell-Collins,1167-8230"].map { |s| split_it(s) }
  #=> [["10.1234/altetric55", "Awesome Steel Chair", "1011-2513"],
  #    ["\"Sporer, Kihn and Turner\"", "2885-6503\",,,3\""],
  #    ["Bartell-Collins", "1167-8230"]]
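
For long lines, the same quote-toggling idea could also be written as a single pass without recursion. This is only a sketch under the same assumptions as `split_it` (quotes are kept in the output and backslash escapes are not handled); the method name `split_outside_quotes` is made up here.

```ruby
# Hypothetical iterative variant: walk the string once, starting a new
# field at every comma seen while outside double quotes.
def split_outside_quotes(str)
  fields = [+'']            # +'' gives a mutable empty string
  inside_quotes = false
  str.each_char do |ch|
    if ch == ',' && !inside_quotes
      fields << +''         # unquoted comma: begin the next field
    else
      inside_quotes = !inside_quotes if ch == '"'
      fields.last << ch     # everything else (quotes included) is kept
    end
  end
  fields
end

samples = [
  "10.1234/altetric55,Awesome Steel Chair,1011-2513",
  "\"Sporer, Kihn and Turner\",2885-6503\",,,3\"",
  "Bartell-Collins,1167-8230"
]
split_results = samples.map { |s| split_outside_quotes(s) }
split_results.each { |fields| p fields }
```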
Cary Swoveland