2

I am trying to split a csv string with comma as delimiter.

val string = "A,B,\"Hi,There\",C,D"

I cannot use string.split(",") because it would split "Hi,There" into two separate columns. Can I use a regex to solve this? I came across the scala-csv parser, which I don't want to use. I hope there is a better way to solve this problem. I know this is not a trivial problem, so it would be helpful if people could share their approaches.

Prince Bhatti
  • 2
    CSV parsing is not trivial as you realized. So it is probably too broad for this format to provide a complete answer, asking for tools is off-topic on SO, so I do not think this question is a good fit for this site. – Gábor Bakos Jun 19 '15 at 19:24
  • @Gábor You got me wrong. I'm not asking for a tool. I want people to share their logic or any good method here. I can use parsers like `scala-csv` for this task, but I want an open, logical way to approach this problem. I think this question will help many people if they share their approaches here. – Prince Bhatti Jun 19 '15 at 19:50
  • 2
    @COSTA The logical way to approach this is to use an established library because csv parsing is amazingly non-trivial. – Daenyth Jun 19 '15 at 19:58
  • If you're looking for an answer to go into how to make a CSV parser yourself, then the question should be edited to make that clear and to show what you've already tried and how it didn't work – Daenyth Jun 19 '15 at 20:04
  • 1
    Maybe related: http://stackoverflow.com/questions/18144431/regex-to-split-a-csv – Daenyth Jun 19 '15 at 20:05
  • If you want to write a parser yourself, an RFC for CSV is here: https://tools.ietf.org/html/rfc4180 (though not all CSVs will conform to this spec). You can write an LL(1) parser for it: https://en.wikipedia.org/wiki/LL_parser – Martijn Jun 19 '15 at 22:04
  • 1
    With Scala's parser combinators API, you could write a parser compliant with [RFC 4180](http://tools.ietf.org/html/rfc4180#page-2) with less than 30 LOC. But as already stated, you can use an existing parser to make your life easier. – Alexis C. Jun 20 '15 at 07:57
  • Thanks for sharing the RFC. This [link](http://stackoverflow.com/questions/5063022/use-scala-parser-combinator-to-parse-csv-files?rq=1) is also helpful. This much info is enough for a basic understanding of a parser. Now I'm off to write a new parser. – Prince Bhatti Jun 20 '15 at 17:50
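For readers who want to see the hand-written approach discussed in the comments above, here is a minimal sketch of a single-line splitter written as a small state machine. The names `CsvLineSplitter` and `splitCsvLine` are made up for illustration. It assumes one record per string, comma as the delimiter, and RFC 4180-style doubled quotes for a literal quote inside a quoted field. It is not a complete RFC 4180 parser: it does not handle fields that span lines and it silently accepts an unterminated quote.

```scala
// Hypothetical helper for illustration only; not taken from any library.
object CsvLineSplitter {

  /** Splits one CSV line on commas, treating commas inside double quotes as data. */
  def splitCsvLine(line: String): Vector[String] = {
    val fields   = Vector.newBuilder[String]
    val current  = new StringBuilder
    var inQuotes = false
    var i        = 0

    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == '"') {
          if (i + 1 < line.length && line.charAt(i + 1) == '"') {
            current.append('"') // a doubled quote inside quotes is a literal quote
            i += 1              // skip the second quote of the pair
          } else {
            inQuotes = false    // closing quote
          }
        } else {
          current.append(c)     // commas land here too, so they stay in the field
        }
      } else {
        c match {
          case '"' => inQuotes = true
          case ',' =>
            fields += current.toString // delimiter outside quotes ends the field
            current.clear()
          case other => current.append(other)
        }
      }
      i += 1
    }

    fields += current.toString // the last field has no trailing comma
    fields.result()
  }
}
```

With the string from the question, `CsvLineSplitter.splitCsvLine("A,B,\"Hi,There\",C,D")` gives `Vector(A, B, Hi,There, C, D)`. Even so, the advice in the comments still stands: real-world CSV has many more corner cases than this sketch covers.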

3 Answers

3

I agree with Jeronimo Backes: CSV parsing is not trivial, and it is much better to use a library than to reinvent the wheel.

Besides uniVocity-parsers there are some other more scala orientated libraries available (underlying parser indicated):

product-collections, my own project, is well tested against the same data as univocity and also against csv spectrum. It is strongly typed, reflection free and compatible with scala-js. It's tested for performance against most of the java equivalents.

The other projects listed all have their strengths. Scala-csv is very difficult to call from java without shims, so although I've tested it internally I was not able to make a pull request against csv-parsers-comparison.

Product-collections used to leverage opencsv but in order to make it scala-js compatible it now contains a native parser. The parser performs better than opencsv (speed, correctness) in all the scenarios I tested.

Mark Lister
2

Use the uniVocity-parsers CsvParser for that instead of parsing it by hand. CSV is much harder than you think, and there are many corner cases to cover. You just found one. In short, you NEED a library to read CSV reliably. uniVocity-parsers is used by other Scala projects (e.g. spark-csv).

I'll put an example using plain Java here, because I don't know Scala, but you'll get the idea:

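import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public class CsvExample {
    public static void main(String... args) {
        CsvParserSettings settings = new CsvParserSettings(); // many options here, check the documentation
        CsvParser parser = new CsvParser(settings);
        // parseLine handles the quoted field, so "Hi,There" stays one value
        String[] row = parser.parseLine("A,B,\"Hi,There\",C,D");
        for (String value : row) {
            System.out.println(value);
        }
    }
}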

Output:

A
B
Hi,There
C
D

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

josliber
Jeronimo Backes
-1

This regex covers your example, and possibly others, but it is certainly not robust:

(?:,?(".+?"))|(?:,?(.+?),?)

Here's a demo on regex101: https://regex101.com/r/wM7uW4/1
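As a rough illustration of how the pattern above could be applied from Scala, here is a small sketch. The object name `RegexCsvDemo` and the method `fields` are hypothetical; only the pattern itself comes from this answer. It shows both the case that works and one that does not.

```scala
// Hypothetical wrapper for illustration; the pattern is copied unchanged from the answer.
object RegexCsvDemo {
  private val Field = """(?:,?(".+?"))|(?:,?(.+?),?)""".r

  /** Returns each field matched by the pattern, in order of appearance. */
  def fields(line: String): List[String] =
    Field.findAllMatchIn(line).map { m =>
      // Group 1 is a quoted field (the quotes are kept); group 2 is an unquoted field.
      Option(m.group(1)).getOrElse(m.group(2))
    }.toList
}
```

For the question's input, `RegexCsvDemo.fields("A,B,\"Hi,There\",C,D")` returns the five fields `A`, `B`, `"Hi,There"` (with the surrounding quotes still attached), `C` and `D`. An empty field already breaks it: `RegexCsvDemo.fields("A,,B")` returns only `A` and `B`, so the empty middle column is lost.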

bjfletcher