2

I am trying to parse a comma-separated string using:

val array = input.split(",")

Then I noticed that some input lines have a "," inside quotation marks:

data0, "data1", data2, data3, "data4-1, data4-2, data4-3", data5

Note that the data is not very clean: some fields are inside quotation marks while others are not.

How do I split such a line into:

array(0) = data0
array(1) = data1
array(2) = data2
array(3) = data3
array(4) = data4-1, data4-2, data4-3
array(5) = data5
Martin Senne
Edamame
  • Parsing CSV files can be notoriously tricky because of how the format treats quotes, and because quoted values can themselves contain commas and quotes. I recommend pulling in a library which is well regarded for dealing robustly with all the edge cases. Options you could consider include [scala-csv](https://github.com/tototoshi/scala-csv) and [traversable-csv](http://labs.encoded.io/2012/04/09/reading-csv-files-in-scala-the-traversable-way/). Or use a Java library like [opencsv](http://opencsv.sourceforge.net/). – Shadowlands Sep 27 '15 at 03:19
  • Otherwise, if you don't want to or can't use a library, you could look at [this SO answer](http://stackoverflow.com/questions/5063022/use-scala-parser-combinator-to-parse-csv-files/5063652#5063652) or [this SO answer](http://stackoverflow.com/questions/32488364/whats-a-simple-scala-only-way-to-read-in-and-then-write-out-a-small-csv-file/32488453#32488453) to see how others have tackled roll-your-own CSV parsers. – Shadowlands Sep 27 '15 at 03:20
  • @Shadowlands Could you please summarize your comments in an answer ( as I think you have shown many valuable approaches, others can benefit from.) Thx. – Martin Senne Sep 27 '15 at 09:12
  • @MartinSenne Sure, happy to make it an answer (although I don't have anything much further to add). – Shadowlands Sep 27 '15 at 09:39

4 Answers

7

As per my comments:

Parsing CSV files can be notoriously tricky because of how the format treats quotes, and because quoted values can themselves contain commas and quotes. I recommend pulling in a library which is well regarded for dealing robustly with all the edge cases.

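Options you could consider include [scala-csv](https://github.com/tototoshi/scala-csv) and [traversable-csv](http://labs.encoded.io/2012/04/09/reading-csv-files-in-scala-the-traversable-way/). Or use a Java library like [opencsv](http://opencsv.sourceforge.net/).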

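Otherwise, if you don't want to or can't use a library, you could look at [this SO answer](http://stackoverflow.com/questions/5063022/use-scala-parser-combinator-to-parse-csv-files/5063652#5063652) or [this SO answer](http://stackoverflow.com/questions/32488364/whats-a-simple-scala-only-way-to-read-in-and-then-write-out-a-small-csv-file/32488453#32488453) to see how others have tackled roll-your-own CSV parsers.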
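
For a sense of what a roll-your-own parser involves, here is a minimal sketch in Java of a character-by-character splitter for a single line. It is not taken from either linked answer. The class and method names are made up for illustration, and it assumes that fields should be trimmed and that a doubled quote ("") inside a quoted field means a literal quote. It does not handle fields that span several lines.

```java
import java.util.ArrayList;
import java.util.List;

class CsvLineSplitter {

    // Splits one CSV line on commas that are outside double quotes.
    // Assumption: every field is trimmed, which suits the sample data
    // but is not what every CSV dialect expects.
    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote: keep one literal quote
                    i++;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote, not kept
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString().trim()); // separator ends the field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString().trim()); // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        String line = "data0, \"data1\", data2, data3, \"data4-1, data4-2, data4-3\", data5";
        List<String> fields = splitLine(line);
        for (int i = 0; i < fields.size(); i++) {
            System.out.println("array(" + i + ") = " + fields.get(i));
        }
    }
}
```

Even this small version has to track quote state and escaped quotes, which is a good part of why a library is usually the safer choice.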

Shadowlands
  • Thanks Shadowlands! I would like to use a library if possible. One problem is that instead of having "one csv file", I will get "csv lines" that I need to parse. Is there any library that could parse a line instead of an entire file? Thank you! – Edamame Sep 27 '15 at 14:32
0

I would recommend using a CSV library to parse CSV data - the format is a mess and painful to get right.

I would suggest kantan.csv, mainly because I'm the author but also because it lets you go a bit further than turning a CSV stream into a list of arrays of strings. Take, for example, the following input:

1,Foo,2.0
2,Bar,false

Using kantan.csv, you can write:

import java.io.File
import kantan.csv.ops._

new File("path/to/csv").asUnsafeCsvRows[(Int, String, Either[Float, Boolean])](',', false)

Calling toList on the result will yield:

List((1,Foo,Left(2.0)), (2,Bar,Right(false)))

Note how the last column is either a float or a boolean, but this is captured in the type of each element of the iterator.

Nicolas Rinaudo
0

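Below is my solution for parsing a CSV row. Note that it splits on ";" rather than ",", and it does not handle separators that appear inside quoted fields: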

String[] res = row.split(";");
for (int i = 0; i < res.length; i++) {
    res[i] = deQuotes(res[i]);
}
return res;

The surrounding quotes are removed with a regular expression:

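import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches a field that is wrapped in double quotes and captures its content.
static final Pattern PATTERN_DE_QUOTES = Pattern.compile("^\"(.*)\"$");

static String deQuotes(String s) {
    Matcher matcher = PATTERN_DE_QUOTES.matcher(s);
    if (matcher.find()) {
        // A doubled quote inside a quoted field stands for one literal quote.
        return matcher.group(1).replace("\"\"", "\"");
    }
    return s;
}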
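
The split-then-unquote approach above breaks when the separator appears inside a quoted field, as in the question's sample line. One possible workaround, sketched here under assumptions, is to split only on separators that are followed by an even number of double quotes. The class name `QuoteAwareSplit` and method name `parseRow` are made up for illustration, the separator is "," as in the question, and fields are assumed to be trimmed. It relies on quotes being balanced within the line.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplit {

    // Matches a comma only when an even number of double quotes follows it,
    // which means the comma is not inside a quoted field.
    static final String COMMA_OUTSIDE_QUOTES = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static List<String> parseRow(String row) {
        List<String> out = new ArrayList<>();
        // The limit of -1 keeps trailing empty fields.
        for (String raw : row.split(COMMA_OUTSIDE_QUOTES, -1)) {
            String field = raw.trim();
            if (field.length() >= 2 && field.startsWith("\"") && field.endsWith("\"")) {
                // Drop the surrounding quotes and collapse doubled quotes.
                field = field.substring(1, field.length() - 1).replace("\"\"", "\"");
            }
            out.add(field);
        }
        return out;
    }

    public static void main(String[] args) {
        String row = "data0, \"data1\", data2, data3, \"data4-1, data4-2, data4-3\", data5";
        System.out.println(parseRow(row));
    }
}
```

The lookahead rescans the rest of the line for every comma, so this is fine for short lines but scales poorly on very long ones.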

I hope it will help you.

Dmytro Sokolyuk
-1

You can actually split that line with a regular expression.

val s = """data0, "data1", data2, data3, "data4-1, data4-2, data4-3", data5"""

"""((".*?")|('.*?')|[^"',]+)+""".r.findAllIn(s).foreach(println)

By the way, any library that can parse CSV files can also parse a single CSV line: just wrap the string in a StringReader.
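
To illustrate the StringReader point without depending on a particular library, here is a small Java sketch. The `readAll` method is a hypothetical stand-in for a library parser that accepts a `java.io.Reader`. It uses a naive `split(",")` only to keep the example short, so it is not a real CSV parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ReaderSourceDemo {

    // Hypothetical stand-in for a library parser: it only sees a Reader,
    // so it cannot tell whether the text came from a file or a String.
    static List<String[]> readAll(Reader source) {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                rows.add(line.split(",")); // naive split, for illustration only
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // A single in-memory line goes through the same code path as a file would.
        List<String[]> rows = readAll(new StringReader("data1,data2,data3"));
        System.out.println(String.join(" | ", rows.get(0)));
    }
}
```

Passing a `FileReader` instead of the `StringReader` would exercise exactly the same method, which is why a file-oriented parser can be reused for individual lines.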

Tesseract
  • Thanks! Would you please elaborate a bit more about "any library that can parse csv files can also parse a single csv line". For example, how do I modify the following file parser to parse a single csv line? CSVReader reader = new CSVReader(new FileReader("yourfile.csv")); – Edamame Sep 28 '15 at 05:29
  • That should work like this `CSVReader reader = new CSVReader(new StringReader("data1,data2,data3"))` – Tesseract Sep 28 '15 at 14:24