
I have just received a bunch of messy data files in CSV (Comma Separated Value) format. I need to do some routine cleanup, validation, and filtering work on the data set. I will be doing the cleanup in Scala (2.11.7).

In my search for a solution covering both directions, input parsing and output composing, I mostly found ill-informed tangents on the input parsing side, including one from the "Scala Cookbook". Most of those focused on the terribly erroneous suggestion to use String.split(",") to get a CSV line back as a List[String]. And I found almost nothing on the output composing side.

What nice, simple Scala code snippets exist which can easily do the CSV round trip described above? I'd like to avoid importing an entire library just to pick up these two functions (and using a Java library is not an acceptable option for my business requirements at this time).

  • This question isn't a duplicate, as the question explicitly specifies desiring a solution without using a library (especially a Java library). BTW, the question dhg references was closed as "not constructive" for StackOverflow, which hardly makes this question a duplicate. – chaotic3quilibrium Sep 10 '15 at 02:10
  • @dhg, I agree with chaotic3quilibrium--not a duplicate. This is the most useful question on CSV files and Scala I've seen. – Phasmid Sep 10 '15 at 11:12
  • Ok, fine. But it still seems like a bad case of "reinventing the wheel" to me, and far less safe than using an existing, proven solution to a tricky problem. And personally I see no difference between a Scala library and a Java library + thin Scala wrapper when it comes to something like this. After all, one of the advantages to using Scala in the first place is that we get to take advantage of the wealth of pre-existing Java libraries. – dhg Sep 10 '15 at 18:58
  • @dhg I think I better see your point. I agree that by far the best case is to choose to use an existing (implied robust) CSV implementation rather than a code snippet (implied fragile). And I will likely choose that option in the future (again, tysvm for providing the links, which I added to my answer). However, for those of us that are currently using StringOps.split(",") in our day-to-day grunt work (which includes putting up with architects preventing additional libraries from being added to their project), using the code snippets I provided is at least a better option. – chaotic3quilibrium Sep 10 '15 at 19:40
  • Use https://github.com/scala/scala-parser-combinators. It's still part of the standard library but it has been separated. – I.K. Sep 10 '15 at 19:48
  • @I.K. While I appreciate the suggestion to use the Scala Parser Combinators, it is well above my current Scala skill set. I am looking forward to digging into them at some point, but for me it looks more than a year out. – chaotic3quilibrium Sep 11 '15 at 02:33
  • Because of so many issues, even in the presence of an RFC for the .csv MIME-type, I strongly suggest you use a well-maintained, RFC-driven, native Scala library which optimally handles this problem, kantan.csv: nrinaudo.github.io/kantan.csv – chaotic3quilibrium Aug 30 '20 at 20:35

1 Answer


UPDATE 2020/08/30: Please use the Scala library kantan.csv for the most accurate and correct implementation of RFC 4180, which defines the .csv MIME-type.

While I enjoyed the learning process of creating the solution below, please refrain from using it, as I have found a number of issues with it, especially at scale. To avoid the obvious technical debt arising from my solution below, choose a well-maintained, RFC-driven, native Scala solution to take care of your current and future clients.
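
For reference, here is a rough sketch of what the kantan.csv round trip looks like, based on my reading of its documented examples; treat the exact method names (unsafeReadCsv, asCsv) and the rfc configuration value as assumptions to verify against the kantan.csv documentation rather than as a definitive usage guide.

import kantan.csv._      // core types plus the rfc (RFC 4180) configuration value
import kantan.csv.ops._  // enriches String, File, etc. with CSV syntax

val rawData = "a,\"b,1\",c\nd,e,f"

// Decode every row into a List[String]; the "unsafe" variant throws on malformed rows,
// while the plain readCsv variant would wrap each row in a ReadResult instead.
val rows: List[List[String]] = rawData.unsafeReadCsv[List, List[String]](rfc)

// Re-encode the rows as an RFC 4180 compliant CSV string.
val recomposed: String = rows.asCsv(rfc)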


I have created specific CSV related functions from which a more general solution can be composed.

It turns out that attempting to parse a CSV file is quite tricky due to anomalies around both the comma (,) and the double quote ("). The rules for a CSV are: if a column value contains either a comma or a double quote, the entire value must be placed in double quotes; and any double quote appearing within the value must be escaped by inserting an additional double quote immediately in front of it. This is one of the reasons why the oft-cited StringOps.split(",") method simply doesn't work unless one can guarantee they will never encounter a file using the comma/double quote escaping rules. And that's a very unreasonable guarantee.
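
To make these escaping rules concrete, here is a small illustration of my own (not taken from any particular specification document) of a raw CSV line and the field values it represents; the Parser defined later in this answer decomposes it exactly this way.

// Raw CSV line, exactly as it would appear in the file:
//   "Hello, World",simple,"She said ""Hi"""
val rawCsvLine = "\"Hello, World\",simple,\"She said \"\"Hi\"\"\""

// The three field values the escaping rules say it represents:
//   Hello, World  |  simple  |  She said "Hi"
val expectedFields = List("Hello, World", "simple", "She said \"Hi\"")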

Additionally, consider that there can be characters between a valid comma separator and the start of a double-quoted value, or between the closing double quote and the next comma (or the end of the line). The rule used here to address this is that those outside-the-double-quote-bounds characters are discarded. This is yet another reason a simple StringOps.split(",") is not only an insufficient answer, but actually incorrect.
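
Here is another tiny illustration of my own (hypothetical input, checked against the Parser below): the stray characters sitting outside the double quotes are simply dropped.

// The characters x and y sit outside the double-quoted value and are discarded:
val rawCsvLine2 = "x\"a\"y,b"         // i.e. the raw line  x"a"y,b
val expectedFields2 = List("a", "b")  // what the rules above (and the Parser below) yield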


One final note about an unexpected behavior I found using StringOps.split(","). Do you know what value result has in this code snippet?

val result = ",,".split(",")

If you guessed "result references an Array[String] containing three elements, each of which is an empty String", you would be incorrect. result references an empty Array[String]. And for me, an empty Array[String] isn't the answer I was expecting or needed. So, for the love of all that is Holy, please Please PLEASE put the final nail in the StringOps.split(",") coffin!
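
(For completeness: this behavior comes from the underlying java.lang.String.split, which drops trailing empty strings when called with a single argument. Passing a negative limit keeps them, as sketched below, though even that still does nothing to handle the quoting rules described above.)

",,".split(",")      //=> Array()            - trailing empty strings are removed
",,".split(",", -1)  //=> Array("", "", "")  - a negative limit keeps them
"a,\"b,1\",c".split(",", -1)
                     //=> Array("a", "\"b", "1\"", "c")  - still breaks on quoted commas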


So, let's start with the already-read-in file, which is being presented as a List[String]. Below, in object Parser, is a general solution with two functions: fromLine and fromLines. The latter function, fromLines, is provided for convenience and merely maps across the former function, fromLine.

object Parser {
  def fromLine(line: String): List[String] = {
    def recursive(
        lineRemaining: String
      , isWithinDoubleQuotes: Boolean
      , valueAccumulator: String
      , accumulator: List[String]
    ): List[String] = {
      if (lineRemaining.isEmpty)
        valueAccumulator :: accumulator
      else
        if (lineRemaining.head == '"')
          if (isWithinDoubleQuotes)
            if (lineRemaining.tail.nonEmpty && lineRemaining.tail.head == '"')
              //escaped double quote
              recursive(lineRemaining.drop(2), true, valueAccumulator + '"', accumulator)
            else
              //end of double quote pair (ignore whatever's between here and the next comma)
              recursive(lineRemaining.dropWhile(_ != ','), false, valueAccumulator, accumulator)
          else
            //start of a double quote pair (ignore whatever's in valueAccumulator)
            recursive(lineRemaining.drop(1), true, "", accumulator)
        else
          if (isWithinDoubleQuotes)
            //scan to next double quote
            recursive(
                lineRemaining.dropWhile(_ != '"')
              , true
              , valueAccumulator + lineRemaining.takeWhile(_ != '"')
              , accumulator
            )
          else
            if (lineRemaining.head == ',')
              //advance to next field value
              recursive(
                  lineRemaining.drop(1)
                , false
                , ""
                , valueAccumulator :: accumulator
              )
            else
              //scan to next double quote or comma
              recursive(
                  lineRemaining.dropWhile(char => (char != '"') && (char != ','))
                , false
                , valueAccumulator + lineRemaining.takeWhile(char => (char != '"') && (char != ','))
                , accumulator
              )
    }
    if (line.nonEmpty)
      recursive(line, false, "", Nil).reverse
    else
      Nil
  }
  
  def fromLines(lines: List[String]): List[List[String]] =
    lines.map(fromLine)
}
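
A quick usage sketch (with example values of my own) showing what Parser.fromLine and Parser.fromLines return for lines mixing plain and quoted fields:

Parser.fromLine("a,\"b,1\",c")
//=> List("a", "b,1", "c")

Parser.fromLines(List("a,b", "\"x,y\",z"))
//=> List(List("a", "b"), List("x,y", "z"))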

To validate that the above code works for all the various weird input scenarios, some test cases need to be created. So, using the Eclipse ScalaIDE Worksheet, I created a simple set of test cases where I could visually verify the results. Here are the Worksheet contents.

  val testRowsHardcoded: List[String] = {
    val superTrickyTestCase = {
      val dqx1 = '"'
      val dqx2 = dqx1.toString + dqx1.toString
      s"${dqx1}${dqx2}a${dqx2} , ${dqx2}1${dqx1} , ${dqx1}${dqx2}b${dqx2} , ${dqx2}2${dqx1} , ${dqx1}${dqx2}c${dqx2} , ${dqx2}3${dqx1}"
    }
    val nonTrickyTestCases =
"""
,,
a,b,c
a,,b,,c
 a, b, c
a ,b ,c
 a , b , c
"a,1","b,2","c,2"
"a"",""1","b"",""2","c"",""2"
 "a"" , ""1" , "b"" , ""2" , "c"",""2"
""".split("\n").tail.toList
    (superTrickyTestCase :: nonTrickyTestCases.reverse).reverse
  }
  val parsedLines =
    Parser.fromLines(testRowsHardcoded)
  parsedLines.map(_.mkString("|")).mkString("\n")

I visually verified that the tests completed correctly and left me with accurately decomposed raw strings. So, I now had what I needed for the input parsing side and could begin my data refining.

After data refining was completed, I needed to be able to compose output so I could send my refined data back out, reapplying all the CSV encoding rules.

So, let's start with a List[List[String]] as the source of the refinements. Below, in object Composer, is a general solution with two functions: toLine and toLines. The latter function, toLines, is provided for convenience and merely maps across the former function, toLine.

object Composer {
  def toLine(line: List[String]): String = {
    def encode(value: String): String = {
      if ((value.indexOf(',') < 0) && (value.indexOf('"') < 0))
        //no commas or double quotes, so nothing to encode
        value
      else
        //found a comma or a double quote,
        //  so double all the double quotes
        //  and then surround the whole result with double quotes
        "\"" + value.replace("\"", "\"\"") + "\""
    }
    if (line.nonEmpty)
      line.map(encode(_)).mkString(",")
    else
      ""
  }

  def toLines(lines: List[List[String]]): List[String] =
    lines.map(toLine)
}
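
And a matching usage sketch (again, example values of my own) showing how Composer.toLine and Composer.toLines re-apply the quoting rules:

Composer.toLine(List("a", "b,1", "she said \"hi\""))
//=> a,"b,1","she said ""hi"""

Composer.toLines(List(List("a", "b"), List("x,y", "z")))
//=> List("a,b", "\"x,y\",z")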

To validate that the above code works for all the various weird input scenarios, I reused the test cases I used for Parser. Again, using the Eclipse ScalaIDE Worksheet, I added a bit more code below my existing code where I could visually verify the results. Here's the code I added:

val composedLines =
  Composer.toLines(parsedLines)
composedLines.mkString("\n")
val parsedLines2 =
  Parser.fromLines(composedLines)
parsedLines == parsedLines2

When the Scala Worksheet is saved, it executes its contents. The very last line should show a value of "true". It is the result of round-tripping all the test cases through the parser, through the composer, and back through the parser.

BTW, it turns out there is a ton of variation around the definition of a "CSV file". So, here's the source for the rules the code above enforces.

PS. Thanks to @dhg for pointing it out: there is a CSV Scala library which handles parsing CSVs, just in case you want something which is likely more robust and has more options than my Scala code snippets above.

  • Could you add your source regarding CSV escaping rules? I'm not sure that what you describe is universal, so it would be nice to know what software produces the CSVs your code parses. I'd assume MS Excel and compatible software? Anyway, thanks for documenting this here! – Silly Freak Sep 09 '15 at 20:52
  • @SillyFreak Tysvm for requesting that. I just appended a paragraph to cover both CSV being a loose definition AND the basis for the rules I used. – chaotic3quilibrium Sep 09 '15 at 21:01
  • Seems like an absurd amount of effort just to avoid using a pre-existing library. Not to mention that an existing library has already thought of all the corner cases and is thoroughly tested. – dhg Sep 09 '15 at 21:54
  • @dhg Do you have a Scala library you would suggest? It needs to handle both parsing input and composing output. I wasn't able to find anything obvious. – chaotic3quilibrium Sep 09 '15 at 22:08
  • Just a few from a simple web search: https://commons.apache.org/proper/commons-csv/, http://opencsv.sourceforge.net/, http://super-csv.github.io/super-csv/, https://github.com/tototoshi/scala-csv, ... For the Java ones, if you want something scala-y, you can just write a thin wrapper and not have to worry about all the complex internals. – dhg Sep 10 '15 at 00:31
  • I definitely am not interested in a Java solution. And I just had a deep look at the one Scala library you listed. And while his implementation has more options (and quite a bit more complexity, too), I don't like the core implementation (very mutable/var/Java-like) as opposed to my much simpler immutable FP above. – chaotic3quilibrium Sep 10 '15 at 01:00
  • @dhg You would be surprised, I think, at how many of us cannot find anything related to parsing a CSV in Scala that isn't the simple StringOps.split(",") technique. I know I spent half an hour trying to find something and then finally just wrote something myself (which took less than another half an hour). I then realized writing this explicit SO Q/A would at least anchor people to a place where they were explicitly informed that using StringOps.split(",") was a very bad idea. Hence, my spending the time to save lots of other Scala coders time. – chaotic3quilibrium Sep 10 '15 at 01:13
  • Thanks for this, @chaotic3quilibrium. I agree this is useful and, moreover, instructive. Is it normal StackOverflow practice though to ask and answer your own question? Do you get points for answering just as if you answered someone else's question? – Phasmid Sep 10 '15 at 11:09
  • @Phasmid I don't know, and honestly don't care. It is vastly more costly for me to have to spend hours looking for a (robust) solution, or fixing someone else's broken library (assuming it's even possible), or even writing my own solution (only to forget about it when I need it on a future project), than whatever arbitrary non-financial point-based reward system is behind SO (StackOverflow). And SO encourages both posting questions and providing your own answer. I've done this before and ended up choosing a different answer than mine because it was better. – chaotic3quilibrium Sep 10 '15 at 15:09
  • @chaotic3quilibrium Not to debate anything you said, but you could ensure you always have the minimum number of elements in the array with `val result = ",,".split(",").padTo(3, "")`. (So far I could use it because I had to guarantee that the CSV would be minimalistic. Now I'm in the process of evaluating CSV parser libraries to support the RFC.) – István Oct 07 '21 at 13:30