2

I'm totally new to Scala, and am trying to parse a CSV file that has carriage return/new line/and other special characters like comma in some of the cells (i.e. within double quotations), for example:

"A","B","C\n,FF\n","D"\n
"Q","W","E","R\n\n"\n
"1","2\n","2","2,2\n"\n

I want to load this into a list of lists type in Scala, like the following:

List(List("A","B","C,FF","D"),List("Q","W","E","R"),List("1","2","2","2,2"))

Any suggestions how it can be done?

I have found some solutions for the same problem in other languages. For example this is a great one in Python, which I understand well: Handling extra newlines (carriage returns) in csv files parsed with Python?

My try:

val src2 = Source.fromFile("sourceFileName.csv")
val it =src2.getLines()
val data = for (i<-it) yield i.replace("\"","").split(",")

But it looks like all carriage returns are seen as new lines.

Community
  • 1
  • 1
Alt
  • 2,597
  • 5
  • 26
  • 36
  • 1
    Are there actually new lines in the strings, or do they contain the string literal "\n" (backslash followed by 'n')? – Ben Reich Jul 02 '15 at 19:26
  • It should be either '\n' or '\r' (or some other new line character like ctrl M), but what I can see is that src2.getLines() returns broken lines (where the lines are broken where the cells have the new line character) – Alt Jul 02 '15 at 19:32
  • @BenReich: They contain newlines. – Alt Jul 02 '15 at 19:43

3 Answers3

2

It seems to me that if the actual cells contain newlines, then you'll need to keep some state while traversing getLines. You can do this using a foldLeft or similar operator. If the file is small enough, you can also use mkString to get the whole file as a string in memory and then operate on that. The following simplified version assumes that every cell is surrounded by quotes. For example:

val converted = Source.fromFile(sourceFileName).mkString.replaceAll("\n", "").replaceAll("\"\"", "\"\n\"")

First, we're removing all new lines. Then, the true new lines will manifest as two quotes in a row (since otherwise there would be a comma separating the quotes), so we add back the new lines between the quotes. Then we should have a normalized version of the file, and we can procede with simple operations:

converted.split("\n").map(_.split(",").map(_.replaceAll("\"", "")))
Ben Reich
  • 16,222
  • 2
  • 38
  • 59
  • I think, I should do some sort of regular expression parsing to only remove the \n's that are within double quotes. Your solution-for example-doesn't work for the example csv file that I have given in the question statement. – Alt Jul 02 '15 at 20:13
  • The pattern will be "\".*\\n.*\"".r, but I don't know how I can only remove \n's and keep the rest of the pattern intact. I think I should find them first and then iterate through them in the original text. But this looks like an extremely "non"-functional approach. I'm trying to find a better solution. – Alt Jul 02 '15 at 20:16
  • I don't understand how this doesn't work for the example provided. If I replicate that file (and use a newline whenever there is a '\n' in the file), then this approach works for me. What results are you getting? It might help if you clarified the contents of the file, since it works with this approach according to my understanding of the file contents. – Ben Reich Jul 02 '15 at 20:38
  • You are right, it works for the example that I have provided, but one thing that I hadn't seen was that the not only the cells contain line breaks, but also they contain comma's. I have come up with a solution that is very much similar to yours, and now works fine. Thanks any ways. – Alt Jul 02 '15 at 21:18
  • @Alt Glad I could help. I'm confused as to why you checked the other answer, which you wrote in the comments does not solve your problem. I also wonder if your new answer is different enough from mine to warrant its own answer (as opposed to a comment here), especially considering your problem as stated did not include the extra details added in your answer, but of course that's up to your judgement. – Ben Reich Jul 02 '15 at 22:00
  • I meant to choose yours, that was a mistake. I have to add those extra details to the question. Then please change your answer accordingly. – Alt Jul 02 '15 at 22:14
  • @Alt It's totally fine to have your own answer with the extra details – I was just confused by the check :) – Ben Reich Jul 02 '15 at 22:15
1

The easiest way is to filter the blank lines out and trim the extra whitespace:

val src2 = Source.fromFile(sourceFileName)
val it = src2.getLines()
val data = for (i<-it if !i.trim.isEmpty) yield i.trim.replace("\"","").split(",")
Karol S
  • 9,028
  • 2
  • 32
  • 45
  • Only some of them will be empty (the ones that have two or more '\n's). This doesn't solve the problem. – Alt Jul 02 '15 at 19:01
0

I found a solution. It's similar to Ben Reich's solution, but takes care of any other special character inside the double quotations:

Source.fromFile(sourceFileName).mkString.split("\"\n\"").map(_.split("\",\""))
Alt
  • 2,597
  • 5
  • 26
  • 36