First of all, credits: this code is based on the solution from the question "Use Scala parser combinator to parse CSV files".

The CSV files I want to parse can contain comments, i.e. lines starting with #. And to avoid confusion: the CSV files are tab-separated. There are more constraints that would make the parser a lot simpler, but since I am completely new to Scala I thought it best to stay as close to the (working) original as possible.

The problem I have is that I get a type mismatch. Obviously the regex for a comment does not yield a list. I was hoping that Scala would interpret a comment as a 1-element-list, but this is not the case.

So how do I need to modify my code so that it can handle these comment lines? And closely related: is there an elegant way to query the parser result, so that in myfunc I can write something like

if (isComment(a)) continue

So here is the actual code:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import scala.util.parsing.combinator._

object MyParser extends RegexParsers {

    override val skipWhitespace = false   // meaningful spaces in CSV

    def COMMA   = ","
    def TAB     = "\t"
    def DQUOTE  = "\""
    def HASHTAG = "#"
    def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }  // combine 2 dquotes into 1
    def CRLF    = "\r\n" | "\n"
    def TXT     = "[^\",\r\n]".r
    def SPACES  = "[ ]+".r

    def file: Parser[List[List[String]]] = repsep((comment|record), CRLF) <~ (CRLF?)
    def comment: Parser[List[String]] = HASHTAG<~TXT
    def record: Parser[List[String]] = "[^#]".r<~repsep(field, TAB)
    def field: Parser[String] = escaped|nonescaped

    def escaped: Parser[String] = {
        ((SPACES?)~>DQUOTE~>((TXT|COMMA|CRLF|DQUOTE2)*)<~DQUOTE<~(SPACES?)) ^^ {
            case ls => ls.mkString("")
        }
    }
    def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }

    def applyParser(s: String) = parseAll(file, s) match {
        case Success(res, _) => res
        case e => throw new Exception(e.toString)
    }

    def myfunc( a: (String, String)) = {
        val parserResult = applyParser(a._2)
        println("APPLY PARSER FOR " + a._1)
        for( a <- parserResult ){
            a.foreach { println }
        }
    }

    def main(args: Array[String]) {
        val filesPath = "/home/user/test/*.txt"
        val conf = new SparkConf().setAppName("Simple Application")
        val sc = new SparkContext(conf)
        val logData = sc.wholeTextFiles(filesPath).cache()
        logData.foreach( x => myfunc(x))
    }
}
flowit
  • 36

1 Answer

Since the comment parser and the record parser are combined with | ("or-ed" together), they must have the same result type.
You need to make the following changes:

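def comment: Parser[List[String]] = HASHTAG <~ "[^\r\n]*".r ^^^ {List()}

By using ^^^ we replace the result of the parser (which is the result returned by the HASHTAG parser) with an empty List. The rest of the comment line is matched with its own regex, because TXT matches only a single character and rejects commas and quotes.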
Also change:

def record: Parser[List[String]] = repsep(field, TAB)

Note that because comment and record parser are or-ed and because comment comes first, if the row begins with a "#" it will be parsed by the comment parser.
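
To see the two changes working together, here is a minimal self-contained sketch. It is not the parser from the question: the object name TabParserSketch and the helper parse are made up for illustration, the escaped-field handling is left out, and field is assumed to be any run of characters without a tab or line break. It also uses opt(CRLF) in place of the postfix (CRLF?), which should behave the same without needing the postfixOps language import.

```scala
import scala.util.parsing.combinator._

// Hypothetical, trimmed-down parser; names are for illustration only.
object TabParserSketch extends RegexParsers {
  override val skipWhitespace = false   // whitespace is significant here

  def TAB: Parser[String]  = "\t"
  def CRLF: Parser[String] = "\r\n" | "\n"

  // Assumption: a field is any (possibly empty) run without tab or line break.
  def field: Parser[String] = "[^\t\r\n]*".r
  // Everything up to the end of the line.
  def restOfLine: Parser[String] = "[^\r\n]*".r

  // ^^^ throws away what was parsed and yields the given constant instead.
  def comment: Parser[List[String]] = "#" ~> restOfLine ^^^ List.empty[String]
  def record: Parser[List[String]]  = repsep(field, TAB)

  // comment is tried first, so a line starting with '#' never reaches record.
  def file: Parser[List[List[String]]] =
    repsep(comment | record, CRLF) <~ opt(CRLF)

  def parse(s: String): List[List[String]] = parseAll(file, s) match {
    case Success(res, _) => res
    case failure         => sys.error(failure.toString)
  }
}
```

With this sketch, TabParserSketch.parse("a\tb\n# note\nc\td") should yield one inner list per line, with an empty inner list standing in for the comment line.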

Edit:
To keep the comment text in the parser output (say, if you want to print it later) while still combining the two parsers with |, give both parsers a common result type.
Define the following classes:

trait Line
case class Comment(text: String) extends Line
case class Record(elements: List[String]) extends Line

Now define the comment, record and file parsers as follows (the comment text is matched with its own regex, because TXT matches only a single character):

val comment: Parser[Comment] = "#" ~> "[^\r\n]*".r ^^ Comment
val record: Parser[Line] = repsep(field, TAB) ^^ Record
val file: Parser[List[Line]] = repsep(comment | record, CRLF) <~ (CRLF?)

Now you can define the printing function myfunc:

def myfunc( a: (String, String)) = {
  parseAll(file, a._2).map { lines =>
   lines.foreach{
     case Comment(t) => println(s"This is a comment: $t")
     case Record(elems) => println(s"This is a record: ${elems.mkString(",")}")
   }
  }
}
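
The same pattern match also covers the "if (isComment(a)) continue" idea from the question: instead of testing and skipping, you can collect only the records. The following is a minimal self-contained sketch under the same simplifying assumptions as before (no escaped fields, a field is any run without tab or line break). The names LineParserSketch, restOfLine and records are hypothetical, and Line is declared sealed here so the compiler can check that a match covers both cases.

```scala
import scala.util.parsing.combinator._

sealed trait Line
case class Comment(text: String) extends Line
case class Record(elements: List[String]) extends Line

// Hypothetical, trimmed-down parser; names are for illustration only.
object LineParserSketch extends RegexParsers {
  override val skipWhitespace = false

  def TAB: Parser[String]  = "\t"
  def CRLF: Parser[String] = "\r\n" | "\n"

  // Assumption: a field is any (possibly empty) run without tab or line break.
  def field: Parser[String]      = "[^\t\r\n]*".r
  def restOfLine: Parser[String] = "[^\r\n]*".r

  // ^^ transforms the parsed value, so the comment text is kept.
  def comment: Parser[Line] = "#" ~> restOfLine ^^ (text => Comment(text))
  def record: Parser[Line]  = repsep(field, TAB) ^^ (fields => Record(fields))
  def file: Parser[List[Line]] = repsep(comment | record, CRLF) <~ opt(CRLF)

  // Keep only the records; comment lines are dropped by collect.
  def records(input: String): List[List[String]] =
    parseAll(file, input) match {
      case Success(lines, _) => lines.collect { case Record(elems) => elems }
      case failure           => sys.error(failure.toString)
    }
}
```

Note that the comment text is everything after the "#", including any leading space, so you may want to trim it before printing.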
Vered Rosen
  • I mark it as accepted. Can you answer my second question, too? If there is an elegant way to find out if my List[String] I want to print is a comment or an actual record? Currently it seems that comments are somehow skipped. Is this because we are converting the comment parser result to an empty list? – flowit Oct 14 '15 at 11:29
  • Yes, since a comment line will be parsed to an empty List, the parsed result will return `List[List[String]]` with an empty inner List for the comment lines. Thus the internal `foreach` will not do anything as the iterated List is empty. – Vered Rosen Oct 14 '15 at 12:34
  • Great to know I drew the right conclusions. But still: if I want the comment to be parsed rather than thrown away (which is what returning an empty list does), how would I accomplish this? – flowit Oct 14 '15 at 13:14