6

I have had a similar problem before, but I am looking for a generalizable answer. I am using spark-corenlp to get sentiment scores on e-mails. Sometimes, sentiment() crashes on some input (maybe it's too long, maybe it contains an unexpected character). It does not tell me it crashed on those instances; it just returns the Column sentiment('email). Thus, when I try to show() beyond a certain point or save() my data frame, I get a java.util.NoSuchElementException, because sentiment() must have returned nothing at that row.

My initial code loads the data and applies sentiment() as shown in the spark-corenlp API.

    val customSchema = StructType(Array(
      StructField("contactId", StringType, true),
      StructField("email", StringType, true)))

    // Load dataframe
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter", "\t")         // Delimiter is tab
      .option("parserLib", "UNIVOCITY")  // Parser, which deals better with the email formatting
      .schema(customSchema)              // Schema of the table
      .load("emails")                    // Input file

    val sent = df.select('contactId, sentiment('email).as('sentiment)) // Add sentiment analysis output to dataframe

I tried to filter for null and NaN values:

    val sentFiltered = sent.filter('sentiment.isNotNull)
      .filter(!'sentiment.isNaN)
      .filter(col("sentiment").between(0, 4))

I even tried to do it via SQL query:

    sent.registerTempTable("sent")
    val test = sqlContext.sql("SELECT * FROM sent WHERE sentiment IS NOT NULL")

I don't know what input is making spark-corenlp crash. How can I find out? Otherwise, how can I filter these non-existing values out of col("sentiment")? Or should I try catching the exception and ignoring the row? Is that even possible?
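One way to pursue the try/catch idea (a sketch, not spark-corenlp's actual API): wrap the crash-prone scoring function so a failure becomes None instead of an exception, then register that wrapper as a UDF and filter out the resulting nulls. Here `safely` is a hypothetical helper, and `scoreSentiment` in the commented usage is a stand-in name, not a real spark-corenlp function.

```scala
import scala.util.Try

// Generic wrapper: turn any crash-prone A => B into an A => Option[B],
// so a bad row yields None instead of an exception at show()/save() time.
def safely[A, B](f: A => B): A => Option[B] =
  a => Try(f(a)).toOption

// Hypothetical Spark usage (scoreSentiment stands in for whatever
// spark-corenlp runs under the hood):
//   val safeSentiment = udf(safely((text: String) => scoreSentiment(text)))
//   val sent = df.select('contactId, safeSentiment('email).as("sentiment"))
//                .filter('sentiment.isNotNull)
```

Because the UDF returns Option, Spark stores null for the failed rows, and the isNotNull filter from above then actually has something to catch.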

Béatrice Moissinac
    I have solved my current problem by just increasing the precision of my parser/cleaner, but the question persist for a general setting. – Béatrice Moissinac Jul 06 '16 at 22:49
  • Hi @Béatrice Moissinac, I have exactly the same issue, how did you solve it? can you share the code pls? my code is here : [error](http://stackoverflow.com/questions/39983486/spark-2-0-1-write-error-caused-by-java-util-nosuchelementexception) – elcomendante Oct 12 '16 at 13:12
  • No, only improving the cleanliness of the input has worked so far. If I had time, I would modify function.sentiment() to throw an error instead of returning the row :D – Béatrice Moissinac Oct 12 '16 at 15:58
  • thx, I have cleaned it before, I would prefer to keep oryginal text as much as poss though _> i think better results are yield when mulpiple sentences are inputed as one message, will look it more in depth, not sure where is the issue in code: 1. in sentiment function call while assigning integer value to each message? does it attach multiple values? maybe it is an array, will look it up – elcomendante Oct 12 '16 at 16:18
  • What I did was just compute the sentiment for each sentence separately, (one sentence per row in the df) then use an aggregation function of your choice to combine the scores of the message. Look at functions.sentiment() in functions.scala, it clearly says it cuts it down to the first sentence, unless you re-write this, there is no avoiding it. BUT because core-nlp is a tree algorithm, I would recommend keeping it as sentence-length, because it gets big very quickly otherwise. – Béatrice Moissinac Oct 12 '16 at 17:04
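The per-sentence workaround described in the last comment could be sketched in plain Scala like this; splitSentences, aggregateSentiment, the naive boundary regex, and the mean aggregation are all illustrative assumptions, not spark-corenlp code.

```scala
// Naive sentence splitter: break on ., ! or ? followed by whitespace.
def splitSentences(text: String): Seq[String] =
  text.split("""(?<=[.!?])\s+""").toSeq.filter(_.nonEmpty)

// Score each sentence separately, then aggregate (mean here); an empty
// message yields None rather than a crash downstream.
def aggregateSentiment(text: String, score: String => Int): Option[Double] = {
  val scores = splitSentences(text).map(score)
  if (scores.isEmpty) None else Some(scores.sum.toDouble / scores.size)
}
```

In Spark terms this would amount to exploding each email into one sentence per row, scoring each row, then something like groupBy("contactId").agg(avg("sentiment")) to recombine per message.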

0 Answers