
I am getting the error:

org.apache.spark.sql.AnalysisException: cannot resolve 'year'

My input data:

1,2012-07-21,2014-04-09

My code:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class c (id:Int,start:String,end:String)
val c1 = sc.textFile("date.txt")
val c2 = c1.map(_.split(",")).map(r=>(c(r(0).toInt,r(1).toString,r(2).toString)))
val c3 = c2.toDF();
c3.registerTempTable("c4")
val r = sqlContext.sql("select id,datediff(year,to_date(end), to_date(start)) AS date from c4")

What can I do to resolve the above error?

I have tried the following code, but I got the output in days and I need it in years:

val r = sqlContext.sql("select id,datediff(to_date(end), to_date(start)) AS date from c4")

Please advise me if I can use any function like to_date to get the year difference.

SkyWalker
  • Divide by 365? There is also a date part that allows you to specify years. See https://learn.microsoft.com/en-us/sql/t-sql/functions/datediff-transact-sql Google is your friend. – Missy Sep 19 '17 at 15:30
  • Thanks for your input. I have tried dividing by 365 using UDFs @Missy –  Sep 20 '17 at 14:27
  • The link to learn.microsoft.com refers to T-SQL (i.e. the procedural language used in SQL Server), it does not refer to Spark. – jamiet Oct 26 '17 at 12:49
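To make the comment thread concrete: the "divide by 365" suggestion can be illustrated in plain Scala with java.time, using the sample row from the question. This is a sketch of the arithmetic only, not Spark code; ChronoUnit.YEARS gives an exact count of complete years, which the day-division only approximates.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

val start = LocalDate.parse("2012-07-21")
val end   = LocalDate.parse("2014-04-09")

// Approximation suggested in the comments: whole days divided by 365
val approxYears = ChronoUnit.DAYS.between(start, end) / 365  // integer division

// Exact count of complete years between the two dates
val exactYears = ChronoUnit.YEARS.between(start, end)

println(s"approx = $approxYears, exact = $exactYears")
```

For these two dates both approaches agree, but around leap years or near-anniversary dates the day-division can be off by one.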

4 Answers


Another simple approach is to cast the string columns to DateType in Spark SQL and then apply the SQL date and time functions to them, like the following:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val c4 = c3.select(col("id"), col("start").cast(DateType), col("end").cast(DateType))

c4.withColumn("dateDifference", datediff(col("end"),col("start")))
  .withColumn("monthDifference", months_between(col("end"),col("start")))
  .withColumn("yearDifference", year(col("end"))-year(col("start")))
  .show()
Rishikesh Teke
  • Hello Rishikesh. What happens when the difference is, say, 9 years and 10 months? Would it round it off to 10 years? – Bharath Oct 12 '17 at 19:45
  • @Bharath No, it won't round off. To do that, check the Spark documentation for the round function, or you can create a separate UDF for it. – Rishikesh Teke Oct 13 '17 at 12:27

One of the above answers doesn't return the right year when there are fewer than 365 days between the two dates. The example below returns the correct year difference and rounds the month and year values to 2 decimal places.

Seq(("2019-07-01"),("2019-06-24"),("2019-08-24"),("2018-12-23"),("2018-07-20")).toDF("startDate").select(
col("startDate"),current_date().as("endDate"))
.withColumn("datesDiff", datediff(col("endDate"),col("startDate")))
.withColumn("montsDiff", months_between(col("endDate"),col("startDate")))
.withColumn("montsDiff_round", round(months_between(col("endDate"),col("startDate")),2))
.withColumn("yearsDiff", months_between(col("endDate"),col("startDate"),true).divide(12))
.withColumn("yearsDiff_round", round(months_between(col("endDate"),col("startDate"),true).divide(12),2))
.show()

Outputs:

+----------+----------+---------+-----------+---------------+--------------------+---------------+
| startDate|   endDate|datesDiff|  montsDiff|montsDiff_round|           yearsDiff|yearsDiff_round|
+----------+----------+---------+-----------+---------------+--------------------+---------------+
|2019-07-01|2019-07-24|       23| 0.74193548|           0.74| 0.06182795666666666|           0.06|
|2019-06-24|2019-07-24|       30|        1.0|            1.0| 0.08333333333333333|           0.08|
|2019-08-24|2019-07-24|      -31|       -1.0|           -1.0|-0.08333333333333333|          -0.08|
|2018-12-23|2019-07-24|      213| 7.03225806|           7.03|         0.586021505|           0.59|
|2018-07-20|2019-07-24|      369|12.12903226|          12.13|  1.0107526883333333|           1.01|
+----------+----------+---------+-----------+---------------+--------------------+---------------+

You can find a complete working example at the URL below:

https://sparkbyexamples.com/spark-calculate-difference-between-two-dates-in-days-months-and-years/

Hope this helps.

Happy Learning !!

NNK
val r = sqlContext.sql("select id,datediff(year,to_date(end), to_date(start)) AS date from c4")

In the above code, "year" is not a column in the DataFrame, i.e. it is not a valid column in table "c4". That is why the AnalysisException is thrown: the query is invalid because it cannot resolve a "year" column.

Use a Spark user-defined function (UDF) instead; that will be a more robust approach.
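A minimal sketch of such a UDF, assuming the start/end columns are ISO date strings as in the question's input (the function name yearsBetween and the registration lines are illustrative, not part of the original answer):

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Core logic: count the complete years between two ISO-8601 date strings
def yearsBetween(start: String, end: String): Long =
  ChronoUnit.YEARS.between(LocalDate.parse(start), LocalDate.parse(end))

// In a Spark shell you could then wrap it with udf() and apply it, e.g.:
//   val yearsBetweenUdf = org.apache.spark.sql.functions.udf(yearsBetween _)
//   c3.withColumn("years", yearsBetweenUdf(col("start"), col("end")))

println(yearsBetween("2012-07-21", "2014-04-09"))  // complete years for the sample row
```

Unlike dividing datediff by 365, ChronoUnit.YEARS handles leap years and partial years correctly, returning only fully elapsed years.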

T. Gawęda
hagarwal
  • @AshSr SparkSQL's datediff() function does not have an `interval` parameter (or similar) at the time of writing. – jamiet Oct 26 '17 at 12:53

Since datediff only returns the difference in days, I prefer to use my own UDF.

import java.sql.Timestamp
import java.time.Instant
import java.time.temporal.ChronoUnit

import org.apache.spark.sql.functions.{udf, col}
import org.apache.spark.sql.DataFrame

def timeDiff(chronoUnit: ChronoUnit)(dateA: Timestamp, dateB: Timestamp): Long = {
    chronoUnit.between(
      Instant.ofEpochMilli(dateA.getTime),
      Instant.ofEpochMilli(dateB.getTime)
    )
}

def withTimeDiff(dateA: String, dateB: String, colName: String, chronoUnit: ChronoUnit)(df: DataFrame): DataFrame = {
    val timeDiffUDF = udf[Long, Timestamp, Timestamp](timeDiff(chronoUnit))
    df.withColumn(colName, timeDiffUDF(col(dateA), col(dateB)))
}

Then I call it as a dataframe transformation.

df.transform(withTimeDiff("sleepTime", "wakeupTime", "minutes", ChronoUnit.MINUTES))
Franzi