2

I want to get the number of months between two dates, I'm reading the start date and end date from csv file.

id          startDate  endDate
100         5/1/2016   5/1/2017
200         5/2/2016   5/1/2017
300         5/2/2016   5/1/2017

My output should look like:

id          startDate  endDate     res
100         5/1/2016   5/1/2017    12
200         5/2/2016   5/1/2017    11
300         5/3/2016   5/1/2017    10

Please let me know what is the wrong in my code,

val data = spark.read.option("header", "true").csv("sample.csv");
val result = data.withColumn("res", withColumn("Months", ChronoUnit.MONTHS.between(startDate ,endDate)).show()
xaxxon
  • 19,189
  • 5
  • 50
  • 80
Rjj
  • 249
  • 1
  • 7
  • 22
  • What does your code currently give you? An error? And what is the column type of startDate and endDate? – Shaido Jan 22 '18 at 07:08

4 Answers4

1

Here's how you can do it with Spark:

import org.apache.spark.sql.functions

import spark.implicits._

val result = data.withColumn(
    "res",
    functions.months_between(
      functions.to_date($"endDate", "M/d/yyyy"),
      functions.to_date($"startDate", "M/d/yyyy")
    )
)

The second parameter for withColumn is of Spark's Column type. You can't pass an arbitrary Java/Scala expression to it.

Note that to_date was only added at Spark 2.2.0.

If you are on an older version of Spark, you can define a UDF, a custom function that will convert a string to a date in your specific format:

import java.time.format.DateTimeFormatter
import java.time.LocalDate
import java.sql.Date

val strToDate = functions.udf { 
    val fmt = DateTimeFormatter.ofPattern("M/d/yyyy")
    date: String => 
      Date.valueOf(LocalDate.parse(date, fmt)) 
}

Now, equipped with strToDate, we can convert our string columsn to dates and apply months_between:

val result = data.withColumn(
    "res",
    functions.months_between(
      strToDate($"endDate"),
      strToDate($"startDate")
    )
)
thesamet
  • 6,382
  • 2
  • 31
  • 42
  • Thanks for your quick reply. I am getting result column with data null.id startDate endDate monthsBetween 100 5/1/2016 5/1/2017 null 200 5/2/2016 5/1/2017 null 300 5/2/2016 5/1/2017 null – Rjj Jan 22 '18 at 07:44
  • @Chandra Looks like your dates are in string format. I updated the answer to show you how to convert them to dates, so you can use `months_between` on them. – thesamet Jan 22 '18 at 08:06
  • 1
    @thesamet to_date($"startDate", "MM/dd/yyyy") this one available only after Spark 2.2.0 – koiralo Jan 22 '18 at 08:11
  • @ShankarKoirala added this note to my answer. – thesamet Jan 22 '18 at 08:13
1
  1. Convert your columns to date datatype.
  2. You can use SQL datediff function.

Syntax :

val dt = sqlcontext.sql("SELECT DATEDIFF(month, start_date, end_date) AS DateDiff from relation")

You can refer the following link for datediff: Datediff

Here is a similar question : stackoverflow

0

You can use the provided spark function called months_between,

import org.apache.spark.sql.functions._

val result = data.withColumn("monthsBetween", months_between($"startDate", $"endDate"))
sarveshseri
  • 13,738
  • 28
  • 47
0

The issue with your code is only Type Casting. So during reading time, you need to infer your schema. If you want to print your schema, you can notice all the columns are String Type. Hence the months_between function is returning null Value.

data.printSchema()

root
|-- id: string (nullable = true)
|-- startDate: string (nullable = true)
|-- endDate: string (nullable = true)

You can use this below code:

val data = sqlContext.read
    .option("header", "true")
    .option("dateFormat", "d/M/yyyy")
    .option("inferSchema", "true")
    .csv("sample.csv")

data.printSchema()

root
|-- id: integer (nullable = true)
|-- startDate: timestamp (nullable = true)
|-- endDate: timestamp (nullable = true)
import org.apache.spark.sql.functions
val result = data.withColumn("res", functions.months_between($"endDate", $"startDate"))
    result.show()
+---+--------------------+--------------------+----+
| id|           startDate|             endDate| res|
+---+--------------------+--------------------+----+
|100|2016-01-05 00:00:...|2017-01-05 00:00:...|12.0|
|200|2016-02-05 00:00:...|2017-01-05 00:00:...|11.0|
|300|2016-03-05 00:00:...|2017-01-05 00:00:...|10.0|
+---+--------------------+--------------------+----+
Souvik
  • 377
  • 4
  • 16