2

I'm new to in Scala, I have dataframe where I'm trying one column of dataframe to date from string in other word like below

1)    yyyyMMddHHmmss(20150610120256) ->yyyy-MM-dd HH:mm:ss(2015-06-10 12:02:56) 
2)    yyyyMMddHHmmss(20150611      ) ->yyyy-MM-dd(2015-06-11)

First case i'm able achieve successfully but problem with second case where time is miss due to of this i'm not bale to convert into date.More details you could get below.Any help will be appreciated.

df.printSchema
root
 |-- TYPE: string (nullable = true)
 |-- CODE: string (nullable = true)
 |-- SQ_CODE: string (nullable = true)
 |-- RE_TYPE: string (nullable = true)
 |-- VERY_ID: long (nullable = true)
 |-- IN_DATE: string (nullable = true)


df.show
Input  
+-----+-------+---------+---------+-------------------+-----------------+
| TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE        |
+-----+-------+---------+---------+-------------------+-----------------+
|   F | 000544|  2017002|      OP |  95032015062763298| 20150610120256  |
|   F | 000544|  2017002|      LD |  95032015062763261| 20150611        |
|   F | 000544|  2017002|      AK |  95037854336743246| 20150611012356  |
+-----+-------+---------+--+------+-------------------+-----------------+

df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
        to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("date")))
        .otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp")))

Actual output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE             |
+-----+-------+---------+---------+-------------------+----------------------+
|   F | 000544|  2017002|      OP |  95032015062763298| 2015-06-10 12:02:56  |
|   F | 000544|  2017002|      LD |  95032015062763261| null                 |
|   F | 000544|  2017002|      AK |  95037854336743246| 2015-06-11 01:23:56  |
+-----+-------+---------+--+------+-------------------+----------------------+

df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
        to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("timestamp")))
        .otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp")))

Actual output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE             |
+-----+-------+---------+---------+-------------------+----------------------+
|   F | 000544|  2017002|      OP |  95032015062763298| 2015-06-10 12:02:56  |
|   F | 000544|  2017002|      LD |  95032015062763261| 2015-06-11 00:00:00  |
|   F | 000544|  2017002|      AK |  95037854336743246| 2015-06-11 01:23:56  |
+-----+-------+---------+--+------+-------------------+----------------------+


Expected output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE             |
+-----+-------+---------+---------+-------------------+----------------------+
|   F | 000544|  2017002|      OP |  95032015062763298| 2015-06-10 12:02:56  |
|   F | 000544|  2017002|      LD |  95032015062763261| 2015-06-11           |
|   F | 000544|  2017002|      AK |  95037854336743246| 2015-06-11 01:23:56  |
+-----+-------+---------+--+------+-------------------+----------------------+
OneCricketeer
  • 179,855
  • 19
  • 132
  • 245
user1363308
  • 938
  • 4
  • 14
  • 33
  • That would be impossible as both dates datatype is different. one is TimestampType and the other is DateType and you can't really have two schemas for the same column. – Ramesh Maharjan Sep 02 '17 at 07:56

4 Answers4

2

I'd

  • Choose more precise data type - here TimestampType.
  • coalesce with different formats.
import org.apache.spark.sql.functions._

val df = Seq("20150610120256", "20150611").toDF("IN_DATE")

df.withColumn("IN_DATE", coalesce(
  to_timestamp($"IN_DATE", "yyyyMMddHHmmss"), 
  to_timestamp($"IN_DATE", "yyyyMMdd"))).show


+-------------------+
|            IN_DATE|
+-------------------+
|2015-06-10 12:02:56|
|2015-06-11 00:00:00|
+-------------------+
Alper t. Turker
  • 34,230
  • 9
  • 83
  • 115
1

There are several options to achieve a date parser.

  1. Use the built in spark sql function TODATE(). Here's an example of that implementation.
  2. Create a user defined function where you can do different date parsing based on the input format you like, and return the string. Read more about UDF's here.
Sohum Sachdev
  • 1,397
  • 1
  • 11
  • 23
1

2015-06-11 format is spark.sql.types.DateType and 2015-06-10 12:02:56 is spark.sql.types.TimestampType

You can't have two dataType on the same column. A schema should have only one dataType for each columns.

I would suggest you to create two new columns and have the format you desire in them as

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DateType, TimestampType}
df.withColumn("IN_DATE_DateOnly",from_unixtime(unix_timestamp(df("IN_DATE"),"yyyyMMdd")).cast(DateType))
  .withColumn("IN_DATE_DateAndTime",unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmSS").cast(TimestampType)) 

this will give you dataframe as

+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
|TYPE|CODE  |SQ_CODE|RE_TYPE|VERY_ID          |IN_DATE       |IN_DATE_DateOnly|IN_DATE_DateAndTime  |
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
|F   |000544|2017002|OP     |95032015062763298|20150610120256|null            |2015-06-10 12:02:00.0|
|F   |000544|2017002|LD     |95032015062763261|20150611      |2015-06-11      |null                 |
|F   |000544|2017002|AK     |95037854336743246|20150611012356|null            |2015-06-11 01:23:00.0|
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+

You can see that the dataType is different

root
 |-- TYPE: string (nullable = true)
 |-- CODE: string (nullable = true)
 |-- SQ_CODE: string (nullable = true)
 |-- RE_TYPE: string (nullable = true)
 |-- VERY_ID: string (nullable = true)
 |-- IN_DATE: string (nullable = true)
 |-- IN_DATE_DateOnly: date (nullable = true)
 |-- IN_DATE_DateAndTime: timestamp (nullable = true)

I hope the answer is helpful

Ramesh Maharjan
  • 41,071
  • 6
  • 69
  • 97
0

Try This query

df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast(DateType)))
.otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmSS").cast(TimestampType)))
KARTHIKEYAN.A
  • 18,210
  • 6
  • 124
  • 133