
I have a data set like below in the file test.txt:

149|898|20180405
135|379|20180428
135|381|20180406
31|898|20180429
31|245|20180430
135|398|20180422
31|448|20180420
31|338|20180421

I have created a DataFrame by executing the code below.

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("test").getOrCreate()
sc = spark.sparkContext
df_transac = spark.createDataFrame(sc.textFile("test.txt")\
    .map(lambda x: x.split("|")[:3])\
    .map(lambda r: Row(cCode=r[0], pCode=r[1], mDate=r[2])))
df_transac.show()

+-----+-----+----------+
|cCode|pCode|     mDate|
+-----+-----+----------+
|  149|  898| 20180405 |
|  135|  379| 20180428 |
|  135|  381| 20180406 |
|   31|  898| 20180429 |
|   31|  245| 20180430 |
|  135|  398| 20180422 |
|   31|  448| 20180420 |
|   31|  338| 20180421 |
+-----+-----+----------+

My df_transac.printSchema() output looks like below:

df_transac.printSchema()
root
 |-- cCode: string (nullable = true)
 |-- pCode: string (nullable = true)
 |-- mDate: string (nullable = true)

But I want to create a DataFrame filtered by my input dates, i.e. date1="20180425" and date2="20180501".

My expected output is:

+-----+-----+----------+
|cCode|pCode|     mDate|
+-----+-----+----------+
|  135|  379| 20180428 |
|   31|  898| 20180429 |
|   31|  245| 20180430 |
+-----+-----+----------+

Please help me with how I can achieve this.

pault
Sai

1 Answer

Here is a simple filter applied to your df:

df_transac.where("mdate between '{}' and '{}'".format(date1,date2)).show()

+-----+-----+--------+
|cCode|pCode|   mDate|
+-----+-----+--------+
|  135|  379|20180428|
|   31|  898|20180429|
|   31|  245|20180430|
+-----+-----+--------+
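As the comments below discuss, date1 and date2 can be ordinary Python variables interpolated into the filter string rather than hard-coded. A minimal sketch of just the string-building step (the variable names and values follow the question):

```python
# Hypothetical values taken from the question; plain Python strings.
date1 = "20180425"
date2 = "20180501"

# Build the SQL condition with str.format instead of hard-coding the dates;
# the resulting string is what df_transac.where(...) would receive.
condition = "mDate between '{}' and '{}'".format(date1, date2)
print(condition)  # mDate between '20180425' and '20180501'
```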
Steven
    This works but it relies on lexicographical ordering. You can also convert that string into a date as shown on [this post](https://stackoverflow.com/questions/38080748/convert-pyspark-string-to-date-format). – pault May 02 '18 at 14:19
  • @Steven, thanks for the quick reply. I want my mDate in string format, and it should be "20180428", not 2018-04-28. Can you please let me know how to keep the "yyyyMMdd" format? – Sai May 02 '18 at 14:36
  • @Steven, thanks for your help :) – Sai May 02 '18 at 14:43
  • @Steven how can I pass the date values dynamically to the query? Here they are hard-coded. – Sai May 02 '18 at 15:37
  • @saikumar it depends on where your date values are: in a Spark DataFrame, or Python args? – Steven May 02 '18 at 15:40
  • Python args. Consider that I assign my date values as date1='20180405' and date2='20180501'. How can I pass these date values to the query instead of hard-coding them inside the query? – Sai May 02 '18 at 17:11
  • @Steven the first query works as expected, but the second query returns an empty DataFrame. I guess it's because of mdate_filter; I am not sure why it is converting the date into "2018-04-05" if we are using "yyyyMMdd". – Sai May 02 '18 at 17:47
  • @saikumar you can't use the second one with your date; just use the first one. – Steven May 02 '18 at 18:00
  • @saikumar if the solution is working, you can probably accept it. – Steven May 03 '18 at 08:44
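As pault's comment notes, the string filter relies on lexicographical ordering: zero-padded yyyyMMdd strings sort in the same order as the dates they represent. A small plain-Python sketch of that property, using the sample data from the question:

```python
# yyyyMMdd strings taken from the question's sample data.
dates = ["20180405", "20180428", "20180406", "20180429", "20180430",
         "20180422", "20180420", "20180421"]

# For this fixed-width, zero-padded format, string comparison matches
# chronological comparison, so a string `between` filter is safe.
in_range = [d for d in dates if "20180425" <= d <= "20180501"]
print(sorted(in_range))  # ['20180428', '20180429', '20180430']
```

This is exactly why the answer's `between '20180425' and '20180501'` returns the three expected rows without converting mDate to a date type first.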