
I have a data set like below in the file test.txt:

149|898|20180405
135|379|20180428
135|381|20180406
31|898|20180429
31|245|20180430
135|398|20180422
31|448|20180420
31|338|20180421

I have created a DataFrame by executing the code below.

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("test").getOrCreate()
sc = spark.sparkContext
df_transac = spark.createDataFrame(sc.textFile("test.txt")\
    .map(lambda x: x.split("|")[:3])\
    .map(lambda r: Row(cCode=r[0], pCode=r[1], mDate=r[2])))
df_transac.show()

+-----+-----+----------+
|cCode|pCode|     mDate|
+-----+-----+----------+
|  149|  898| 20180405 |
|  135|  379| 20180428 |
|  135|  381| 20180406 |
|   31|  898| 20180429 |
|   31|  245| 20180430 |
|  135|  398| 20180422 |
|   31|  448| 20180420 |
|   31|  338| 20180421 |
+-----+-----+----------+

My df_transac.printSchema() output looks like below:

df_transac.printSchema()
root
 |-- cCode: string (nullable = true)
 |-- pCode: string (nullable = true)
 |-- mDate: string (nullable = true)

But I want to create a DataFrame filtered by my input dates, i.e. date1="20180425" and date2="20180501".

My expected output is:

+-----+-----+----------+
|cCode|pCode|     mDate|
+-----+-----+----------+
|  135|  379| 20180428 |
|   31|  898| 20180429 |
|   31|  245| 20180430 |
+-----+-----+----------+

Please help me with how I can achieve this.

pault
Sai

1 Answer

Here is a simple filter applied to your df:

df_transac.where("mdate between '{}' and '{}'".format(date1,date2)).show()

+-----+-----+--------+
|cCode|pCode|   mDate|
+-----+-----+--------+
|  135|  379|20180428|
|   31|  898|20180429|
|   31|  245|20180430|
+-----+-----+--------+
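As the comments below discuss, date1 and date2 can be ordinary Python variables interpolated into the filter string rather than hard-coded. A minimal sketch of just the string-building step (the variable names and values follow the question):

```python
# Hypothetical values taken from the question; plain Python strings.
date1 = "20180425"
date2 = "20180501"

# Build the SQL condition with str.format instead of hard-coding the dates;
# the resulting string is what df_transac.where(...) would receive.
condition = "mDate between '{}' and '{}'".format(date1, date2)
print(condition)  # mDate between '20180425' and '20180501'
```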
Steven
    This works but it relies on lexicographical ordering. You can also convert that string into a date as shown on [this post](https://stackoverflow.com/questions/38080748/convert-pyspark-string-to-date-format). – pault May 02 '18 at 14:19
  • @Steven, thanks for the quick reply. I want my mDate in string format, and it should be "20180428", not 2018-04-28. Can you please let me know how to keep the "yyyyMMdd" format? – Sai May 02 '18 at 14:36
  • @Steven, thanks for your help :) – Sai May 02 '18 at 14:43
  • @Steven how can I pass the date values dynamically to the query? Here they are hard-coded. – Sai May 02 '18 at 15:37
  • @saikumar it depends on where your date values are: in a Spark DataFrame, or Python args? – Steven May 02 '18 at 15:40
  • Python args. Consider that I assign my date values as date1='20180405' and date2='20180501'. How can I pass these date values to the query instead of hard-coding them inside the query? – Sai May 02 '18 at 17:11
  • @Steven the first query works as expected, but the second query returns an empty DataFrame. I guess it's because of mdate_filter; I am not sure why it is converting the date into "2018-04-05" if we are using "yyyyMMdd". – Sai May 02 '18 at 17:47
  • @saikumar you can't use the second one with your date; just use the first one. – Steven May 02 '18 at 18:00
  • @saikumar if the solution is working, you can probably accept it. – Steven May 03 '18 at 08:44
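As pault's comment notes, the string filter relies on lexicographical ordering: zero-padded yyyyMMdd strings sort in the same order as the dates they represent. A small plain-Python sketch of that property, using the sample data from the question:

```python
# yyyyMMdd strings taken from the question's sample data.
dates = ["20180405", "20180428", "20180406", "20180429", "20180430",
         "20180422", "20180420", "20180421"]

# For this fixed-width, zero-padded format, string comparison matches
# chronological comparison, so a string `between` filter is safe.
in_range = [d for d in dates if "20180425" <= d <= "20180501"]
print(sorted(in_range))  # ['20180428', '20180429', '20180430']
```

This is exactly why the answer's `between '20180425' and '20180501'` returns the three expected rows without converting mDate to a date type first.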