28

PySpark's `between` function is not inclusive for timestamp input.

For example, if we want all rows between two dates, say, '2017-04-13' and '2017-04-14', then it performs an "exclusive" search when the dates are passed as strings, i.e., it omits the rows with a '2017-04-14 00:00:00' timestamp.

However, the documentation seems to hint that it is inclusive (with no reference to timestamps, though).

Of course, one way is to add a microsecond to the upper bound and pass that to the function; a rough sketch of this hack follows the example output below. However, that is not a great fix. Is there a clean way of doing an inclusive search?

Example:

import pandas as pd
from pyspark.sql import functions as F
# ... sql_context creation ...
test_pd = pd.DataFrame([{"start": '2017-04-13 12:00:00', "value": 1.0},
                        {"start": '2017-04-14 00:00:00', "value": 1.1}])
test_df = sql_context.createDataFrame(test_pd).withColumn("start", F.col("start").cast('timestamp'))
test_df.show()

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
|2017-04-14 00:00:...|  1.1|
+--------------------+-----+

test_df.filter(F.col("start").between('2017-04-13','2017-04-14')).show()

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
+--------------------+-----+
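
For reference, the microsecond hack mentioned above would look something like the sketch below. This is just an illustration of the workaround, assuming the bounds are passed as Python datetime objects rather than strings:

from datetime import datetime, timedelta

# pad the upper bound by one microsecond so the row at exactly
# '2017-04-14 00:00:00' is not dropped from the result
lower = datetime(2017, 4, 13)
upper = datetime(2017, 4, 14) + timedelta(microseconds=1)
test_df.filter(F.col("start").between(lower, upper)).show()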
Vinay Kolar

3 Answers

28

Found out the answer. PySpark's `between` function is inconsistent in handling timestamp inputs.

  1. If you provide the input as a string without a time component, it performs an exclusive search (not what we expect from the documentation linked above).
  2. If you provide the input as a datetime object or as a string with an exact time (e.g., '2017-04-14 00:00:00'), then it performs an inclusive search.

For the above example, here is the output when the bounds are passed as datetime objects (using pd.to_datetime), which gives an inclusive search:

test_df.filter(F.col("start").between(pd.to_datetime('2017-04-13'),pd.to_datetime('2017-04-14'))).show()

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
|2017-04-14 00:00:...|  1.1|
+--------------------+-----+

Similarly, if we provide both the date AND the time in string format, it performs an inclusive search:

test_df.filter(F.col("start").between('2017-04-13 12:00:00','2017-04-14 00:00:00')).show()

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
|2017-04-14 00:00:...|  1.1|
+--------------------+-----+
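
If you would rather not build the bounds with pandas, plain datetime objects from the standard library appear to behave the same way; this is a sketch equivalent to the pd.to_datetime call above:

from datetime import datetime

# datetime bounds are turned into timestamp literals, so both endpoints are kept
test_df.filter(F.col("start").between(datetime(2017, 4, 13),
                                      datetime(2017, 4, 14))).show()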
Vinay Kolar
  • Interesting point. But the output appears the same. Can you elaborate? – PR102012 Dec 15 '17 at 14:57
  • The above two approaches yield an inclusive search (and hence the same output). However, if we just pass the date as a string (see the question), we get an exclusive search. The point being: if you want an inclusive search, use either of the above approaches and do not use bare date strings (e.g., `F.between('2017-04-13','2017-04-14')`). – Vinay Kolar Dec 16 '17 at 18:09
  • @VinayKolar [Source code for the between() function](https://github.com/apache/spark/blob/v2.3.0/python/pyspark/sql/column.py#L551) does seem to suggest it's an inclusive search – Yuan Zhao Apr 14 '18 at 03:02
  • This is probably because the filtered column contains time. The search IS inclusive, but "2022-01-01 13:17" is not between "2021-12-31" and "2022-01-01" (which is implicitly "2022-01-01 00:00"). – MilitaryCoo Aug 05 '22 at 16:40
12

The .between() method is always inclusive. The problem in your example is that when you pass strings to .between(), it treats your data as strings as well. In string comparison, '2017-04-14 00:00:00' is strictly greater than '2017-04-14' because the former is a longer string than the latter, which is why the second row is filtered out in your example. To avoid the "inconsistency", pass the arguments to .between() as datetime objects:

from datetime import datetime as dt  # dt is the datetime class here

filtered_df = (test_df.filter(F.col("start")
                .between(dt.strptime('2017-04-13 12:00:00', '%Y-%m-%d %H:%M:%S'),
                         dt.strptime('2017-04-14 00:00:00', '%Y-%m-%d %H:%M:%S'))))

This will produce the expected result:

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
|2017-04-14 00:00:...|  1.1|
+--------------------+-----+
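
Another option along the same lines (my own sketch, not part of the original answer) is to keep the bounds as strings but cast them to timestamps explicitly, so the comparison is guaranteed to happen on timestamps:

filtered_df = test_df.filter(F.col("start").between(
    F.lit('2017-04-13').cast('timestamp'),    # 2017-04-13 00:00:00
    F.lit('2017-04-14').cast('timestamp')))   # 2017-04-14 00:00:00, inclusive
filtered_df.show()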
Anna K.
  • _'2017-04-14 00:00:00' is strictly greater than '2017-04-14'_ – It is a string-casting problem, but I think a different one than you describe. Internally, Spark seems to parse `'2017-04-14'` as `'2017-04-14 00:00:00'`, which leaves the OP's query non-inclusive at the end of the range. It has nothing to do with one string being longer than the other (otherwise you would not get any results at all, because all timestamps would be larger than your between condition). – Markus May 06 '21 at 09:06
4

Just to be clear, if you want to get data for a single date, it's better to specify the exact time.

ex) Retrieve data only for a single day (2017-04-13):

test_df.filter(F.col("start").between('2017-04-13 00:00:00','2017-04-13 23:59:59.59') 

cf) If you set the range as between '2017-04-13' and '2017-04-14', this will also include the 2017-04-14 00:00:00 rows, which technically isn't the data you want to pull out, since it belongs to 2017-04-14.
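
If relying on a 23:59:59.59 upper bound feels fragile (anything in the last fraction of a second of the day is silently dropped), a half-open range is a common alternative. This is my own sketch, not part of the answer above:

# keep everything on 2017-04-13 and nothing from the following day
test_df.filter((F.col("start") >= F.lit('2017-04-13').cast('timestamp')) &
               (F.col("start") < F.lit('2017-04-14').cast('timestamp'))).show()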

dkdlfls26