I have a PySpark (2.3.0) DataFrame with a column of type timestamp:
>>> df.show()
+-------------------+
|             column|
+-------------------+
|2004-02-16 12:01:37|
|2004-02-23 10:28:49|
|2004-02-23 12:49:14|
|2004-02-26 12:29:58|
|2004-03-02 10:10:28|
|2004-03-03 03:40:13|
|2004-03-16 05:00:10|
|2004-03-16 03:28:21|
|2004-03-17 02:45:22|
|2004-03-23 08:14:47|
+-------------------+
>>> df.printSchema()
root
|-- column: timestamp (nullable = true)
I want to filter that dataframe to find records on a specific date:
import datetime
import pyspark.sql.functions as psf

date = datetime.datetime.strptime('2018-06-07', '%Y-%m-%d').date()
What is the most efficient way to do this filtering? Note: the data is read in via JDBC, so it may not be distributed.
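For context, the load looks roughly like this (the JDBC URL, table name, and credentials below are placeholders, not the real values):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# No partitionColumn/lowerBound/upperBound/numPartitions are set, so the
# whole table comes back as a single partition, which is why it may not
# be distributed.
df = (
    spark.read
    .format('jdbc')
    .option('url', 'jdbc:postgresql://host:5432/mydb')  # placeholder URL
    .option('dbtable', 'my_table')                      # placeholder table
    .option('user', 'username')                         # placeholder
    .option('password', 'password')                     # placeholder
    .load()
)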
Here is what I have tried (I have not noticed major differences between the two). Which is preferable? Have I missed anything?
Method 1: Cast as date
df.filter(psf.col('column').cast('date') == date)
Method 2: match on year, month, and dayofmonth
(
    df
    .filter(psf.dayofmonth('column') == date.day)
    .filter(psf.month('column') == date.month)
    .filter(psf.year('column') == date.year)
)
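For reference, a minimal DataFrame equivalent to the sample above can be rebuilt like this (only a few of the rows shown, cast from strings to timestamps), in case that helps with testing:

from pyspark.sql import SparkSession
import pyspark.sql.functions as psf

spark = SparkSession.builder.getOrCreate()

# Recreate a small slice of the sample data as a timestamp column.
df = spark.createDataFrame(
    [('2004-02-16 12:01:37',),
     ('2004-02-23 10:28:49',),
     ('2004-02-23 12:49:14',)],
    ['column'],
).withColumn('column', psf.col('column').cast('timestamp'))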