
How can we select only the end-of-month dates using PySpark?

Setup

import numpy as np
import pandas as pd


import pyspark
spark = pyspark.sql\
          .SparkSession\
          .builder\
          .appName('app')\
          .getOrCreate()

# sql
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark import SparkConf, SparkContext, SQLContext

sc = spark.sparkContext
sqlContext = SQLContext(sc)  # deprecated since Spark 3.0; spark.createDataFrame works directly

df = pd.DataFrame({
    'id': np.random.randint(0, 100000, 365*3),
    'date': pd.date_range('2010-01-01', periods=365*3)  # three years of daily dates
})


df.head()
sdf = sqlContext.createDataFrame(df)
sdf.printSchema()
sdf.show(5)

# create table
sdf.createOrReplaceTempView("MyTable")
spark.sql("select * from MyTable limit 2").show()

My Attempt

q = """
SELECT id,date
FROM MyTable
WHERE DAY(DATE_ADD(date, INTERVAL 1 DAY)) = 1
ORDER BY id

"""

spark.sql(q).show()

The query q works in MySQL-flavored SQL, but fails in Spark SQL, which does not accept the INTERVAL clause inside DATE_ADD.
How can I make it work in PySpark?
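
The immediate failure is the MySQL-style INTERVAL clause: Spark SQL's DATE_ADD takes a plain integer number of days as its second argument. A minimal sketch of the corrected query against the same MyTable view:

q = """
SELECT id, date
FROM MyTable
WHERE DAY(DATE_ADD(date, 1)) = 1  -- the day after a month-end date is always the 1st
ORDER BY id
"""

spark.sql(q).show()

Spark also ships a LAST_DAY function, so an equivalent DataFrame-API sketch (casting the timestamp column to a date so the equality comparison is exact) would be:

sdf.filter(F.col('date').cast('date') == F.last_day('date')) \
   .orderBy('id') \
   .show()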
