I have a managed filesystem dataset with 7 million records, and I need to get the maximum value of one of its date columns.
```python
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read the Dataiku managed dataset as a Spark DataFrame
some_table = dataiku.Dataset("some_table")
df = dkuspark.get_dataframe(sqlContext, some_table)

# Collect the distinct dates to the driver
# (PySpark DataFrames have distinct(), not unique())
df.select('date').distinct().collect()
```
The last line, `df.select('date').distinct().collect()`, takes a lot of memory to run.
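From what I understand, `collect()` pulls every distinct date back to the driver, which would explain the memory pressure. For reference, here is a minimal sketch of the DataFrame aggregation route I have been considering, reusing the `df` from above (`max_date` is just an illustrative name):

```python
from pyspark.sql import functions as F

# The max is computed on the executors; only a single row
# comes back to the driver.
max_date = df.agg(F.max('date').alias('max_date')).first()['max_date']
print(max_date)
```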
Or can I do this with PySpark's RDD API instead?
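To make the question concrete, this is the kind of RDD version I have in mind, again reusing `df` from above; I am not sure whether it actually behaves better memory-wise:

```python
# Unwrap the Row objects, drop nulls, and take the distributed max.
max_date_rdd = (
    df.select('date').rdd
      .map(lambda row: row[0])          # Row -> plain value
      .filter(lambda d: d is not None)  # None values would break max()
      .max()
)
print(max_date_rdd)
```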