
I have a managed file system dataset with 7 million records, and I need to get the maximum value of one of its date columns.

import pyspark
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

some_table = dataiku.Dataset("some_table")
df = dkuspark.get_dataframe(sqlContext, some_table)
df.select('date').distinct().collect()

That last line takes a lot of memory to run.
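
Most of that memory pressure comes from collect() shipping every distinct date back to the driver. If only the maximum is needed, a minimal aggregation-only sketch (assuming df is the dataframe loaded above; max_date is just an illustrative name) keeps the reduction on the executors and returns a single value:

from pyspark.sql import functions as F

# Aggregate on the executors; only a single row is returned to the driver.
max_date = df.agg(F.max('date').alias('max_date')).collect()[0]['max_date']
print(max_date)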

Can I run it with the PySpark RDD API instead?
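
For what it's worth, the same reduction can be expressed on the underlying RDD, although the DataFrame aggregation above is usually faster because Catalyst can optimize it. A sketch of the RDD route, again assuming df from above and that the date column may contain nulls:

# Map each Row to its date, drop nulls (None is not orderable in Python 3),
# then take the maximum with the RDD API.
max_date = (
    df.rdd
      .map(lambda row: row['date'])
      .filter(lambda d: d is not None)
      .max()
)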

Shrinivas Mese
  • Please use universal measurements instead of local words like *lakh* that not everyone understands. – James Z Jul 17 '23 at 09:39

0 Answers