What's the simplest/fastest way to get the partition keys? Ideally into a Python list.
Ultimately I want to use this to avoid processing data from partitions that have already been processed. So in the example below I only want to process data from day 3, but there may be more than one day to process.
Let's say the directory structure is
date_str=2010-01-01
date_str=2010-01-02
date_str=2010-01-03
Reading the dataframe picks up the partition column from the directory names:
ddf2 = spark.read.csv("data/bydate")  # date_str shows up as a column via partition discovery
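For context, once the keys are in a Python list, the filtering step I have in mind is roughly this (processed_days is just hypothetical bookkeeping I'd keep myself, and all_partition_keys stands in for whatever the fast lookup returns; untested):

from pyspark.sql import functions as F

# hypothetical bookkeeping: keys already handled in earlier runs
processed_days = {"2010-01-01", "2010-01-02"}
# placeholder for whatever the partition-key lookup returns (string form; cast if date_str is read as a date)
all_partition_keys = ["2010-01-01", "2010-01-02", "2010-01-03"]

# keep only rows from partitions that haven't been processed yet
unprocessed = [d for d in all_partition_keys if d not in processed_days]
to_process = ddf2.filter(F.col("date_str").isin(unprocessed))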
Solutions I have tried are below. They look excessively wordy and I'm not sure they are fast. The query shouldn't need to read any data, since it only needs to check the directory keys.
from pyspark.sql import functions as F
ddf2.select(F.collect_set('date_str').alias('date_str')).first()['date_str']
# seems to work well albeit wordy
ddf2.select("date_str").distinct().collect()
# [Row(date_str=datetime.date(2010, 1, 10)), Row(date_str=datetime.date(2010, 1, 7)),
# not a python list and slow?
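Turning that into a plain Python list is simple enough, though presumably it still runs a distinct over the data (untested):

date_keys = [row["date_str"] for row in ddf2.select("date_str").distinct().collect()]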
ddf2.createOrReplaceTempView("intent")
spark.sql("""show partitions intent""").toPandas()
# errors out: intent is only a temp view, not a partitioned table in the catalog
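As far as I can tell, SHOW PARTITIONS only works against a partitioned table registered in the catalog, which a temp view created from a DataFrame is not. Something like the following would presumably work, but it means writing the data out first (table name made up, untested):

ddf2.write.partitionBy("date_str").mode("overwrite").saveAsTable("intent_tbl")
spark.sql("SHOW PARTITIONS intent_tbl").show()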
ddf2.rdd.getNumPartitions()
# returns only the number of RDD partitions, not the keys, and that count doesn't even match the number of partition directories
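What I'm really after is something that just lists the partition directories without touching the data. For a local path I could do it with plain Python (sketch below, untested, and probably not the right approach if the data lives on HDFS/S3):

import os

# list the date_str=... sub-directories under the base path and strip the key prefix
base = "data/bydate"
partition_keys = sorted(
    name.split("=", 1)[1]
    for name in os.listdir(base)
    if name.startswith("date_str=")
)
# ['2010-01-01', '2010-01-02', '2010-01-03']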