
I have some heavy logs on my cluster, and I've written all of them to Parquet with the following partition scheme:

PARTITION_YEAR=2017/PARTITION_MONTH=07/PARTITION_DAY=12

For example, if I want to select all my logs between 2017/07/12 and 2017/08/10, is there a way to do it efficiently? Or do I have to loop over all the days and read the partitions one by one?

Thanks,

RobinFrcd

1 Answer


You can use glob patterns (Hadoop filesystem globs, not regular expressions) when loading files in PySpark; the pattern below matches July days 12-31 and August days 01-10:

input_path = "PARTITION_YEAR=2017/PARTITION_MONTH=0{7/PARTITION_DAY={1[2-9],[2-3]*},8/PARTITION_DAY={0[1-9],10}}"
df = spark.read.parquet(input_path)

You can also generate a comma-separated list of paths:

input_path = ",".join(["PARTITION_YEAR=2017/PARTITION_MONTH=07/PARTITION_DAY=" + str(x) for x in range(12, 32)]) \
+ ",".join(["PARTITION_YEAR=2017/PARTITION_MONTH=08/PARTITION_DAY=" + str(x) for x in range(1, 11)]) 

or using dates:

import datetime as dt
d1 = dt.date(2017,7,12)
d2 = dt.date(2017,8,10)

date_list = [d1 + dt.timedelta(days=x) for x in range(0, (d2 - d1).days + 1)]
input_path = ",".join(["PARTITION_YEAR=2017/PARTITION_MONTH=%02d/PARTITION_DAY=%02d" % (d.month, d.day) for d in  date_list])
MaFF