
I have directories/files in S3 with the structure below.

root/
    20180101/files.txt
    20180102/files.txt
    20180103/files.txt

Now I want to pass a date range as start_date=20180101 and end_date=20180102. I want the PySpark code to read files from the directories included in this range. How can I achieve this?

**The range is configurable, i.e. it can be 1 week / 30 days / 90 days.**


1 Answer


I created a list of paths covering the date range and passed it to sc.textFile().

import datetime

start = datetime.datetime.strptime(start_date, '%Y%m%d')
end = datetime.datetime.strptime(end_date, '%Y%m%d')
step = datetime.timedelta(days=1)
paths = []

# Build one path per day in the inclusive [start, end] range.
while start <= end:
    paths.append(s3_input_path + start.strftime("%Y%m%d") + "/")
    start += step

# sc.textFile() accepts a comma-separated string of paths.
str1 = ','.join(paths)
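For completeness, here is a minimal sketch of how the joined string could then be consumed. It assumes `start_date`, `end_date`, and `s3_input_path` are defined elsewhere (they are placeholders in the snippet above), and the SparkContext setup and final action here are illustrative, not part of the original answer:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Spark's textFile() treats a comma-separated string as multiple
    # input paths, so every daily directory in the range is read into
    # a single RDD of lines.
    rdd = sc.textFile(str1)
    print(rdd.count())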