When I use sc.textFile('*.txt')
I read everything.
I'd like to be able to filter out several files.
e.g. How can I read all files except ['bar.txt', 'foo.txt']?
This is more of a workaround:

Get the file list:

import os

# Each output line of `hadoop fs -ls` ends with the full file path;
# take the last field and skip the "Found N items" header / blank lines.
lines = os.popen('hadoop fs -ls <your dir>').readlines()
file_list = [line.rstrip().split()[-1] for line in lines if line.strip()]

Filter it:

exclude = ['bar.txt', 'foo.txt']
file_list = [x for x in file_list
             if os.path.basename(x) not in exclude and x.endswith('.txt')]

Read it:

# sc.textFile accepts a comma-separated string of paths.
rdd = sc.textFile(','.join(file_list))
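If shelling out to the hadoop CLI is awkward, the same listing can be done through the Hadoop FileSystem API that PySpark exposes on its (internal) JVM gateway. A minimal sketch, assuming a running SparkContext named sc; the directory '/your/dir' and the exclusion list are placeholders:

# sc._jvm and sc._jsc are internal PySpark handles to the JVM gateway.
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())

exclude = {'bar.txt', 'foo.txt'}
# listStatus returns FileStatus objects; keep the .txt files not excluded.
paths = [status.getPath().toString()
         for status in fs.listStatus(Path('/your/dir'))
         if status.getPath().getName().endswith('.txt')
         and status.getPath().getName() not in exclude]

rdd = sc.textFile(','.join(paths))

This avoids parsing shell output, at the cost of relying on PySpark internals rather than a public API.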
PySpark will skip empty Parquet files when reading multiple files from S3. Use the s3a:// scheme when reading, and the empty files are skipped. The only condition is that at least some of the files must be non-empty; they can't all be empty.

files_path = 's3a://my-bucket/obj1/obj2/data'
df = spark.read.parquet(files_path)
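The "not all empty" condition matters because Parquet schema inference needs at least one readable footer. A minimal sketch of guarding for that case, assuming a SparkSession named spark; the S3 path is a placeholder:

from pyspark.sql.utils import AnalysisException

try:
    df = spark.read.parquet('s3a://my-bucket/obj1/obj2/data')
except AnalysisException:
    # Raised when Spark cannot infer a schema,
    # e.g. when every file under the prefix is empty.
    df = None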