
When I use sc.textFile('*.txt') I read every matching file.

I'd like to be able to exclude several files.

E.g., how can I read all files except ['bar.txt', 'foo.txt']?

    Possible duplicate of [How to use regex to include/exclude some input files in sc.textFile?](http://stackoverflow.com/questions/31782763/how-to-use-regex-to-include-exclude-some-input-files-in-sc-textfile) – Yaron Jan 17 '17 at 08:48

2 Answers


This is more of a workaround:

get file list:

import os

# `hadoop fs -ls` prints a "Found N items" header plus one row per file;
# the path is the last whitespace-separated column, so keep only the name.
lines = os.popen('hadoop fs -ls <your dir>').readlines()
file_list = [os.path.basename(x.split()[-1])
             for x in lines if not x.startswith('Found')]

Filter it:

file_list = [x for x in file_list
             if x not in ['bar.txt', 'foo.txt'] and x.endswith('.txt')]

Read it:

# sc.textFile takes a single string; pass multiple paths comma-separated
rdd = sc.textFile(','.join('<your dir>/' + x for x in file_list))
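The list-then-filter step can be checked locally without a cluster. A minimal sketch in plain Python, using a hypothetical sample of `hadoop fs -ls`-style output (the file names and sizes below are made up for illustration):

```python
import os

# Hypothetical `hadoop fs -ls` output: a header line plus one row per
# file, with the path in the last whitespace-separated column.
ls_output = [
    'Found 3 items\n',
    '-rw-r--r--   3 user group   1024 2017-01-17 08:48 /data/foo.txt\n',
    '-rw-r--r--   3 user group   2048 2017-01-17 08:48 /data/bar.txt\n',
    '-rw-r--r--   3 user group   4096 2017-01-17 08:48 /data/keep.txt\n',
]

# Step 1: extract bare file names, skipping the header line.
names = [os.path.basename(x.split()[-1])
         for x in ls_output if not x.startswith('Found')]

# Step 2: drop the excluded files, keep only .txt files.
kept = [x for x in names
        if x not in ['bar.txt', 'foo.txt'] and x.endswith('.txt')]
print(kept)  # ['keep.txt']
```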

PySpark will skip empty Parquet files when reading multiple files from S3 via the S3A connector. The only condition is that at least some of the files are non-empty; the input can't consist entirely of empty files.

files_path = 's3a://my-bucket/obj1/obj2/data'
df = spark.read.parquet(files_path)
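If you can't guarantee that some of the files are non-empty, one hedged workaround is to drop zero-byte objects from the path list before reading. A sketch in plain Python, assuming an `hadoop fs -ls`-style listing where the size is the fifth whitespace-separated column and the path is the last (the rows below are hypothetical, and the column index is an assumption about your listing format):

```python
# Hypothetical listing rows: size at index 4, path in the last column.
listing = [
    '-rw-r--r--   3 user group      0 2017-01-17 08:48 s3a://my-bucket/obj1/empty.parquet\n',
    '-rw-r--r--   3 user group  55300 2017-01-17 08:48 s3a://my-bucket/obj1/part-00000.parquet\n',
]

# Keep only paths whose size column is greater than zero.
non_empty = [row.split()[-1] for row in listing if int(row.split()[4]) > 0]
print(non_empty)  # ['s3a://my-bucket/obj1/part-00000.parquet']
```

The surviving paths could then be joined with commas and handed to spark.read.parquet.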