
When I use sc.textFile('*.txt') I read every matching file.

I'd like to be able to exclude several files.

E.g., how can I read all files except ['bar.txt', 'foo.txt']?

    Possible duplicate of [How to use regex to include/exclude some input files in sc.textFile?](http://stackoverflow.com/questions/31782763/how-to-use-regex-to-include-exclude-some-input-files-in-sc-textfile) – Yaron Jan 17 '17 at 08:48

2 Answers


This is more of a workaround:

get file list:

import os

# `hadoop fs -ls` prints a "Found N items" header plus one row per file;
# the path is the last whitespace-separated column, so keep only the name.
lines = os.popen('hadoop fs -ls <your dir>').readlines()
file_list = [os.path.basename(x.split()[-1])
             for x in lines if not x.startswith('Found')]

Filter it:

file_list = [x for x in file_list
             if x not in ['bar.txt', 'foo.txt'] and x.endswith('.txt')]

Read it:

# sc.textFile takes a single string; pass multiple paths comma-separated
rdd = sc.textFile(','.join('<your dir>/' + x for x in file_list))
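The list-then-filter step can be checked locally without a cluster. A minimal sketch in plain Python, using a hypothetical sample of `hadoop fs -ls`-style output (the file names and sizes below are made up for illustration):

```python
import os

# Hypothetical `hadoop fs -ls` output: a header line plus one row per
# file, with the path in the last whitespace-separated column.
ls_output = [
    'Found 3 items\n',
    '-rw-r--r--   3 user group   1024 2017-01-17 08:48 /data/foo.txt\n',
    '-rw-r--r--   3 user group   2048 2017-01-17 08:48 /data/bar.txt\n',
    '-rw-r--r--   3 user group   4096 2017-01-17 08:48 /data/keep.txt\n',
]

# Step 1: extract bare file names, skipping the header line.
names = [os.path.basename(x.split()[-1])
         for x in ls_output if not x.startswith('Found')]

# Step 2: drop the excluded files, keep only .txt files.
kept = [x for x in names
        if x not in ['bar.txt', 'foo.txt'] and x.endswith('.txt')]
print(kept)  # ['keep.txt']
```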

PySpark will skip empty Parquet files when reading multiple files from S3 via the S3A connector. The only condition is that at least some of the files are non-empty; the input can't consist entirely of empty files.

files_path = 's3a://my-bucket/obj1/obj2/data'
df = spark.read.parquet(files_path)
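If you can't guarantee that some of the files are non-empty, one hedged workaround is to drop zero-byte objects from the path list before reading. A sketch in plain Python, assuming an `hadoop fs -ls`-style listing where the size is the fifth whitespace-separated column and the path is the last (the rows below are hypothetical, and the column index is an assumption about your listing format):

```python
# Hypothetical listing rows: size at index 4, path in the last column.
listing = [
    '-rw-r--r--   3 user group      0 2017-01-17 08:48 s3a://my-bucket/obj1/empty.parquet\n',
    '-rw-r--r--   3 user group  55300 2017-01-17 08:48 s3a://my-bucket/obj1/part-00000.parquet\n',
]

# Keep only paths whose size column is greater than zero.
non_empty = [row.split()[-1] for row in listing if int(row.split()[4]) > 0]
print(non_empty)  # ['s3a://my-bucket/obj1/part-00000.parquet']
```

The surviving paths could then be joined with commas and handed to spark.read.parquet.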