
I want to walk through a given HDFS path recursively in PySpark without using hadoop fs -ls [path]. I tried the solution suggested here, but found that listStatus() only returns the status of the first sub-directory in the given path. According to this documentation, listStatus() should return "the statuses of the files/directories in the given path if the path is a directory." What am I missing?
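For reference, this is roughly how I'm calling it (a minimal sketch; the path is a placeholder, and the spark._jvm route is Spark's internal Py4J gateway rather than a public API):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reach the Hadoop FileSystem API through Spark's internal Py4J gateway.
hadoop = spark._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

# Placeholder path. Per the docs, listStatus() should return the statuses
# of all files/directories directly under it when it is a directory.
for status in fs.listStatus(hadoop.fs.Path("/some/hdfs/path")):
    print(status.getPath().toString())
```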

I'm using Hadoop 2.9.2, Spark 2.3.2 and Python 2.7.

largecats

1 Answer


I couldn't exactly recreate the scenario, but I think it has something to do with the fact that if a path is not a directory, listStatus() on that path will return a list of length 1 containing only the status of that path itself.
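Here is a minimal sketch of a recursive walk that accounts for this, reusing the fs handle and hadoop module from the question; isDirectory() decides whether to recurse into a sub-directory or yield the file itself:

```python
def walk(fs, path):
    """Recursively yield the full path of every file under path."""
    for status in fs.listStatus(path):
        if status.isDirectory():
            # Recurse: status.getPath() is an org.apache.hadoop.fs.Path,
            # which listStatus() accepts directly.
            for p in walk(fs, status.getPath()):
                yield p
        else:
            # A plain file: calling listStatus() on it would return only
            # its own status (the length-1 list above), so yield it here.
            yield status.getPath().toString()

# Example usage, with the handles from the question:
# for f in walk(fs, hadoop.fs.Path("/some/hdfs/path")):
#     print(f)
```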

largecats