Is there a way to list directories using PySpark in a notebook?

Question

I'm trying to see every file in a certain directory, but since each file in the directory is very large, I can't use sc.wholeTextFiles or sc.textFile. I just want to get the filenames, and then pull a file if needed in a different cell. I can access the files just fine using Cyberduck, and it shows the names there.

Ex: I have the link for one set of data at "name:///mainfolder/date/sectionsofdate/indiviual_files.gz", and it works, but I want to see the names of the files in "/mainfolder/date" and in "/mainfolder/date/sectionsofdate" without having to load them all via sc.textFile or sc.wholeTextFiles. Both of those functions work, so I know my keys are correct, but the files take too long to load.

Please see "[How to Ask](https://stackoverflow.com/help/how-to-ask)", "[Stack Overflow question checklist](https://meta.stackoverflow.com/questions/260648)" and all their linked pages along with "[How To Ask Questions The Smart Way](http://catb.org/esr/faqs/smart-questions.html)" – the Tin Man, Jun 29 '20 at 05:19
@Saurabh I am pretty positive it is an Amazon S3 bucket, but the documentation on their website didn't help with my problem, as I kept getting "incorrect credentials". I know my access keys are right and I can pull the data from the one link I have, but I want an easy way to find the names of the rest. – Gurman Mann, Jun 29 '20 at 06:11

score 0 · Answer 1 · answered Jun 28 '20 at 12:18

Since the list of files can be retrieved by a single node, you can just list the files in the directory from the driver instead of reading them through Spark. Look at this response.
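As a rough sketch of what listing from the driver could look like: the snippet below goes through the Hadoop `FileSystem` API that PySpark exposes via py4j, so only metadata is fetched and no file contents are read. The helper name `list_directory` is made up for illustration, and `sc._jvm` / `sc._jsc` are internal PySpark attributes rather than a public API, so treat this as an assumption-laden example, not the one correct way. It also only works if the Hadoop connector for your URI scheme is on the classpath; a "No FileSystem for scheme" error usually means it is not.

```python
def list_directory(sc, uri):
    """Return (path, is_directory) pairs for the immediate children of `uri`.

    `sc` is an existing SparkContext (notebooks usually provide one as `sc`).
    Only file metadata is requested; no file contents are loaded.
    """
    # NOTE: _jvm and _jsc are PySpark internals (assumption: they stay available).
    jvm = sc._jvm
    hadoop_conf = sc._jsc.hadoopConfiguration()

    # Build a Hadoop Path and resolve the FileSystem that handles its scheme.
    path = jvm.org.apache.hadoop.fs.Path(uri)
    fs = path.getFileSystem(hadoop_conf)

    # listStatus returns one FileStatus per direct child (not recursive).
    return [
        (str(status.getPath().toString()), bool(status.isDirectory()))
        for status in fs.listStatus(path)
    ]
```

In a notebook cell you would call it as `list_directory(sc, "name:///mainfolder/date")` and then pass any single returned path to `sc.textFile` in a later cell.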

wholeTextFiles returns (path, content) tuples, but I don't know whether the file content is loaded lazily, i.e. whether you could take only the first element of each tuple without reading the files.

Daniel Argüelles

That solution didn't work for me. It kept giving me errors saying "no Filesystem for scheme". I have a valid link to the database, but I want to look at the other directories like I've been able to in Cyberduck – Gurman Mann Jun 28 '20 at 22:36
