
I'm trying to see every file in a certain directory, but since each file in the directory is very large, I can't use sc.wholeTextFiles or sc.textFile. I just want to get the filenames, and then pull a file if needed in a different cell. I can access the files just fine using Cyberduck, and it shows the names there.

Ex: I have the link for one set of data at "name:///mainfolder/date/sectionsofdate/indiviual_files.gz", and it works, but I want to see the names of the files in "/mainfolder/date" and in "/mainfolder/date/sectionsofdate" without having to load them all via sc.textFile or sc.wholeTextFiles. Both of those functions work, so I know my keys are correct, but loading the files takes too long.

  • Do you want to list the files from the `amazon-s3` bucket? – Saurabh Jun 29 '20 at 05:18
  • Please see "[ask]", "[Stack Overflow question checklist](https://meta.stackoverflow.com/questions/260648)" and all their linked pages along with "[How To Ask Questions The Smart Way](http://catb.org/esr/faqs/smart-questions.html)" – the Tin Man Jun 29 '20 at 05:19
  • @Saurabh I am pretty positive it is an amazon-s3 bucket, but the documentation on their website wasn't helpful for my problem, as it kept saying "incorrect credentials". I know my access keys are right and I can pull the data from the one link I have, but I want an easy way to find the names of the rest. – Gurman Mann Jun 29 '20 at 06:11

1 Answer


Since the list of files can be retrieved by a single node, you can just list the files in the directory. Look at this response.

wholeTextFiles returns tuples of (path, content), but I don't know whether the file content is loaded lazily, which would let you take only the first element of each tuple.
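As a rough sketch of listing from a single node: one option in PySpark is to call the Hadoop FileSystem API on the driver through the JVM gateway that the SparkContext already holds. The helper name `list_file_names` and the bucket in the usage comment are hypothetical, and `sc._jvm` and `sc._jsc` are internal PySpark attributes, so this may need adjusting for your Spark version.

```python
def list_file_names(sc, uri):
    """Return the paths of the entries directly under `uri`.

    Only directory metadata is requested, on the driver; no file
    contents are read. `sc` is an existing SparkContext.
    """
    # Reach the Hadoop classes through PySpark's (internal) JVM gateway.
    hadoop_fs = sc._jvm.org.apache.hadoop.fs
    path = hadoop_fs.Path(uri)
    # Resolve the filesystem from the URI scheme, using the same Hadoop
    # configuration (and therefore the same credentials) as sc.textFile.
    fs = path.getFileSystem(sc._jsc.hadoopConfiguration())
    # listStatus is not recursive: it returns one entry per child.
    return [status.getPath().toString() for status in fs.listStatus(path)]


# Example use in a notebook cell where `sc` already exists
# (the bucket name is hypothetical):
# for name in list_file_names(sc, "s3a://my-bucket/mainfolder/date"):
#     print(name)
```

The URI should use the same scheme that already works with sc.textFile, since Hadoop picks the filesystem implementation from the scheme. Calling the helper again on one of the returned directory paths lists the next level down.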

Daniel Argüelles
  • That solution didn't work for me. It kept giving me errors saying "no Filesystem for scheme". I have a valid link to the database, but I want to look at the other directories like I've been able to in Cyberduck – Gurman Mann Jun 28 '20 at 22:36