
I know I can do this:

data = sc.textFile('/hadoop_foo/a')
data.count()
240
data = sc.textFile('/hadoop_foo/*')
data.count()
168129

However, I would like to count the data of every subdirectory of "/hadoop_foo/" separately. Can I do that?

In other words, what I want is something like this:

subdirectories = magicFunction()
for subdir in subdirectories:
  data = sc.textFile(subdir)
  data.count()

I tried with:

In [9]: [x[0] for x in os.walk("/hadoop_foo/")]
Out[9]: []

but I think that fails, because it searches the local directory of the driver (the gateway in that case), while "/hadoop_foo/" lies in HDFS. Same for "hdfs:///hadoop_foo/".


After reading How can I list subdirectories recursively for HDFS?, I am wondering if there is a way to execute:

hadoop dfs -lsr /hadoop_foo/

in code..


From Correct way of writing two floats into a regular txt:

In [28]: os.getcwd()
Out[28]: '/homes/gsamaras'  <-- which is my local directory

1 Answer


With Python, use the hdfs module; its walk() method can get you the list of files.

The code should look something like this:

from hdfs import InsecureClient

# connect to the namenode's WebHDFS endpoint
client = InsecureClient('http://host:port', user='user')

# walk() behaves like os.walk(): it yields (path, dirs, files) tuples;
# depth=0 means full recursion, status=True also returns each FileStatus
for stuff in client.walk('/hadoop_foo/', depth=0, status=True):
    ...
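
For the question's use case (only the top-level subdirectories), the module's list() call may be simpler than walk(). A small sketch, assuming the same client as above and that the returned FileStatus dicts carry the usual WebHDFS 'type' field:

# list the immediate children of /hadoop_foo/ together with their status,
# and keep only the entries that are directories
subdirs = ['/hadoop_foo/' + name
           for name, status in client.list('/hadoop_foo/', status=True)
           if status['type'] == 'DIRECTORY']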

With Scala you can get the filesystem (val fs = FileSystem.get(new Configuration())) and call its listFiles(path, recursive) method, see https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/fs/FileSystem.html#listFiles(org.apache.hadoop.fs.Path, boolean)
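
The same Hadoop FileSystem API can also be reached from PySpark through the py4j gateway. A sketch, assuming a running SparkContext sc (note that sc._jvm and sc._jsc are internal, non-public accessors):

# reach the JVM-side Hadoop FileSystem API through PySpark's py4j gateway
Path = sc._jvm.org.apache.hadoop.fs.Path
FileSystem = sc._jvm.org.apache.hadoop.fs.FileSystem
fs = FileSystem.get(sc._jsc.hadoopConfiguration())

# listStatus() returns the direct children of /hadoop_foo/;
# keep only the entries that are directories
subdirectories = [status.getPath().toString()
                  for status in fs.listStatus(Path('/hadoop_foo/'))
                  if status.isDirectory()]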

You can also execute a shell command from your script with the subprocess module, but this is generally not a recommended approach, since you then depend on the text output of a shell utility.


Eventually, what worked for the OP was using subprocess.check_output():

import subprocess

subdirectories = subprocess.check_output(["hadoop", "fs", "-ls", "/hadoop_foo/"])
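
From there, a sketch of how that output could be turned into the loop from the question (assuming an existing SparkContext sc; the exact -ls output format varies slightly between Hadoop versions, so the path is taken as the last whitespace-separated field of each directory line):

for line in subdirectories.decode().splitlines():
    # directory entries of -ls start with a 'd' permission flag, e.g.
    # "drwxr-xr-x   - user group          0 2016-09-10 00:00 /hadoop_foo/a"
    if line.startswith("d"):
        subdir = line.split()[-1]      # the path is the last field
        data = sc.textFile(subdir)
        print(subdir, data.count())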
  • Is this [tag:java] or [tag:python]? I am using [tag:python]. Also it's not clear what follows "run", a link? I mean, what exactly should I do in my code? `subprocess.call(["/hadoop_foo/"])` says it cannot find main class `dfs`. Same with `fs` in place of `dfs`. I also tried "hadoop dfs" in the 1st part, and got `OSError: [Errno 2] No such file or directory` – gsamaras Sep 10 '16 at 00:08
  • Edited the answer to provide python way of getting files recursively. – patrungel Sep 10 '16 at 00:45
  • I saw it, thank you. `No module named hdfs`, I am doomed.. '_', but +1 for providing options! If you want upvote the question to bring more people here, that may be able to help! :) – gsamaras Sep 10 '16 at 00:51
  • re os.subprocess.call() ; can you try hdfs dfs -lsr command, not hadoop dfs? subprocess.call(["hdfs","dfs","-lsr", directory]) – patrungel Sep 10 '16 at 01:01
  • WOW, it actually listed the directories! However, this also listed the part-xxxxx files of Spark, i.e. "/hadoop_foo/a/part-00000" and so on; any idea how to get only "/hadoop_foo/a/", "/hadoop_foo/b/" and so on? If not that's okay, just let me know though! :) Oh, just remove the `r`! ;) Update the answer so that I can accept! :D But do you know how I can use them? If I do `subdirectories = subprocess.call(["hdfs","dfs","-ls", /hadoop_foo/])`, then `subdirectories` will be empty, and the subdirectories will be printed instead of getting assigned into the list.. – gsamaras Sep 10 '16 at 01:04
  • No "directories only" options for hdfs dfs -lsr; when in shell people grep by 'drwx' afterwards; with python, I would grab all lines and then filter. If you can install python's hdfs module; there's more luck with it (walk() function returns iterator on paths among others). – patrungel Sep 10 '16 at 01:16
  • I cannot install anything, no rights at all. As I updated, you can do that with `-ls`, instead of `-lsr`. So now the only thing is how to store that output and then loop over it, so that I can read the RDD in every loop and count it! – gsamaras Sep 10 '16 at 01:18
  • subprocess.check_output() to have stdout returned. – patrungel Sep 10 '16 at 01:19