I know I can do this:
data = sc.textFile('/hadoop_foo/a')
data.count()
240
data = sc.textFile('/hadoop_foo/*')
data.count()
168129
However, I would like to count the data in every subdirectory of "/hadoop_foo/" separately. Can I do that?
In other words, what I want is something like this:
subdirectories = magicFunction()
for subdir in subdirectories:
    data = sc.textFile(subdir)
    data.count()
I tried with:
In [9]: [x[0] for x in os.walk("/hadoop_foo/")]
Out[9]: []
but I think that fails because it searches the local filesystem of the driver (the gateway, in this case), while "/hadoop_foo/" lives in HDFS. The same happens with "hdfs:///hadoop_foo/".
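Would something along these lines be the right approach instead? This is only a sketch going through Spark's JVM gateway to the Hadoop FileSystem API; sc._jvm and sc._jsc are internal attributes, so I am not sure it is the intended way:

# Sketch: list the immediate subdirectories of /hadoop_foo/ on HDFS via the
# Hadoop FileSystem API exposed through Spark's JVM gateway.
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
statuses = fs.listStatus(hadoop.fs.Path('/hadoop_foo/'))
# isDirectory() may be isDir() on older Hadoop versions.
subdirectories = [str(s.getPath()) for s in statuses if s.isDirectory()]

for subdir in subdirectories:
    data = sc.textFile(subdir)
    print(subdir, data.count())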
After reading How can I list subdirectories recursively for HDFS?, I am wondering if there is a way to execute:
hadoop dfs -lsr /hadoop_foo/
from within my code?
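For example, would shelling out to the Hadoop client from the driver and parsing its output be acceptable? A rough sketch (assuming the hadoop binary is on the driver's PATH; I use the non-recursive "hadoop fs -ls" here since I only need the immediate subdirectories):

import subprocess

# Sketch: run the Hadoop CLI from the driver and keep only the directory
# entries, i.e. lines whose permission string starts with 'd'.
out = subprocess.check_output(['hadoop', 'fs', '-ls', '/hadoop_foo/'])
subdirectories = [line.split()[-1]
                  for line in out.decode().splitlines()
                  if line.startswith('d')]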
From Correct way of writing two floats into a regular txt:
In [28]: os.getcwd()
Out[28]: '/homes/gsamaras' <-- which is my local directory