
I am running a big job, in cluster mode. However, I am only interested in two float numbers, which I want to somehow read when the job succeeds.

Here is what I am trying:

from pyspark.context import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName='foo')

    f = open('foo.txt', 'w')
    pi = 3.14
    not_pi = 2.79 
    f.write(str(pi) + "\n")
    f.write(str(not_pi) + "\n")
    f.close()

    sc.stop()

However, 'foo.txt' doesn't appear to be written anywhere (it probably gets written on an executor, or something). I tried '/homes/gsamaras/foo.txt', which would be the pwd of the gateway, but it says: No such file or directory: '/homes/gsamaras/myfile.txt'.

How can I do that?


EDIT:

import os
import socket

print("Current working dir : %s" % os.getcwd())
print(socket.gethostname())

The printed working directory and hostname suggest that the driver is actually a node of the cluster, which is why I don't see the file on my gateway.

Maybe I should write the file to HDFS somehow?

This won't work either:

Traceback (most recent call last):
  File "computeCostAndUnbalancedFactorkMeans.py", line 15, in <module>
    f = open('hdfs://myfile.txt','w')
IOError: [Errno 2] No such file or directory: 'hdfs://myfile.txt'
gsamaras
  • maybe '/homes/gsamaras/foo.txt', you missed the leading slash. – citaret Sep 03 '16 at 04:41
  • This looks OK and should be written on the driver node. If you're in doubt just log `os.getcwd()` and `socket.gethostname()`. – zero323 Sep 03 '16 at 11:35
  • citaret, typo! :) @zero323 turns out my gateway is not my driver, which is why this wouldn't work... See my edit! – gsamaras Sep 06 '16 at 15:54
  • @zero323 I had to write to HDFS, using a directory present in the cluster. If I were you, I would post an answer, expanding a bit on your comment, mentioning that my code didn't write the file to my local filesystem because it was using a node of the cluster as the driver. That explains why `foo.txt` gets created, but `/homes/gsamaras/foo.txt` doesn't (since there is no such directory on any node of the cluster). – gsamaras Sep 06 '16 at 21:29

1 Answer


At first glance there is nothing particularly wrong with your code (you should use a context manager in a case like this instead of closing the file manually, but that is not the point). If this script is passed to spark-submit, the file will be written to a directory local to the driver code.
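
For reference, a minimal sketch of the same snippet using a context manager (the logic is unchanged, and the relative path is still resolved on whatever machine runs the driver):

from pyspark.context import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName='foo')

    pi = 3.14
    not_pi = 2.79

    # The file is closed automatically when the with-block exits,
    # even if an exception is raised in between.
    with open('foo.txt', 'w') as f:
        f.write(str(pi) + "\n")
        f.write(str(not_pi) + "\n")

    sc.stop()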

If you submit your code in cluster mode, that will be an arbitrary worker node in your cluster. If you're in doubt you can always log os.getcwd() and socket.gethostname() to figure out which machine is used and what the working directory is.
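
If you also want to check where the executors run (as opposed to the driver), a quick sketch along these lines should work; here socket.gethostname() is evaluated on the workers, not on the driver:

import socket

# Collect the distinct hostnames of the machines the tasks actually ran on.
hosts = (sc.parallelize(range(sc.defaultParallelism))
           .map(lambda _: socket.gethostname())
           .distinct()
           .collect())
print(hosts)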

Finally, you cannot use standard Python IO tools to write to HDFS. There are a few tools which can achieve that, including native dask / hdfs3.
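
For example, a sketch with hdfs3 might look like the following (the NameNode host, port, and target path are just placeholders, and the package has to be installed on the driver):

from hdfs3 import HDFileSystem

# Placeholder connection details: replace with your cluster's NameNode host and port.
hdfs = HDFileSystem(host='namenode.example.com', port=8020)

pi, not_pi = 3.14, 2.79

# hdfs3 exposes a file-like API, so the write itself looks familiar;
# files opened in 'wb' mode expect bytes.
with hdfs.open('/user/gsamaras/foo.txt', 'wb') as f:
    f.write(("%s\n%s\n" % (pi, not_pi)).encode('utf-8'))

Alternatively, since the two values already live on the driver, you could let Spark itself do the writing with something along the lines of sc.parallelize([pi, not_pi]).saveAsTextFile("hdfs:///some/path"), keeping in mind that this creates a directory of part files rather than a single file.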

zero323
  • Thanks zero323. BTW, can you take a look and maybe accept my [Topic Request](http://stackoverflow.com/documentation/apache-spark) for Spark? You see, this [answer](http://meta.stackoverflow.com/questions/332668/did-i-just-lose-my-hard-worked-newly-created-topic) has all the things I had written back then, hidden in the edit section! :) – gsamaras Sep 07 '16 at 18:15
  • To be honest I avoid docs if I can, and I don't really have that much power there. I can delete a topic, but to create one I would have to add my own examples. – zero323 Sep 07 '16 at 22:09
  • Did you? OK. So it looks like I can approve. I am not sure why we need Spark docs here at all, but here you are. – zero323 Sep 07 '16 at 22:17