When I type
hadoop fs -text /foo/bar/baz.bz2 2>err 1>out
I get two non-empty files: err, containing
2015-05-26 15:33:49,786 INFO [main] bzip2.Bzip2Factory (Bzip2Factory.java:isNativeBzip2Loaded(70)) - Successfully loaded & initialized native-bzip2 library system-native
2015-05-26 15:33:49,789 INFO [main] compress.CodecPool (CodecPool.java:getDecompressor(179)) - Got brand-new decompressor [.bz2]
and out, containing the content of the file (as expected).
When I call the same command from Python (2.6):
from subprocess import Popen
with open("out","w") as out:
    with open("err","w") as err:
        p = Popen(['hadoop','fs','-text',"/foo/bar/baz.bz2"],
                  stdin=None, stdout=out, stderr=err)
        print p.wait()
I get the exact same (correct) behavior.
However, when I run the same code under PySpark (or using spark-submit), I get an empty err file, and the out file starts with the log messages above (followed by the actual data).
NB: the intent of the Python code is to feed the output of hadoop fs -text to another program (i.e., to pass stdout=PIPE to Popen), so please do not suggest hadoop fs -get. Thanks.
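To make the intent concrete, here is a minimal sketch of the pipeline I am after; the consumer command and the helper name are purely illustrative, and a printf stand-in replaces hadoop fs -text so the sketch is self-contained:

```python
from subprocess import Popen, PIPE

def pipe_through(producer_cmd, consumer_cmd):
    # Start the producer with its stdout as a pipe, then hand that pipe to
    # the consumer's stdin. In the real use case producer_cmd would be
    # ['hadoop', 'fs', '-text', '/foo/bar/baz.bz2'] and consumer_cmd the
    # other program; the names here are illustrative.
    producer = Popen(producer_cmd, stdout=PIPE)
    consumer = Popen(consumer_cmd, stdin=producer.stdout, stdout=PIPE)
    producer.stdout.close()  # let the producer see SIGPIPE if the consumer exits early
    result = consumer.communicate()[0]
    producer.wait()
    return result

# Stand-in demo: printf plays the role of hadoop fs -text.
n_lines = pipe_through(['printf', 'a\nb\nc\n'], ['wc', '-l'])
```

The stderr-into-stdout merge described above would corrupt exactly this kind of pipeline, because the consumer would receive the log lines as data.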
PS. When I run hadoop under time:
from subprocess import Popen
with open("out","w") as out:
    with open("err","w") as err:
        p = Popen(['/usr/bin/time','hadoop','fs','-text',"/foo/bar/baz.bz2"],
                  stdin=None, stdout=out, stderr=err)
        print p.wait()
the time output correctly goes to err, but the hadoop logs incorrectly go to out.
I.e., hadoop merges its stderr into its stdout when it runs under Spark.
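For what it's worth, the Popen wiring itself can be sanity-checked without Hadoop by substituting a shell child that writes one line to each stream (sh -c here is a stand-in for hadoop):

```python
from subprocess import Popen

# Each line should land in the file bound to its stream: "to-stdout" in
# out, "to-stderr" in err. This mirrors the redirection used above.
with open("out", "w") as out:
    with open("err", "w") as err:
        p = Popen(['sh', '-c', 'echo to-stdout; echo to-stderr >&2'],
                  stdin=None, stdout=out, stderr=err)
        p.wait()
```

Under plain Python this behaves as expected, which suggests the merge happens inside the hadoop launcher (or its environment) rather than in Popen.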