I have a script that reads text line by line, modifies each line slightly, and then writes the line to a file. Reading the text works fine; the problem is that I cannot write it out. Here is my code.

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/user/test/myfile.txt"], stdout=subprocess.PIPE)
for line in cat.stdout:
    line = line+"Blah";
    subprocess.Popen(["hadoop", "fs", "-put", "/user/test/moddedfile.txt"], stdin=line)

This is the error I am getting.

AttributeError: 'str' object has no attribute 'fileno'
cat: Unable to write to output stream.
jfs
user2254180
  • `hadoop fs -put` copies files from the local file system into HDFS. It won't work as expected in your code. – emcpow2 Mar 12 '14 at 11:24
  • See this: http://stackoverflow.com/questions/163542/python-how-do-i-pass-a-string-into-subprocess-popen-using-the-stdin-argument – juniper- Mar 12 '14 at 11:27
  • Hi Eduard, is there an alternative way that lets me create a file in HDFS from Python? Could I use touchz to create the file, then pipe input to it? – user2254180 Mar 12 '14 at 11:29

2 Answers

The `stdin` argument does not accept a string. It should be `PIPE`, `None`, an existing file object (something with a valid `.fileno()`), or an integer file descriptor.

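from subprocess import Popen, PIPE

# Stream the source file out of HDFS.
cat = Popen(["hadoop", "fs", "-cat", "/user/test/myfile.txt"],
            stdout=PIPE, bufsize=-1)
# "-" tells `hadoop fs -put` to read the data from stdin.
put = Popen(["hadoop", "fs", "-put", "-", "/user/test/moddedfile.txt"],
            stdin=PIPE, bufsize=-1)
for line in cat.stdout:
    # Pipes carry bytes on Python 3; the line still ends with its newline here.
    line += b"Blah"
    put.stdin.write(line)

cat.stdout.close()
cat.wait()
put.stdin.close()  # signal end of input so that put can finish
put.wait()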
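
The rule about `stdin` can be tried without a Hadoop installation. The following is only a minimal sketch: `CHILD` is a hypothetical stand-in for `hadoop fs -put - <dest>` (a small Python child that upper-cases whatever arrives on its stdin), and the helper names `feed_via_pipe` and `string_as_stdin_fails` are made up for illustration.

```python
import subprocess
import sys

# Hypothetical stand-in for `hadoop fs -put - <dest>`:
# a child process that upper-cases everything it reads from stdin.
CHILD = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]


def feed_via_pipe(text):
    """Send text to the child through a pipe and return what it prints."""
    proc = subprocess.Popen(CHILD, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    # communicate() writes the bytes, closes stdin, and waits for the child.
    out, _ = proc.communicate(text.encode("utf-8"))
    return out.decode("utf-8")


def string_as_stdin_fails():
    """Return True if passing a plain string as stdin raises AttributeError."""
    try:
        subprocess.Popen(CHILD, stdin="some text")
    except AttributeError:
        # A str has no .fileno(), which is what the question's traceback shows.
        return True
    return False
```

Calling `feed_via_pipe("some text")` delivers the data through the pipe, while `string_as_stdin_fails()` reproduces the `AttributeError` from the question.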
jfs
A quick and dirty way to make your code work:

import subprocess
from tempfile import NamedTemporaryFile

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/user/test/myfile.txt"],
                       stdout=subprocess.PIPE)

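with NamedTemporaryFile() as f:
    # Collect the modified lines in a local temporary file first.
    for line in cat.stdout:
        f.write(line + b'Blah')  # the temporary file is opened in binary mode

    f.flush()  # push buffered data to disk before another process reads it
    f.seek(0)

    cat.wait()

    # Upload the temporary file to HDFS while it still exists.
    put = subprocess.Popen(["hadoop", "fs", "-put", f.name,  "/user/test/moddedfile.txt"],
                           stdin=f)
    put.wait()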
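
The temporary-file idea can also be tried locally. This is a rough sketch under assumptions, not a Hadoop recipe: `READ_BY_NAME` is a hypothetical stand-in for `hadoop fs -put <local> <dest>` that just prints the file named on its command line, `put_via_tempfile` is an illustrative name, and reopening the temporary file by name while it is still open is assumed to work (it does on Unix-like systems). Only the file name is handed to the child; no `stdin` argument is set.

```python
import subprocess
import sys
from tempfile import NamedTemporaryFile

# Hypothetical stand-in for `hadoop fs -put <local> <dest>`:
# a child process that prints the contents of the file named in argv[1].
READ_BY_NAME = [sys.executable, "-c",
                "import sys; sys.stdout.write(open(sys.argv[1]).read())"]


def put_via_tempfile(lines, suffix=b"Blah"):
    """Write modified lines to a temp file, then hand its name to the child."""
    with NamedTemporaryFile() as f:
        for line in lines:
            f.write(line + suffix)
        f.flush()  # make sure the child sees all the data on disk
        # Pass the file name only; the child opens the file itself.
        return subprocess.check_output(READ_BY_NAME + [f.name])
```

The child must run inside the `with` block, because the temporary file is deleted as soon as the block exits.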

However, I suggest you look at the HDFS/WebHDFS Python libraries, for example pywebhdfs.

emcpow2
  • You shouldn't provide both the filename and `stdin` parameter set to the same file. [If you use `stdin` then set the filename to `-`](http://stackoverflow.com/a/22354776/4279). – jfs Mar 12 '14 at 14:37