
I am running Python code that counts the files present in a directory:

hadoop fs -count /user/a909983/sample_data/ | awk '{print $2}'

This correctly returns 0 on the Linux command line, since the directory is empty. However, when I run it from a Python script it returns 1. The line of code in Python is:

directoryEmptyStatusCommand = subprocess.call(
["hadoop", "fs", "-count", "/user/a909983/sample_data/", "|", "awk '{print $2}'"])

How can I correct this, or what am I missing? I have also tried using Popen, but the result is the same.

tarun kumar Sharma
  • If you want to use the pipe function `|` of the shell, you need to run with `shell=True` as option. In that case you should use a string, rather than a list for the command. However, it would be better to split this command in two subprocesses, for `hadoop` and `awk` respectively and then pipe the data through Python. – JohanL Oct 08 '18 at 15:46
  • @tarun, please look at the answer below; if it helps, accept it so the question leaves the unanswered queue. – Karn Kumar Oct 08 '18 at 16:17
  • @pygo It would be even better if you just remove your answer, since it is a duplicate anyway. – JohanL Oct 08 '18 at 18:10

1 Answer


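Use subprocess.Popen and avoid the shell pipe |: a pipe only works with shell=True, which is a security risk. Instead, start the first command with stdout=subprocess.PIPE and pass that stream as stdin to the second command, run through subprocess.check_output. Note also that subprocess.call returns the command's exit status, not its output.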

So, you can try something like:

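import subprocess
command = subprocess.Popen(("hadoop", "fs", "-count", "/user/a909983/sample_data/"), stdout=subprocess.PIPE)
# awk and its program text must be separate arguments; no shell quoting is needed
output = subprocess.check_output(("awk", "{print $2}"), stdin=command.stdout)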
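As an alternative to piping into awk, the second column can be picked out in Python itself. The following is only a sketch: the helper name second_field is hypothetical, and it assumes the first output line has the shape DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME, so that the second field is the file count. A stand-in command is used so the sketch can be tried without a Hadoop client.

```python
import subprocess
import sys


def second_field(cmd):
    """Run cmd (a list of arguments, no shell) and return the second
    whitespace-separated field of its first output line as an int."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
    first_line = result.stdout.decode().splitlines()[0]
    return int(first_line.split()[1])


# Intended use (assumes a Hadoop client on PATH):
# file_count = second_field(["hadoop", "fs", "-count", "/user/a909983/sample_data/"])

# Stand-in command (an assumption for illustration) that prints one line
# shaped like the assumed count output.
fake_cmd = [sys.executable, "-c", "print('1 0 0 /user/a909983/sample_data')"]
print(second_field(fake_cmd))
```

Because check=True is set, a non-zero exit status raises subprocess.CalledProcessError instead of being silently mistaken for a count.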

If you do want to run it as a shell command by enabling shell=True:

cmd = "hadoop fs -count /user/a909983/sample_data/ | awk '{print $2}'"
command = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
output = command.communicate()[0]
print(output)
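As background on the 1 seen in the question: subprocess.call returns the exit status of the command rather than its output, and without a shell the "|" element is handed to the program as an ordinary argument. The sketch below illustrates both points with a stand-in command built from the running Python interpreter (an assumption made so it works without Hadoop); it does not claim to reproduce the exact Hadoop behaviour.

```python
import subprocess
import sys

# Stand-in for any command: prints "0" and exits with status 1.
cmd = [sys.executable, "-c", "import sys; print(0); sys.exit(1)"]

# call() returns only the exit status; the output is discarded here.
status = subprocess.call(cmd, stdout=subprocess.DEVNULL)

# run() with a pipe captures what the command printed, as bytes.
proc = subprocess.run(cmd, stdout=subprocess.PIPE)

# Without shell=True, "|" and the awk text are just literal arguments.
literal_args = subprocess.check_output(
    [sys.executable, "-c", "import sys; print(sys.argv[1:])", "|", "awk '{print $2}'"]
)

print(status)
print(proc.stdout.decode().strip())
print(literal_args.decode().strip())
```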
Karn Kumar
  • If the filename comes from a variable (which it probably does!), just setting `shell=True` isn't great practice without moving that variable out-of-band from content parsed as code. – Charles Duffy Oct 08 '18 at 16:20
  • Consider `subprocess.Popen(['''hadoop fs -count "$1" | awk '{print $2}' ''', '_', '/user/a909983/sample_data/'], shell=True)`, keeping your data -- the filename -- out-of-band from your code. – Charles Duffy Oct 08 '18 at 16:21
  • @CharlesDuffy, agreed, `shell=True` is indeed not recommended at all, as it opens up a security hole and makes a program vulnerable to shell injection; as a security expert you know it better ;-) – Karn Kumar Oct 08 '18 at 16:24
  • Well -- point I'm making is that one *can* use `shell=True` safely, if the string passed as the first element of `cmd` is a constant that was carefully audited by a human, and all elements that could vary are kept out-of-band (and there hasn't been any meddling with environment variables). But yes, avoiding it altogether is indeed the best approach. :) – Charles Duffy Oct 08 '18 at 16:33
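The out-of-band pattern described in the comments above can be sketched as follows. This is an illustration under assumptions, not the commenter's exact code: it needs a POSIX system, the helper name run_with_path is hypothetical, and the constant script uses printf and tr as stand-ins for hadoop and awk so that it can be tried anywhere.

```python
import subprocess

# Constant shell script: the only text the shell parses as code.
# Stand-in for: hadoop fs -count "$1" | awk '{print $2}'
SCRIPT = r"""printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]'"""


def run_with_path(path):
    # With shell=True and a list, Python on POSIX runs: /bin/sh -c SCRIPT _ path
    # so "_" becomes $0 and path becomes $1, which is expanded as data only.
    return subprocess.check_output([SCRIPT, "_", path], shell=True).decode()


# A value that would inject a command if it were pasted into the script text.
hostile = "/tmp/x; echo injected"
print(run_with_path(hostile))
```

The hostile value comes back upper-cased as a single line, showing that the shell treated it as an argument rather than executing the part after the semicolon.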