
I have a Python script that runs a shell command to copy files from the local filesystem to HDFS.

import os
import logging
import subprocess


filePath = "/tmp"
keyword = "BC10^Dummy-Segment"
for root, dirs, files in os.walk(filePath):
    for file in files:
        if keyword in file:
            subprocess.call(["hadoop fs -copyFromLocal /tmp/BC10%5EDummy-Segment* /user/app"], shell=True)
            subprocess.call(["hadoop fs -rm /tmp/BC10%5EDummy-Segment*"], shell=True)

I'm seeing this error:

copyFromLocal: `/tmp/BC10^Dummy-Segment*': No such file or directory
rm: `/tmp/BC10^Dummy-Segment_2019': No such file or directory

Updated code:

import glob
import subprocess
import os
from urllib import urlencode, quote_plus

filePath = "/tmp"
keyword = "BC10^Dummy-Segment"

wildcard = os.path.join(filePath, '{0}*'.format(keyword))
print(wildcard)
files = [urlencode(x, quote_via=quote_plus) for x in  glob.glob(wildcard)]
subprocess.check_call(["hadoop", "fs", "-copyFromLocal"] + files + ["/user/app"])
#subprocess.check_call(["hadoop", "fs", "-rm"] + files)

Seeing error when I run:

Traceback (most recent call last):
  File "ming.py", line 11, in <module>
    files = [urlencode(x, quote_via=quote_plus) for x in  glob.glob(wildcard)]
TypeError: urlencode() got an unexpected keyword argument 'quote_via'
Hua Cha
  • The real file will have ```BC10^Dummy-Segment``` followed with a timestamp so I wanted to fetch all the files beginning with this keyword. – Hua Cha Sep 11 '19 at 15:58
  • Are you sure that these files exist? Because this might simply be caused by the * symbol matching no files at all. – rje Sep 11 '19 at 16:00
  • When you `shell=True` there is no need to pass an array, passing the command as a string would suffice – geckos Sep 11 '19 at 16:03
  • Could you post a file name (full path) that you know it's there? – CristiFati Sep 11 '19 at 16:05
  • You are running rm in a loop; it will remove the files on the first iteration and fail on the next. For your use case you don't need to loop; in fact, a simple shell script with those two commands would suffice – geckos Sep 11 '19 at 16:08
  • @geckos, well, there's good reason to use a list with `shell=True`, but only if someone is Doing The Right Thing and passing data out-of-band from code. `subprocess.call(['hadoop fs -copyFromLocal "$1"* "$2"', '_', os.path.join(filePath, keyword), filePath], shell=True)` is an example of what that might look like; the code in the first array element can refer to literal data passed in subsequent ones. – Charles Duffy Sep 11 '19 at 16:09
  • @HuaCha, `%5E` being unescaped to `^` is a Hadoop thing, not a Python thing or a shell thing. The tags you used aren't appropriate for getting assistance on this matter. – Charles Duffy Sep 11 '19 at 16:11
  • Right -- I'm not saying that it's *wrong* to use `%5E`, just that it's a Hadoop thing, so when you use Python and shell tags, the people here can't really help. – Charles Duffy Sep 11 '19 at 16:13
  • @HuaCha, ...by contrast, if you're asking about the backtick before the string and the forward-quote after it, that's a standard-ish way to escape strings in error messages that UNIX-y tools have been conventionally doing for a very long time; it doesn't have anything to do with your actual code's behavior. – Charles Duffy Sep 11 '19 at 16:14
  • If you want to see what the shell is actually passing to the `hadoop` command, by the way, try putting `set -x;` at the front of the command you're passing to `subprocess.call()` -- you can compare that to what the same command does when prefixed by `set -x;` at an interactive command line to see if there's anything surprising going on. – Charles Duffy Sep 11 '19 at 16:16
  • @CristiFati ```/tmp/BC10^Dummy-Segment_2019``` This is the file name – Hua Cha Sep 11 '19 at 16:16
  • @geckos Wouldn't the copyFromLocal move all the files with this keyword first and then the rm would run? Or could I just place the rm outside of the loop? – Hua Cha Sep 11 '19 at 16:18
  • I don't really know Hadoop. Does `hadoop fs -rm /tmp/BC10%5EDummy-Segment*` remove all files prefixed by `/tmp/BC10%5EDummy-Segment`? If so, you can't run this twice, right? It will remove the files on the first run and then fail on the next. – geckos Sep 11 '19 at 16:22
  • The point is that you're looping over the files, but you don't use `root, dirs, file` in the loop body, so why loop at all? – geckos Sep 11 '19 at 16:23
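
The out-of-band idea from Charles Duffy's comment can be tried without Hadoop. The following is only a sketch: `printf` stands in for `hadoop fs -copyFromLocal`, and the scratch directory and file are made up for the demonstration. It assumes a POSIX `/bin/sh`; `Popen` is used instead of `check_output` so it also runs on Python 2.6.

```python
import os
import subprocess
import tempfile

# Hypothetical scratch directory and file, standing in for /tmp.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "BC10^Dummy-Segment_2019")
open(path, "w").close()

# The prefix is data; it is never spliced into the script text.
prefix = os.path.join(workdir, "BC10^Dummy-Segment")

# With shell=True and a list, the first element is the script and the
# remaining elements become $0, $1, ... for that script.
# printf is a stand-in for the real hadoop command.
script = r'printf "%s\n" "$1"*'
proc = subprocess.Popen([script, "_", prefix], shell=True,
                        stdout=subprocess.PIPE)
out, _ = proc.communicate()
listed = out.decode().splitlines()
print(listed)
```

The shell expands the `*` after the quoted `"$1"`, so the wildcard still works while the `^` in the prefix is passed through untouched.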

1 Answer

I'm guessing you are URL-encoding the path to pass it properly to Hadoop, but in doing so you basically hide it from the shell. There really are no files matching the wildcard /tmp/BC10%5EDummy-Segment* where % etc are literal characters.
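
A quick way to see this without Hadoop, using Python's `glob` as a stand-in for the shell's wildcard expansion (the scratch directory and file below are hypothetical, created only for the demonstration):

```python
import glob
import os
import tempfile

# Hypothetical scratch directory standing in for /tmp.
workdir = tempfile.mkdtemp()
name = "BC10^Dummy-Segment_2019"
open(os.path.join(workdir, name), "w").close()

# The literal pattern matches the file on disk.
literal_matches = glob.glob(os.path.join(workdir, "BC10^Dummy-Segment*"))

# The percent-encoded pattern does not: "%5E" is just three ordinary
# characters as far as wildcard matching is concerned.
encoded_matches = glob.glob(os.path.join(workdir, "BC10%5EDummy-Segment*"))

print(literal_matches)
print(encoded_matches)
```

The first list contains the one file and the second is empty, which is the same situation the shell is in when it sees the encoded wildcard.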

Try handling the glob from Python instead. With that, you can also get rid of that pesky shell=True; and with that change, it is finally actually correct and useful to pass the commands as a list of strings (never a list of a single space-separated string, and with shell=True, don't pass a list at all). Notice also the switch to check_call so we trap errors and don't delete the source files if copying them failed. (See also https://stackoverflow.com/a/51950538/874188 for additional rationale.)

import glob
import subprocess
import os
from urllib import quote_plus

filePath = "/tmp"
keyword = "BC10^Dummy-Segment"

wildcard = os.path.join(filePath, '{0}*'.format(keyword))
files = [quote_plus(x) for x in  glob.glob(wildcard)]
subprocess.check_call(["hadoop", "fs", "-copyFromLocal"] + files + ["/user/app"])
subprocess.check_call(["hadoop", "fs", "-rm"] + files)

This will not traverse subdirectories; but neither would your attempt with os.walk() do anything actually useful if it found files in subdirectories. If you actually want that to happen, please explain in more detail what the script should do.
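
For anyone who needs the same approach to run on both Python 2 and Python 3, here is a minimal sketch. The helper name `build_copy_command` is made up for illustration, and the function only builds the argument list so it can be inspected before anything is run. One deliberate difference from the code above: it uses `quote` rather than `quote_plus`, because `quote_plus` defaults to `safe=''` and would also encode the `/` separators as `%2F`. Whether Hadoop needs the percent-encoding at all is an assumption carried over from the question.

```python
import glob
import os

try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2


def build_copy_command(directory, keyword, destination):
    """Return the argument list for hadoop, or None if nothing matches.

    Hypothetical helper: it expands the wildcard in Python and
    percent-encodes each match, leaving "/" as-is.
    """
    wildcard = os.path.join(directory, "{0}*".format(keyword))
    matches = sorted(glob.glob(wildcard))
    if not matches:
        return None
    encoded = [quote(path) for path in matches]
    return ["hadoop", "fs", "-copyFromLocal"] + encoded + [destination]
```

Usage would look something like `cmd = build_copy_command("/tmp", "BC10^Dummy-Segment", "/user/app")` followed by `subprocess.check_call(cmd)` when `cmd` is not `None`. Returning `None` for an empty match avoids calling `hadoop` with no source files.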

tripleee
  • Hi, I'm seeing a ```SyntaxError: invalid syntax``` error with an arrow pointing to the "s" in ```subprocess.check_call(["hadoop", "fs", "-copyFromLocal"] + files + ["/user/app"])``` – Hua Cha Sep 11 '19 at 16:32
  • _I'm guessing you are URL-encoding the path to pass it properly to Hadoop, but in doing so you basically hide it from the shell. There really are no files matching the wildcard /tmp/BC10%5EDummy-Segment* where % etc are literal characters_ good catch! – geckos Sep 11 '19 at 16:33
  • @tripleee Sorry, one more error. ```from urllib.parse import urlencode, quote_plus``` ```ImportError: No module named parse``` – Hua Cha Sep 11 '19 at 16:44
  • I'm using Python 2.6.6 – Hua Cha Sep 11 '19 at 16:54
  • Eww, really? You seriously should upgrade. The `urlquote` module was called something else in Python 2. https://stackoverflow.com/questions/5607551/how-to-urlencode-a-querystring-in-python seems suitably ancient. – tripleee Sep 11 '19 at 17:04
  • In Python 2, use `from urllib import urlencode, quote_plus`. – Martijn Pieters Sep 11 '19 at 17:28
  • (but what you really, *really* should do is switch to Python 3 instead) – Martijn Pieters Sep 11 '19 at 17:30
  • @MartijnPieters I asked to upgrade but that request was shot down. Even with the updates, I'm still seeing an error with ```TypeError: urlencode() got an unexpected keyword argument 'quote_via'``` Any ideas? – Hua Cha Sep 11 '19 at 18:11
  • @tripleee At the ```files = [urlencode(x, quote_via=quote_plus) for x in glob.glob(wildcard)]``` line, I'm seeing ```TypeError: urlencode() got an unexpected keyword argument 'quote_via'``` Any ideas? – Hua Cha Sep 11 '19 at 19:21
  • By quick googling `files = [quote_plus(x) for x in glob.glob(wildcard)]` seems to work. Updated answer. Sorry for the mess; though you should probably have been more explicit up front with the fact that you are on Python 2 (and several other things, really). – tripleee Sep 12 '19 at 04:48