2

Is it possible to save files in Hadoop without saving them in the local file system first? I would like to do something like the code shown below, but save the file directly to HDFS. At the moment I save files in a documents directory and only then can I put them into HDFS, for instance using hadoop fs -put.

from subprocess import run

from django.core.files.storage import FileSystemStorage
from rest_framework.generics import GenericAPIView


class DataUploadView(GenericAPIView):

    def post(self, request):
        myfile = request.FILES['photo']
        # save the upload to the local file system first...
        fs = FileSystemStorage(location='documents/')
        filename = fs.save(myfile.name, myfile)
        local_path = 'my/path/documents/' + str(myfile.name)
        hdfs_path = '/user/user1/' + str(myfile.name)
        # ...then copy it into HDFS (pass the args as a list, without shell=True)
        run(['hadoop', 'fs', '-put', local_path, hdfs_path])
  • You could forward a byte stream as a WebHDFS request. That is what Hue will do... which is also a Django-like application - https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Create_and_Write_to_a_File – OneCricketeer Jul 26 '18 at 21:58

3 Answers

3

Hadoop has REST APIs that allow you to create files via WebHDFS.

So you could write your own create call against the REST API, using a Python library like requests to do the HTTP. However, there are also several Python libraries that support Hadoop/HDFS and already use the REST APIs, or that use the RPC mechanism via libhdfs:

  • pydoop
  • hadoopy
  • snakebite
  • pywebhdfs
  • hdfscli
  • pyarrow

Just make sure you look for how to create a file rather than having the Python library call hdfs dfs -put or hadoop fs -put.
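
For illustration, here is a minimal sketch of the two-step WebHDFS CREATE flow using requests. The NameNode address, HTTP port and user name below are placeholders you would replace with your cluster's values:

import requests

# Placeholder values: replace with your NameNode's HTTP address and your HDFS user
NAMENODE = 'http://namenode.example.com:9870'
USER = 'user1'

def create_in_hdfs(hdfs_path, data):
    # Step 1: ask the NameNode for a write location; it answers with a
    # redirect to a DataNode, which we capture instead of following
    url = NAMENODE + '/webhdfs/v1' + hdfs_path + '?op=CREATE&overwrite=true&user.name=' + USER
    resp = requests.put(url, allow_redirects=False)
    resp.raise_for_status()
    datanode_url = resp.headers['Location']

    # Step 2: send the actual file content to the DataNode
    resp = requests.put(datanode_url, data=data)
    resp.raise_for_status()

# In a Django view this lets you stream the upload without touching the local disk:
# create_in_hdfs('/user/user1/' + myfile.name, myfile.chunks())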


0

Here's how to download a file directly to HDFS with Pydoop:

import os
import requests
import pydoop.hdfs as hdfs


def dl_to_hdfs(url, hdfs_path):
    r = requests.get(url, stream=True)
    with hdfs.open(hdfs_path, 'w') as f:
        for chunk in r.iter_content(chunk_size=1024):
            f.write(chunk)


URL = "https://www.python.org/ftp/python/3.7.0/Python-3.7.0.tar.xz"
dl_to_hdfs(URL, os.path.basename(URL))

The above snippet works for a generic URL. If you already have the file as a Django UploadedFile, you can probably use its .chunks method to iterate through the data.
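
For instance, a sketch of adapting this to the question's Django view, assuming Pydoop is installed on the web server and the target HDFS directory is writable:

import pydoop.hdfs as hdfs

def save_upload_to_hdfs(uploaded_file, hdfs_path):
    # stream the Django UploadedFile into HDFS chunk by chunk,
    # without writing a temporary copy to the local file system
    with hdfs.open(hdfs_path, 'w') as f:
        for chunk in uploaded_file.chunks():
            f.write(chunk)

# in the view:
# save_upload_to_hdfs(request.FILES['photo'], '/user/user1/' + request.FILES['photo'].name)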

-2

Python is installed on your Linux machine and, on its own, can only access local files; it cannot directly access files in HDFS.

In order to save/put the files directly to HDFS, you need to use one of the options below:

  • Spark: use DStream for streaming files

  • Kafka: a matter of setting up a configuration file. Best for streaming data.

  • Flume: set up a configuration file. Best for static files.

  • Can I use any of these above when I get a file in a POST request in Django, to save the file directly to HDFS without saving it in the local system? – thedbogh Jul 26 '18 at 20:43
  • Pyspark can read and write to HDFS directly. PyArrow can as well. Why is streaming needed at all? – OneCricketeer Jul 28 '18 at 22:49
  • Just trying to understand: to save a file into HDFS directly, don't we need streaming in Spark? An RDD can be created from an existing file, parallelize, etc., but the file should be present in HDFS or another supported FS for that. I guess his question is about saving the files from a POST request, for which Flume or Flafka could be best. Please clarify. Thanks – Jim Todd Jul 29 '18 at 13:58