
I have files on a machine (say A) that is not part of the Hadoop (or HDFS) datacenter. So machine A is at a remote location from the HDFS datacenter.

Is there a script, command, program, or tool that can run on machines connected to Hadoop (part of the datacenter) and pull the files from machine A into HDFS directly? If yes, what is the best and fastest way to do this?

I know there are many ways, like WebHDFS or Talend, but they need to run from machine A, and the requirement is to avoid that and run them on machines in the datacenter.

Kalmesh Sam

2 Answers


There are two ways to achieve this:

  1. You can pull the data using scp and store it in a temporary location, copy it to HDFS, and then delete the temporarily stored data.

  2. If you do not want to keep it as a 2-step process, you can write a program which will read the files from the remote machine and write them to HDFS directly.

    This question, along with its comments and answers, would come in handy for reading the remote file, while you can use the snippet below to write to HDFS.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    
    String outFile = <path to the file, including the name of the new file>; // e.g. hdfs://localhost:<port>/foo/bar/baz.txt
    
    FileSystem hdfs = FileSystem.get(new URI("hdfs://<NameNode host>:<port>"), new Configuration());
    Path newFilePath = new Path(outFile);
    FSDataOutputStream out = hdfs.create(newFilePath);
    
    // put in a while loop here which reads until EOF and writes each chunk:
    out.write(buffer, 0, bytesRead);
    
    out.close();
    

    A buffer size of 50 * 1024 bytes works well if you have enough I/O capacity; otherwise you could use a much lower value like 10 * 1024.
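Option 1 above can be sketched as a short shell script run from one of the cluster machines (the hostnames and paths below are placeholders):

```shell
#!/bin/sh
# Step 1: pull the file from remote machine A into a temporary location.
scp user@machineA:/data/input.txt /tmp/input.txt

# Step 2: copy it into HDFS, then delete the temporary copy.
hdfs dfs -put /tmp/input.txt /foo/bar/input.txt
rm /tmp/input.txt
```

If your Hadoop version's `-put` supports reading from stdin (`hdfs dfs -put - <dst>`), you can even pipe `ssh user@machineA cat /data/input.txt` straight into it and skip the temporary file entirely.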

Harman
  • It is the usual 2-step process, and this is what I want to avoid. I want to copy directly from the remote machine to HDFS, but the program/script should run on a cluster machine and not on the remote machine. – Kalmesh Sam Feb 02 '15 at 10:58
  • This would run on the cluster machine and not the remote machine, and it is the easiest way out. Why do you want to avoid the 2-step process? Is there some specific use-case? – Harman Feb 02 '15 at 11:00

Please tell me if I am understanding your question correctly: 1. you want to copy the file from a remote location; 2. the client machine is not part of the Hadoop cluster; 3. it may not contain the required Hadoop libraries.

The best way is WebHDFS, i.e. the REST API.
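For reference, a WebHDFS file upload is a two-request flow; a sketch with curl follows (the NameNode host, port, and paths are placeholders; the default HTTP port is 50070 on Hadoop 2.x and 9870 on 3.x):

```shell
# Step 1: ask the NameNode to create the file; it answers with a
# 307 redirect whose Location header points at a DataNode.
curl -i -X PUT "http://<namenode>:50070/webhdfs/v1/foo/bar/input.txt?op=CREATE&user.name=hadoop"

# Step 2: send the file contents to the DataNode URL from the Location header.
curl -i -X PUT -T input.txt "<datanode-url-from-Location-header>"
```

Note that these requests would have to be issued from machine A (or wherever the file lives), which is what the question wanted to avoid.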

Kiranb