
We receive new files every day from apps, in the form of CSVs stored on a Windows server at, say, c:/program files(x86)/webapps/apachetomcat/.csv, each file containing different data. Is there any Hadoop component to transfer files from the Windows server to Hadoop HDFS? I came across Flume and Kafka but could not find a proper example. Can anyone shed some light here?

Each file has a distinct name and a size of up to 10-20 MB, and the daily file count is more than 200 files. Once the files are added to the Windows server, Flume/Kafka should be able to put them into Hadoop. Later the files are read from HDFS, processed by Spark, and the processed files are moved to another folder in HDFS.

Deno George

2 Answers


Flume is the best choice here. A Flume agent (a process) needs to be configured, and it has three parts:

Flume source - the place where Flume watches for new files: c:/program files(x86)/webapps/apachetomcat/.csv in your case.

Flume sink - the place where Flume delivers the files: an HDFS location in your case.

Flume channel - a temporary holding area for events before they reach the sink. You should use the file channel in your case, since it persists events to disk and survives agent restarts.

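As a minimal sketch, the agent configuration could look something like this, assuming a spooling-directory source on the machine where the CSVs land (the agent name, channel paths, and NameNode address are all hypothetical):

# Name the components of a hypothetical agent called "agent1"
agent1.sources = csvSource
agent1.channels = fileChannel
agent1.sinks = hdfsSink

# Spooling-directory source: watches the folder for new files
agent1.sources.csvSource.type = spooldir
agent1.sources.csvSource.spoolDir = C:/Program Files (x86)/webapps/apachetomcat
agent1.sources.csvSource.channels = fileChannel

# File channel: buffers events on local disk so they survive restarts
agent1.channels.fileChannel.type = file
agent1.channels.fileChannel.checkpointDir = C:/flume/checkpoint
agent1.channels.fileChannel.dataDirs = C:/flume/data

# HDFS sink: writes the events into HDFS
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.channel = fileChannel
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/data/csv
agent1.sinks.hdfsSink.hdfs.fileType = DataStream

One caveat with the spooling-directory source: it expects files to be complete and immutable once they appear in the directory, and it renames each file with a .COMPLETED suffix after ingesting it.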

kashmoney
  • Thanks Akash. So I need Flume on Windows and on Linux too? Can you give me a detailed explanation and a sample example? – Deno George Dec 01 '16 at 12:56
  • Yes, you would need 2 agents running, as shown here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_installing_manually_book/content/installing_flume.html. If you can somehow get the logs to a local HDFS node, that would be awesome, but if that's not possible there are some workarounds listed at http://stackoverflow.com/questions/26168820/transferring-files-from-remote-node-to-hdfs-with-flume. – kashmoney Dec 01 '16 at 16:39
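For illustration, a sketch of how the two agents from the comment above could be linked (agent names, host, and port are hypothetical): the Windows-side agent forwards events through an Avro sink, and the Linux-side agent receives them with a matching Avro source in front of its HDFS sink:

# Windows-side agent: Avro sink pointing at the Linux host
winagent.sinks.avroOut.type = avro
winagent.sinks.avroOut.hostname = linux-edge.example.com
winagent.sinks.avroOut.port = 4545

# Linux-side agent: Avro source listening on the same port
hdfsagent.sources.avroIn.type = avro
hdfsagent.sources.avroIn.bind = 0.0.0.0
hdfsagent.sources.avroIn.port = 4545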

As per my comment, more details would help narrow down the possibilities. As a first thought: move the files to a server that can reach the cluster, then create a bash script around hdfs dfs -put and schedule it with cron (a sketch follows the documentation below).

put

Usage: hdfs dfs -put <localsrc> ... <dst>

Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system.

hdfs dfs -put localfile /user/hadoop/hadoopfile
hdfs dfs -put localfile1 localfile2 /user/hadoop/hadoopdir
hdfs dfs -put localfile hdfs://nn.example.com/hadoop/hadoopfile
hdfs dfs -put - hdfs://nn.example.com/hadoop/hadoopfile (reads the input from stdin)
Exit Code:

Returns 0 on success and -1 on error.
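Putting that together, a minimal sketch of such a script, assuming the CSVs have already been copied to a Linux node that can reach the cluster (all paths below are hypothetical):

#!/bin/bash
# Hypothetical paths - adjust to your environment.
SRC=/data/incoming          # where the CSVs land on the Linux node
DST=/user/hadoop/incoming   # HDFS target directory
DONE=/data/done             # local archive for files already uploaded

for f in "$SRC"/*.csv; do
  [ -e "$f" ] || continue                  # skip if the glob matched nothing
  if hdfs dfs -put "$f" "$DST"/; then      # upload one file to HDFS
    mv "$f" "$DONE"/                       # archive it locally on success
  fi
done

Scheduled with a crontab entry such as */5 * * * * /opt/scripts/ingest_csv.sh, this picks up new files every five minutes.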
AM_Hawk