
We receive new files every day from apps, in the form of CSVs stored on a Windows server at, say, c:/program files(x86)/webapps/apachetomcat/.csv, each file containing different data. Is there any Hadoop component to transfer files from the Windows server to Hadoop HDFS? I came across Flume and Kafka but could not find a proper example. Can anyone shed some light here?

Each file has a distinct name and a size of up to 10-20 MB, and the daily file count is more than 200 files. Once the files are added to the Windows server, Flume/Kafka should be able to put them into Hadoop. Later the files are read from HDFS, processed by Spark, and the processed files are moved to another folder in HDFS.

Deno George

2 Answers


Flume is the best choice here. A Flume agent (a process) needs to be configured, and it has three parts:

Flume source - the place where Flume watches for new files: c:/program files(x86)/webapps/apachetomcat/.csv in your case.

Flume sink - the place where Flume delivers the files: an HDFS location in your case.

Flume channel - a temporary holding area for events before they reach the sink. You should use the file channel in your case, since it persists events to disk and survives agent restarts.

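As a minimal sketch, the agent configuration could look something like this, assuming a spooling-directory source on the machine where the CSVs land (the agent name, channel paths, and NameNode address are all hypothetical):

# Name the components of a hypothetical agent called "agent1"
agent1.sources = csvSource
agent1.channels = fileChannel
agent1.sinks = hdfsSink

# Spooling-directory source: watches the folder for new files
agent1.sources.csvSource.type = spooldir
agent1.sources.csvSource.spoolDir = C:/Program Files (x86)/webapps/apachetomcat
agent1.sources.csvSource.channels = fileChannel

# File channel: buffers events on local disk so they survive restarts
agent1.channels.fileChannel.type = file
agent1.channels.fileChannel.checkpointDir = C:/flume/checkpoint
agent1.channels.fileChannel.dataDirs = C:/flume/data

# HDFS sink: writes the events into HDFS
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.channel = fileChannel
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/data/csv
agent1.sinks.hdfsSink.hdfs.fileType = DataStream

One caveat with the spooling-directory source: it expects files to be complete and immutable once they appear in the directory, and it renames each file with a .COMPLETED suffix after ingesting it.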

kashmoney
  • Thanks Akash. So I need Flume on Windows and on Linux too? Can you give me a detailed explanation and a sample example? – Deno George Dec 01 '16 at 12:56
  • Yes, you would need 2 agents running, as shown here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_installing_manually_book/content/installing_flume.html. If you can somehow get the logs to a local HDFS node, that would be awesome, but if that's not possible there are some workarounds listed at http://stackoverflow.com/questions/26168820/transferring-files-from-remote-node-to-hdfs-with-flume. – kashmoney Dec 01 '16 at 16:39
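For illustration, a sketch of how the two agents from the comment above could be linked (agent names, host, and port are hypothetical): the Windows-side agent forwards events through an Avro sink, and the Linux-side agent receives them with a matching Avro source in front of its HDFS sink:

# Windows-side agent: Avro sink pointing at the Linux host
winagent.sinks.avroOut.type = avro
winagent.sinks.avroOut.hostname = linux-edge.example.com
winagent.sinks.avroOut.port = 4545

# Linux-side agent: Avro source listening on the same port
hdfsagent.sources.avroIn.type = avro
hdfsagent.sources.avroIn.bind = 0.0.0.0
hdfsagent.sources.avroIn.port = 4545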

As per my comment, more details would help narrow down the possibilities. As a first thought: move the files to a server that can reach the cluster, then create a bash script around hdfs dfs -put and schedule it with cron (a sketch follows the documentation below).

put

Usage: hdfs dfs -put <localsrc> ... <dst>

Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system.

hdfs dfs -put localfile /user/hadoop/hadoopfile
hdfs dfs -put localfile1 localfile2 /user/hadoop/hadoopdir
hdfs dfs -put localfile hdfs://nn.example.com/hadoop/hadoopfile
hdfs dfs -put - hdfs://nn.example.com/hadoop/hadoopfile (reads the input from stdin)
Exit Code:

Returns 0 on success and -1 on error.
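Putting that together, a minimal sketch of such a script, assuming the CSVs have already been copied to a Linux node that can reach the cluster (all paths below are hypothetical):

#!/bin/bash
# Hypothetical paths - adjust to your environment.
SRC=/data/incoming          # where the CSVs land on the Linux node
DST=/user/hadoop/incoming   # HDFS target directory
DONE=/data/done             # local archive for files already uploaded

for f in "$SRC"/*.csv; do
  [ -e "$f" ] || continue                  # skip if the glob matched nothing
  if hdfs dfs -put "$f" "$DST"/; then      # upload one file to HDFS
    mv "$f" "$DONE"/                       # archive it locally on success
  fi
done

Scheduled with a crontab entry such as */5 * * * * /opt/scripts/ingest_csv.sh, this picks up new files every five minutes.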
AM_Hawk