Moving and merging directories in HDFS

Question

I'm changing an HDFS directory structure. The current one is as follows:

.../customers/customers1/2016-05-16-10/lots_of_files1.csv
.../customers/customers2/2016-05-16-10/lots_of_files2.csv
.../customers/customers3/2016-05-16-10/lots_of_files1.csv
.../customers/customers4/2016-05-16-10/...
.../customers/customers5/2016-05-16-10/...
.../customers/customers6/2016-05-16-10/...
.../customers/customers7/2016-05-16-10/...

I'd like to get rid of the customers1 through customers7 level, so that the result looks like this:

.../customers/2016-05-16-10/lots_of_files1.csv
.../customers/2016-05-16-10/lots_of_files2.csv
.../customers/2016-05-16-10/lots_of_files1(1).csv

I thought of using the snakebite Python HDFS library, but several edge cases arise:

1. The same date may occur more than once.
2. The same CSV name may occur more than once, but its data is different and must be moved as well.

How do you achieve it in the cleanest way possible?

score 0 · Answer 1 · answered May 06 '16 at 18:38

If you don't need to preserve the file names, you can do this easily with Apache Drill, which supports reading and writing files through SQL. Something like:

create table dfs.`/myfolder/customers/2016-05-16-10` as select * from dfs.`/myfolder/customers` where dir1 = '2016-05-16-10';

The data from all the files under /myfolder/customers/*/2016-05-16-10 will be written to the target table.

https://drill.apache.org/docs/
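
If the file names do need to be kept, one alternative is to plan the renames up front and resolve name clashes before touching HDFS. The following is only a minimal sketch of that idea, not part of the Drill approach above: `plan_moves` is a hypothetical helper, it assumes every path has the shape `<root>/<customer>/<date>/<file>`, and it only considers clashes among the files being moved (files already sitting in the target date directories are not checked). It does not talk to HDFS at all; it just returns (source, destination) pairs.

```python
import posixpath


def plan_moves(paths, root):
    """Map each source file to a flattened, collision-free destination.

    paths: iterable of '<root>/<customer>/<date>/<file>' strings
           (this layout is an assumption taken from the question).
    Returns a list of (source, destination) pairs.
    """
    taken = set()  # destinations already assigned in this plan
    moves = []
    for src in sorted(paths):  # sorted so the plan is deterministic
        rel = posixpath.relpath(src, root)
        _customer, date, name = rel.split("/")
        stem, ext = posixpath.splitext(name)
        dest = posixpath.join(root, date, name)
        n = 0
        # On a name clash, append (1), (2), ... before the extension.
        while dest in taken:
            n += 1
            dest = posixpath.join(root, date, f"{stem}({n}){ext}")
        taken.add(dest)
        moves.append((src, dest))
    return moves
```

Each resulting pair could then be applied with whatever client is at hand, for example `hdfs dfs -mkdir -p` for the date directory followed by `hdfs dfs -mv <source> <destination>`. Because a move within HDFS is a metadata operation, the file contents are not rewritten, unlike with the SQL approach.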

vgunnu

How does it handle csvs with the same name? @vgunnu – TheSilence May 07 '16 at 08:02
It merges all the files in that folder into new files, similar to Hive. – vgunnu May 09 '16 at 14:20