I'm changing an hdfs directory structure. The current one is as follows:

.../customers/customers1/2016-05-16-10/lots_of_files1.csv
.../customers/customers2/2016-05-16-10/lots_of_files2.csv
.../customers/customers3/2016-05-16-10/lots_of_files1.csv
.../customers/customers4/2016-05-16-10/...
.../customers/customers5/2016-05-16-10/...
.../customers/customers6/2016-05-16-10/...
.../customers/customers7/2016-05-16-10/...

I'd like to get rid of the customers1 through customers7 level:

.../customers/2016-05-16-10/lots_of_files1.csv
.../customers/2016-05-16-10/lots_of_files2.csv
.../customers/2016-05-16-10/lots_of_files1(1).csv

I thought of using the snakebite Python HDFS library, but lots of edge cases arise:

1. The same date may occur more than once.
2. The name of a csv may occur more than once, but its data is different and must be moved as well.

How do you achieve it in the cleanest way possible?

TheSilence

1 Answer

If you don't need to keep the file names, you can do this easily with Apache Drill, which supports reading and writing files through SQL. Something like:

create table dfs.`/myfolder/customers/2016-05-16-10` as select * from dfs.`/myfolder/customers` where dir1 = '2016-05-16-10';

The rows from all files under /myfolder/customers/*/2016-05-16-10 will be written to the target table.
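If the file names do need to be kept, one possible alternative is to plan the renames first and only then move the files. The following is a minimal Python sketch under stated assumptions, not a tested HDFS solution: `plan_moves` and its arguments are hypothetical names, every source path is assumed to have the form root/customersN/date/file, and the function only computes (source, destination) pairs without touching HDFS. A name that clashes within the same date gets a numeric suffix such as lots_of_files1(1).csv.

```python
import posixpath
from collections import defaultdict


def plan_moves(source_paths, root):
    """Map each source file to a flattened destination under root/<date>/.

    Assumes each path looks like root/<customerN>/<date>/<file>.
    Clashing names within one date become name(1).ext, name(2).ext, ...
    """
    taken = defaultdict(set)  # date directory -> file names already assigned
    moves = []
    for src in sorted(source_paths):  # sorted so the plan is deterministic
        date_dir, name = src.split("/")[-2:]
        stem, ext = posixpath.splitext(name)
        candidate, n = name, 0
        while candidate in taken[date_dir]:
            n += 1
            candidate = f"{stem}({n}){ext}"
        taken[date_dir].add(candidate)
        moves.append((src, posixpath.join(root, date_dir, candidate)))
    return moves
```

Each resulting pair could then be handed to whatever HDFS client is in use (for example `hdfs dfs -mv src dst`) after the destination date directory has been created. The sketch does not check for files that already exist in the destination, so that would need handling separately.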

https://drill.apache.org/docs/

vgunnu