Comparing massive amounts of data between clusters

Asked Aug 21 '18 at 03:11

Active Aug 21 '18 at 06:17

Viewed 88 times

Our team is migrating an old CDH cluster to a new CDH cluster.

I am tasked with comparing the data stored in the non-kerberized cluster (the old cluster) with the data stored in the kerberized cluster (the new cluster).

The kerberized cluster runs on Isilon. The non-kerberized cluster runs on normal Linux.

Both clusters run the same Python programs, which put files into the cluster for Hive analysis.

The data is approximately 45 GB per partition on each cluster.

Now I want to prove that the data put by each Python program is the same, by comparing it with a method such as MD5.

Of course, the same programs should produce the same result. But we are concerned about garbled data, unpredictable data loss, or files whose size is the same but whose contents differ.

Is there a way to compare data this large?
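To make the idea concrete, here is a minimal sketch of an MD5-based comparison. It is only an illustration under assumptions: it presumes each cluster's partition directory can be read through an ordinary local path (for example a mount or a local copy), and that both programs write files with the same relative names and byte layout. The function names `file_md5`, `build_manifest`, and `diff_manifests` are hypothetical, not part of any existing tool.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=8 * 1024 * 1024):
    """Return the hex MD5 of a file, reading it in fixed-size chunks
    so that a very large file never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path relative to root to a (size, md5) tuple.
    Assumption: root is a locally readable directory."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            manifest[rel] = (path.stat().st_size, file_md5(path))
    return manifest


def diff_manifests(old, new):
    """Return (missing_in_new, extra_in_new, mismatched) as sorted lists."""
    missing = sorted(set(old) - set(new))
    extra = sorted(set(new) - set(old))
    mismatched = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return missing, extra, mismatched
```

Each manifest could be built separately on its own cluster and only the small manifests compared afterwards, so the 45 GB partitions themselves never need to cross between clusters. Because the manifest stores both size and checksum, a file with the same size but different contents shows up in the mismatched list.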

edited Aug 21 '18 at 06:17

asked Aug 21 '18 at 03:11

Yuki Saito


0 Answers