I have a set of TSV files on HDFS structured like this:

g1  a
g1  b
g1  c
g2  a
g2  x
g2  y
g3  b
g3  d
...

I'd like to convert these files into files called hdfs:///tmp/g1.tsv, hdfs:///tmp/g2.tsv, and hdfs:///tmp/g3.tsv such that...

g1.tsv looks like:

a
b
c

g2.tsv looks like:

a
x
y

g3.tsv looks like:

b
d

etc.

These files are large, and I'd like to do this conversion with as much parallelism as possible. Is there a simple MapReduce job, Spark job, or HDFS file operation that can do this?

Michael K
  • With Spark, you can use df.partitionBy('column_one'), then save to HDFS (see the sketch after these comments). – DennisLi Jul 14 '19 at 13:04
  • I'm not sure if this is what you want: https://stackoverflow.com/questions/41663985/spark-dataframe-how-to-efficiently-split-dataframe-for-each-group-based-on-same – DennisLi Jul 14 '19 at 13:04
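
A minimal PySpark sketch of the partitionBy approach from the comments follows. The input glob hdfs:///input/*.tsv, the output path hdfs:///tmp/out, and the column names key/value are placeholders, not from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the tab-separated input; the path glob is a placeholder.
df = (spark.read
      .option("sep", "\t")
      .csv("hdfs:///input/*.tsv")
      .toDF("key", "value"))

# partitionBy writes one directory per distinct key, e.g.
# hdfs:///tmp/out/key=g1/part-*.csv, and drops the "key" column
# from the data files, leaving only the second column in each file.
(df.write
   .partitionBy("key")
   .option("sep", "\t")
   .csv("hdfs:///tmp/out"))

Note that this yields a directory per key rather than single files named g1.tsv, g2.tsv, etc.; getting exactly those names would still need a follow-up step, e.g. merging each directory's part files and an hdfs dfs -mv per key.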

0 Answers