I have a set of TSV files on HDFS structured like this:
g1 a
g1 b
g1 c
g2 a
g2 x
g2 y
g3 b
g3 d
...
I'd like to convert these files into files called hdfs:///tmp/g1.tsv, hdfs:///tmp/g2.tsv, and hdfs:///tmp/g3.tsv such that...
g1.tsv looks like:
a
b
c
g2.tsv looks like:
a
x
y
g3.tsv looks like:
b
d
etc.
These files are large, so I'd like this split to run in parallel as much as possible. Is there a simple MapReduce job, Spark job, or HDFS file operation for doing this?
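To make the transformation concrete, here is a single-machine Python sketch of the split semantics I'm after (the function name and local-file I/O are just for illustration; the real data lives on HDFS and is too large to process on one node, which is why I'm looking for a distributed approach):

```python
import os

def split_by_key(input_lines, out_dir):
    """Split tab-separated (key, value) lines into one file per key,
    writing each value on its own line to <out_dir>/<key>.tsv."""
    handles = {}
    try:
        for line in input_lines:
            key, value = line.rstrip("\n").split("\t", 1)
            if key not in handles:
                # Open one output file per distinct key, e.g. g1.tsv
                handles[key] = open(os.path.join(out_dir, key + ".tsv"), "w")
            handles[key].write(value + "\n")
    finally:
        for f in handles.values():
            f.close()
```

Given the sample input above, this produces g1.tsv containing a, b, c (one per line), g2.tsv containing a, x, y, and g3.tsv containing b, d.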