I have a set of TSV files on HDFS structured like this:

g1  a
g1  b
g1  c
g2  a
g2  x
g2  y
g3  b
g3  d
...

I'd like to convert these files into files called hdfs:///tmp/g1.tsv, hdfs:///tmp/g2.tsv, and hdfs:///tmp/g3.tsv such that...

g1.tsv looks like:

a
b
c

g2.tsv looks like:

a
x
y

g3.tsv looks like:

b
d

etc.

These files are large, and I'd like to do this conversion with as much parallelism as possible. Is there a simple MapReduce job, Spark job, or HDFS file operation that can do this?

Michael K
  • With Spark, you can use df.partitionBy('column_one'), then save to HDFS (see the sketch after these comments). – DennisLi Jul 14 '19 at 13:04
  • I'm not sure if this is what you want: https://stackoverflow.com/questions/41663985/spark-dataframe-how-to-efficiently-split-dataframe-for-each-group-based-on-same – DennisLi Jul 14 '19 at 13:04
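
A minimal PySpark sketch of the partitionBy approach from the comments follows. The input glob hdfs:///input/*.tsv, the output path hdfs:///tmp/out, and the column names key/value are placeholders, not from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the tab-separated input; the path glob is a placeholder.
df = (spark.read
      .option("sep", "\t")
      .csv("hdfs:///input/*.tsv")
      .toDF("key", "value"))

# partitionBy writes one directory per distinct key, e.g.
# hdfs:///tmp/out/key=g1/part-*.csv, and drops the "key" column
# from the data files, leaving only the second column in each file.
(df.write
   .partitionBy("key")
   .option("sep", "\t")
   .csv("hdfs:///tmp/out"))

Note that this yields a directory per key rather than single files named g1.tsv, g2.tsv, etc.; getting exactly those names would still need a follow-up step, e.g. merging each directory's part files and an hdfs dfs -mv per key.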

0 Answers