I have a requirement where I want to write each individual record in an RDD to an individual file in HDFS.

I did this for the normal (local) filesystem, but obviously it doesn't work for HDFS.

stream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.foreach { msg =>
      val value = msg._2                              // take the value from the (key, value) message pair
      println(value)
      val fname = java.util.UUID.randomUUID.toString  // unique file name per record
      val path = dir + fname
      write(path, value)
    }
  }
}

where write is a function that writes to the filesystem.
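
For illustration, here is a minimal sketch of what such a helper could look like when targeting HDFS through Hadoop's FileSystem API (this helper is an assumption for illustration, not the actual write from the question; it resolves to HDFS when fs.defaultFS points there):

import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical write helper: creates one file per call.
// The FileSystem handle is obtained inside the function so the closure
// stays serializable when this runs in rdd.foreach on the executors.
def write(path: String, value: String): Unit = {
  val fs = FileSystem.get(new Configuration())
  val out = fs.create(new Path(path))
  try out.write(value.getBytes(StandardCharsets.UTF_8))
  finally out.close()
}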

Is there a way to do this within Spark, so that for each record I can write natively to HDFS, without using any other tool like Kafka Connect or Flume?


EDIT: More Explanation

For example, if the RDD from my DStream has the following records,

  • abcd
  • efgh
  • ijkl
  • mnop

I need a different file for each record: one file for "abcd", another for "efgh", and so on.

I tried creating an RDD within the stream's RDD, but I learned that it's not allowed, as RDDs are not serializable.

Biplob Biswas
  • can you please post the working solution or accept the correct one? It helps other people who have a similar issue. – Explorer May 13 '17 at 13:58
  • @LiveAndLetLive I haven't found a solution to this problem yet, and as I mentioned in one of the previous comments, we moved from storing each record individually to storing the entire RDD with multiple records. So this question is still open. – Biplob Biswas May 15 '17 at 14:47
  • you may use your own MultipleTextOutputFormat; see this reply: https://stackoverflow.com/a/26051042/609597 (a sketch of that approach follows below) – softwarevamp Aug 18 '17 at 08:20
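
Based on that linked answer, a hedged sketch of the MultipleTextOutputFormat approach (the class name and the UUID keying are assumptions for illustration; dir is the output directory from the question):

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Each pair's key becomes the output file name; only the value is written.
class PerRecordOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String]
}

// Key every record with a UUID so each one lands in its own file under dir.
rdd.map(value => (java.util.UUID.randomUUID.toString, value))
   .saveAsHadoopFile(dir, classOf[String], classOf[String], classOf[PerRecordOutputFormat])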

2 Answers


You can forcefully repartition the RDD so that the number of partitions equals the number of records, and then save it:

val rddCount = rdd.count()
rdd.repartition(rddCount.toInt).saveAsTextFile("your/hdfs/loc")  // repartition expects an Int, count() returns a Long
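
As a standalone illustration (the sample data and output path are assumptions), each partition is written as its own part file, so with as many partitions as records you get roughly one record per file; note that repartition performs a full shuffle, and an even one-record-per-partition spread is typical but not strictly guaranteed:

val records = sc.parallelize(Seq("abcd", "efgh", "ijkl", "mnop"))
val n = records.count().toInt  // count() returns a Long; repartition expects an Int

// Produces part-00000 .. part-00003 under the output directory
records.repartition(n).saveAsTextFile("hdfs:///tmp/one-record-per-file")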
Shasankar

You can do this in a couple of ways.

From the rdd you can get the SparkContext; once you have the SparkContext, you can use the parallelize method, passing the string as a single-element sequence.

For example:

val sc = rdd.sparkContext                                // only usable on the driver, not inside executor-side closures
sc.parallelize(Seq("some string")).saveAsTextFile(path)  // writes a directory of part files under path
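
Tying this back to the question, a hedged sketch of applying this idea per record (the collect() call and the dir variable are assumptions for illustration): a SparkContext is only usable on the driver, so the batch has to be brought back with collect() first, and each saveAsTextFile call produces a directory of part files rather than a single plain file.

stream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // collect() brings the whole batch to the driver; assumed acceptable for small batches only
    rdd.collect().foreach { msg =>
      val out = dir + java.util.UUID.randomUUID.toString
      rdd.sparkContext.parallelize(Seq(msg._2), 1).saveAsTextFile(out)
    }
  }
}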

Alternatively, you can use sqlContext to convert the string to a DataFrame and then write it to a file.

For example:

import sqlContext.implicits._
// A single-column DataFrame of strings; write.text saves it as text files under path
Seq(("some string")).toDF.write.text(path)
Shankar