
I am trying to load a DataFrame into HDFS and S3 as a text-format file using the code below. The DataFrame's name is finalData.

import java.util.Calendar
import java.text.SimpleDateFormat

val targetPath = "/user/test/File"
val now = Calendar.getInstance().getTime()
val formatter = new SimpleDateFormat("yyyyMMddHHmmss")
val timeStampAfterFormatting = formatter.format(now)
val targetFile = s"""$targetPath/test_$timeStampAfterFormatting.txt"""
finalData.repartition(1).rdd.saveAsTextFile(targetFile)

Using the code above I can load the data successfully, but the file name is not the one I provided, and the output is not a single text file. Instead, a directory is created with the name I specified.

Directory Name - /user/test/File/test_20170918055206.txt

-bash-4.2$ hdfs dfs -ls /user/test/File/test_20170918055206.txt

Found 2 items

/user/test/File/test_20170918055206.txt/_SUCCESS

/user/test/File/test_20170918055206.txt/part-00000

I want to create the file with the name I specified instead of a directory. Can anyone please assist me?

Avijit

1 Answer

In my opinion, this is working as designed.

You performed a repartition operation just before saving your RDD data. That triggers a shuffle across the whole dataset and eventually produces a new RDD with only one partition.

So that single partition is what the saveAsTextFile operation stored in your HDFS.

The method is designed this way so that an arbitrary number of partitions can be written in a uniform way: the path you pass becomes a directory, and each partition is written as one part file inside it.

For example, if your RDD has 100 partitions and you do no coalesce or repartition before writing to HDFS, you will get a directory containing a _SUCCESS flag and 100 part files!
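To make the one-file-per-partition behaviour concrete, here is a minimal sketch (assuming a running SparkSession named spark; the output path is illustrative):

```scala
// Sketch: an RDD with 100 partitions yields 100 part files
val rdd = spark.sparkContext.parallelize(1 to 1000, 100)
rdd.saveAsTextFile("/tmp/hundred_parts")
// The directory /tmp/hundred_parts will contain a _SUCCESS marker
// plus part-00000 through part-00099, one file per partition.
```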

If the method were not designed this way, how could an RDD with multiple partitions be stored in a concise, uniform, and elegant way? Users might even have to supply a file name for every partition themselves, which would be tedious.

I hope this explanation helps you.


If you then need the complete output as a single file on your local file system, just try the hadoop client command:

hadoop fs -getmerge [src] [des]
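Alternatively, if you want the single part file to end up in HDFS under the exact name you chose (rather than merging to the local file system), a common workaround is to write to a temporary directory and then rename the part file with the Hadoop FileSystem API. This is only a sketch: the temporary path is made up, and it assumes a SparkSession named spark plus the finalData and targetFile values from the question, with exactly one output partition.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Write to a temporary directory first (illustrative path)
val tmpDir = "/user/test/File/_tmp_output"
finalData.repartition(1).rdd.saveAsTextFile(tmpDir)

// Move the single part file to the desired target file name
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.rename(new Path(s"$tmpDir/part-00000"), new Path(targetFile))

// Remove the temporary directory and its _SUCCESS marker
fs.delete(new Path(tmpDir), true)
```

Note that this still funnels all data through one partition, so it is only suitable when the output is small enough to fit in a single executor's write path.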

ashburshui