
I am trying to load a DataFrame into HDFS and S3 as a text-format file using the code below. The DataFrame's name is finalData.

import java.util.Calendar
import java.text.SimpleDateFormat

val targetPath = "/user/test/File"
val now = Calendar.getInstance().getTime()
val formatter = new SimpleDateFormat("yyyyMMddHHmmss")
val timeStampAfterFormatting = formatter.format(now)
val targetFile = s"""$targetPath/test_$timeStampAfterFormatting.txt"""
finalData.repartition(1).rdd.saveAsTextFile(targetFile)

Using the code above I can load the data successfully, but the file name is not the one I provided, and the output is not a single text file. Instead, a directory is created with the name I specified.

Directory Name - /user/test/File/test_20170918055206.txt

-bash-4.2$ hdfs dfs -ls /user/test/File/test_20170918055206.txt

Found 2 items

/user/test/File/test_20170918055206.txt/_SUCCESS

/user/test/File/test_20170918055206.txt/part-00000

I want to create the file with the name I specified instead of a directory. Can anyone please assist me?

Avijit

1 Answer

In my opinion, this is working as designed.

You performed a repartition operation just before saving your RDD data. That triggers a shuffle across the whole dataset and eventually produces a new RDD with only one partition.

So that single partition is what the saveAsTextFile operation stored in your HDFS.

The method is designed this way so that an arbitrary number of partitions can be written in a uniform way: the path you pass becomes a directory, and each partition is written as one part file inside it.

For example, if your RDD has 100 partitions and you do no coalesce or repartition before writing to HDFS, you will get a directory containing a _SUCCESS flag and 100 part files!
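To make the one-file-per-partition behaviour concrete, here is a minimal sketch (assuming a running SparkSession named spark; the output path is illustrative):

```scala
// Sketch: an RDD with 100 partitions yields 100 part files
val rdd = spark.sparkContext.parallelize(1 to 1000, 100)
rdd.saveAsTextFile("/tmp/hundred_parts")
// The directory /tmp/hundred_parts will contain a _SUCCESS marker
// plus part-00000 through part-00099, one file per partition.
```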

If the method were not designed this way, how could an RDD with multiple partitions be stored in a concise, uniform, and elegant way? Users might even have to supply a file name for every partition themselves, which would be tedious.

I hope this explanation helps you.


If you then need the complete output as a single file on your local file system, just try the hadoop client command:

hadoop fs -getmerge [src] [des]
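Alternatively, if you want the single part file to end up in HDFS under the exact name you chose (rather than merging to the local file system), a common workaround is to write to a temporary directory and then rename the part file with the Hadoop FileSystem API. This is only a sketch: the temporary path is made up, and it assumes a SparkSession named spark plus the finalData and targetFile values from the question, with exactly one output partition.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Write to a temporary directory first (illustrative path)
val tmpDir = "/user/test/File/_tmp_output"
finalData.repartition(1).rdd.saveAsTextFile(tmpDir)

// Move the single part file to the desired target file name
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.rename(new Path(s"$tmpDir/part-00000"), new Path(targetFile))

// Remove the temporary directory and its _SUCCESS marker
fs.delete(new Path(tmpDir), true)
```

Note that this still funnels all data through one partition, so it is only suitable when the output is small enough to fit in a single executor's write path.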

ashburshui