6

How to write data in the dataframe into single .parquet file(both data & metadata in single file) in HDFS?

df.show() --> 2 rows
+------+--------------+----------------+
| name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+

df.rdd.getNumPartitions() shows it has 1 partition:

>>> df.rdd.getNumPartitions()

1

df.write.save("/user/hduser/data_check/test.parquet", format="parquet")

If I use the above command to create a parquet file in HDFS, it creates a directory "test.parquet" in HDFS, and inside that directory multiple files get saved: a .parquet data file plus metadata files.

Found 4 items

-rw-r--r-- 3 bimodjoul biusers 0 2017-03-15 06:47 
/user/hduser/data_check/test.parquet/_SUCCESS 
-rw-r--r-- 3 bimodjoul biusers 494 2017-03-15 06:47
/user/hduser/data_check/test.parquet/_common_metadata
-rw-r--r-- 3 bimodjoul biusers 862 2017-03-15 06:47
/user/hduser/data_check/test.parquet/_metadata 
-rw-r--r-- 3 bimodjoul biusers 885 2017-03-15 06:47
/user/hduser/data_check/test.parquet/part-r-00000-f83a2ffd-38bb-4c76-9f4c-357e43d9708b.gz.parquet

How to write the data in the dataframe into a single .parquet file (both data & metadata in a single file) in HDFS, rather than a folder with multiple files?

Help would be much appreciated.

Indrajit Swain
  • 1,505
  • 1
  • 15
  • 22
Shiva Ram
  • 61
  • 1
  • 4
  • 1
    use coalesce(1) to get single file – Ashish Singh Mar 15 '17 at 07:44
  • why do you need one file? if you need it just to move it along then use the .gz.parquet file as it should have everything you need. The other files are generated in the process for various things. – Assaf Mendelson Mar 15 '17 at 07:57
  • Hi @Ashish Singh, I have tried below two commands, df.coalesce(1).write.save("/user/hduser/data_check/test_3.parquet", format="parquet"); df.coalesce(1).write.parquet("/user/hduser/data_check/test_4.parquet"); These commands are also saving or writing as directory with parquet data file and metadata files. – Shiva Ram Mar 15 '17 at 09:32
  • Like this: hadoop fs -ls /user/hduser/data_check/test_3.parquet Found 4 items -rw-r--r-- 3 bimodjoul biusers 0 2017-03-15 09:02 /user/hduser/data_check/test_3.parquet/_SUCCESS -rw-r--r-- 3 bimodjoul biusers 494 2017-03-15 09:02 /user/hduser/data_check/test_3.parquet/_common_metadata -rw-r--r-- 3 bimodjoul biusers 862 2017-03-15 09:02 /user/hduser/data_check/test_3.parquet/_metadata -rw-r--r-- 3 bimodjoul biusers 885 2017-03-15 09:02 /user/hduser/data_check/test_3.parquet/part-r-00000-6593ef9d-45c1-49a3-9b23-a783a9075c24.gz.parquet – Shiva Ram Mar 15 '17 at 09:39
  • @ShivaRam did this answer your question, if yes please respond with the solution if you have – Srinathji Kyadari Jun 21 '20 at 13:23

2 Answers

0

Use coalesce(1) before the write; it collects the data into a single partition, so only one part file is produced. It will solve your issue:

df.coalesce(1).write.parquet("/user/hduser/data_check/test.parquet")
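Note that even with coalesce(1) Spark still writes a directory; it just contains one part file plus the side files. If a single literal file is required, one option is to promote the lone part file out of the output directory afterwards. A minimal sketch in plain Python against a local path (promote_single_part is a hypothetical helper, not a Spark API; on HDFS the equivalent move would be done with hadoop fs -mv or an HDFS client library):

```python
import glob
import os
import shutil


def promote_single_part(output_dir, target_file):
    """Move the single part-*.parquet file out of a Spark output
    directory to a standalone file, then remove the directory
    (which only holds _SUCCESS/_metadata side files)."""
    parts = glob.glob(os.path.join(output_dir, "part-*.parquet"))
    if len(parts) != 1:
        raise ValueError("expected exactly one part file, found %d" % len(parts))
    shutil.move(parts[0], target_file)
    shutil.rmtree(output_dir)
    return target_file
```

For pulling the output to the local filesystem instead, hadoop fs -getmerge concatenates the part files into one local file (fine here since there is only one part).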
sande
  • 567
  • 1
  • 10
  • 24
0

This should solve the problem: coalesce to one partition before writing.

df.coalesce(1).write.parquet(parquet_file_path)

To add more data to an existing parquet output later, write in append mode:

df.write.mode('append').parquet("/tmp/output/people.parquet")
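The _SUCCESS and summary-metadata side files can also be suppressed before writing, which leaves only the part file(s) in the output directory. A configuration sketch, assuming an active SparkContext named sc; the exact keys here are as understood for Spark 1.x/2.x with Parquet 1.x, so verify them against your version:

```python
# Suppress the _SUCCESS marker file (Hadoop output committer setting)
sc._jsc.hadoopConfiguration().set(
    "mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
# Suppress the _metadata / _common_metadata summary files
sc._jsc.hadoopConfiguration().set(
    "parquet.enable.summary-metadata", "false")

df.coalesce(1).write.parquet("/user/hduser/data_check/test.parquet")
```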
SRIDHARAN
  • 1,196
  • 1
  • 15
  • 35