6

How to write data in the dataframe into single .parquet file(both data & metadata in single file) in HDFS?

df.show() --> 2 rows
+------+--------------+----------------+
| name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+

df.rdd.getNumPartitions() shows it has 1 partition:

>>> df.rdd.getNumPartitions()

1

df.write.save("/user/hduser/data_check/test.parquet", format="parquet")

If I use the above command to create a parquet file in HDFS, it creates a directory "test.parquet" in HDFS, and inside that directory multiple files get saved: a .parquet data file plus metadata files.

Found 4 items

-rw-r--r-- 3 bimodjoul biusers 0 2017-03-15 06:47 
/user/hduser/data_check/test.parquet/_SUCCESS 
-rw-r--r-- 3 bimodjoul biusers 494 2017-03-15 06:47
/user/hduser/data_check/test.parquet/_common_metadata
-rw-r--r-- 3 bimodjoul biusers 862 2017-03-15 06:47
/user/hduser/data_check/test.parquet/_metadata 
-rw-r--r-- 3 bimodjoul biusers 885 2017-03-15 06:47
/user/hduser/data_check/test.parquet/part-r-00000-f83a2ffd-38bb-4c76-9f4c-357e43d9708b.gz.parquet

How to write the data in the dataframe into a single .parquet file (both data & metadata in a single file) in HDFS, rather than a folder with multiple files?

Help would be much appreciated.

Indrajit Swain
  • 1,505
  • 1
  • 15
  • 22
Shiva Ram
  • 61
  • 1
  • 4
  • 1
    use coalesce(1) to get single file – Ashish Singh Mar 15 '17 at 07:44
  • why do you need one file? if you need it just to move it along then use the .gz.parquet file as it should have everything you need. The other files are generated in the process for various things. – Assaf Mendelson Mar 15 '17 at 07:57
  • Hi @Ashish Singh, I have tried below two commands, df.coalesce(1).write.save("/user/hduser/data_check/test_3.parquet", format="parquet"); df.coalesce(1).write.parquet("/user/hduser/data_check/test_4.parquet"); These commands are also saving or writing as directory with parquet data file and metadata files. – Shiva Ram Mar 15 '17 at 09:32
  • Like this: hadoop fs -ls /user/hduser/data_check/test_3.parquet Found 4 items -rw-r--r-- 3 bimodjoul biusers 0 2017-03-15 09:02 /user/hduser/data_check/test_3.parquet/_SUCCESS -rw-r--r-- 3 bimodjoul biusers 494 2017-03-15 09:02 /user/hduser/data_check/test_3.parquet/_common_metadata -rw-r--r-- 3 bimodjoul biusers 862 2017-03-15 09:02 /user/hduser/data_check/test_3.parquet/_metadata -rw-r--r-- 3 bimodjoul biusers 885 2017-03-15 09:02 /user/hduser/data_check/test_3.parquet/part-r-00000-6593ef9d-45c1-49a3-9b23-a783a9075c24.gz.parquet – Shiva Ram Mar 15 '17 at 09:39
  • @ShivaRam did this answer your question, if yes please respond with the solution if you have – Srinathji Kyadari Jun 21 '20 at 13:23

2 Answers

0

Use coalesce(1) before the write; it collects the data into a single partition, so only one part file is produced. It will solve your issue:

df.coalesce(1).write.parquet("/user/hduser/data_check/test.parquet")
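Note that even with coalesce(1) Spark still writes a directory; it just contains one part file plus the side files. If a single literal file is required, one option is to promote the lone part file out of the output directory afterwards. A minimal sketch in plain Python against a local path (promote_single_part is a hypothetical helper, not a Spark API; on HDFS the equivalent move would be done with hadoop fs -mv or an HDFS client library):

```python
import glob
import os
import shutil


def promote_single_part(output_dir, target_file):
    """Move the single part-*.parquet file out of a Spark output
    directory to a standalone file, then remove the directory
    (which only holds _SUCCESS/_metadata side files)."""
    parts = glob.glob(os.path.join(output_dir, "part-*.parquet"))
    if len(parts) != 1:
        raise ValueError("expected exactly one part file, found %d" % len(parts))
    shutil.move(parts[0], target_file)
    shutil.rmtree(output_dir)
    return target_file
```

For pulling the output to the local filesystem instead, hadoop fs -getmerge concatenates the part files into one local file (fine here since there is only one part).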
sande
  • 567
  • 1
  • 10
  • 24
0

This should solve the problem: coalesce to one partition before writing.

df.coalesce(1).write.parquet(parquet_file_path)

To add more data to an existing parquet output later, write in append mode:

df.write.mode('append').parquet("/tmp/output/people.parquet")
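The _SUCCESS and summary-metadata side files can also be suppressed before writing, which leaves only the part file(s) in the output directory. A configuration sketch, assuming an active SparkContext named sc; the exact keys here are as understood for Spark 1.x/2.x with Parquet 1.x, so verify them against your version:

```python
# Suppress the _SUCCESS marker file (Hadoop output committer setting)
sc._jsc.hadoopConfiguration().set(
    "mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
# Suppress the _metadata / _common_metadata summary files
sc._jsc.hadoopConfiguration().set(
    "parquet.enable.summary-metadata", "false")

df.coalesce(1).write.parquet("/user/hduser/data_check/test.parquet")
```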
SRIDHARAN
  • 1,196
  • 1
  • 15
  • 35