How to get 1000 records from dataframe and write into a file using PySpark?

Question

I have a DataFrame with more than 100,000 records. I want to create files dynamically and write 1000 records to each file. Can anyone help me solve this? Thanks in advance.

notNull · Accepted Answer · 2020-04-24T16:38:30.627

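You can use the maxRecordsPerFile option when writing the DataFrame.

The limit is applied to each partition separately, so a partition whose row count is not a multiple of 1000 ends with a smaller file. If every file across the whole DataFrame should hold 1000 records, first bring the data into a single partition with repartition(1) or coalesce(1).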

Example:

# coalesce(1) merges the existing partitions into one without a full shuffle, then 1000 records are written per file
df.coalesce(1).write.option("maxRecordsPerFile", 1000).mode("overwrite").parquet(<path>)

# repartition(1) shuffles all rows into one partition; 100,000 records produce 100 files of 1000 records
df.repartition(1).write.option("maxRecordsPerFile", 1000).mode("overwrite").parquet(<path>)

# or set the limit once on the Spark session
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000)
# or
spark.sql("set spark.sql.files.maxRecordsPerFile=1000").show()

df.coalesce(1).write.mode("overwrite").parquet(<path>)
df.repartition(1).write.mode("overwrite").parquet(<path>)
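
To see how the per-file limit interacts with partitions, here is a minimal pure-Python sketch (no Spark needed). It assumes the writer starts a new file each time a partition's current file reaches the limit; `split_into_files` is a hypothetical helper for illustration, not a Spark API.

```python
def split_into_files(partition_sizes, max_records):
    """Return the record count of every output file, partition by partition."""
    file_sizes = []
    for size in partition_sizes:
        full_files, remainder = divmod(size, max_records)
        # Each partition fills as many complete files as it can ...
        file_sizes.extend([max_records] * full_files)
        # ... and puts whatever is left into one smaller file.
        if remainder:
            file_sizes.append(remainder)
    return file_sizes


# One partition of 100,000 rows: every file is full.
single = split_into_files([100_000], 1000)
# Two uneven partitions: each one ends with a short file.
uneven = split_into_files([2500, 1700], 1000)

print(len(single), set(single))
print(uneven)
```

With a single partition all 100 files are full, while the two uneven partitions each leave a short file behind, which is why the data is collapsed to one partition before writing.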

Method-2:

Calculate the number of partitions, then repartition the DataFrame:

df = spark.range(10000)

import math
from pyspark.sql.functions import col, count, spark_partition_id

# calculate the number of partitions; repartition() needs an int, so round up
no_partitions = math.ceil(df.count() / 1000)

# repartition and check the number of records in each partition
df.repartition(no_partitions).\
withColumn("partition_id",spark_partition_id()).\
groupBy(col("partition_id")).\
agg(count("*")).\
show()

#+------------+--------+
#|partition_id|count(1)|
#+------------+--------+
#|           1|    1001|
#|           6|    1000|
#|           3|     999|
#|           5|    1000|
#|           9|    1000|
#|           4|     999|
#|           8|    1000|
#|           7|    1000|
#|           2|    1001|
#|           0|    1000|
#+------------+--------+

df.repartition(no_partitions).write.mode("overwrite").parquet(<path>)
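
The partition count has to be a whole number, and rounding up keeps the average partition at or below the target size. Below is a small sketch of that arithmetic in plain Python. `partitions_for` and `even_split` are hypothetical helper names, and the even split is only an approximation of what repartition produces (the counts above range from 999 to 1001).

```python
def partitions_for(total_records, records_per_file):
    """Ceiling division, with a floor of one partition."""
    if records_per_file <= 0:
        raise ValueError("records_per_file must be positive")
    return max(1, (total_records + records_per_file - 1) // records_per_file)


def even_split(total_records, n_partitions):
    """Idealised partition sizes if rows were spread perfectly evenly."""
    base, extra = divmod(total_records, n_partitions)
    return [base + 1] * extra + [base] * (n_partitions - extra)


n = partitions_for(100_500, 1000)   # not a multiple of 1000
sizes = even_split(100_500, n)

print(partitions_for(10_000, 1000))
print(n, max(sizes), min(sizes))
```

When the row count is not a multiple of 1000, rounding up yields more, slightly smaller files rather than a few files above the target.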

@tsuresh97, by default we can't control file names in Spark; we can only specify the **directory** name. To change the file names you need to use the hadoop.fs API, see https://stackoverflow.com/questions/41990086/specifying-the-filename-when-saving-a-dataframe-as-a-csv — notNull, Apr 24 '20 at 18:16
I need to write the records of the dataframe to a JSON file. When I write the dataframe to the file, it is stored like {"a":1} {"b":2}, but I want it written like [{"a":1} ,{"b":2}]. Can you please help, @Shu? Thanks in advance. — tsuresh97, Apr 25 '20 at 10:56
@tsuresh97, please check my answer https://stackoverflow.com/a/61425542/7632695 — notNull, Apr 25 '20 at 12:21

score 1 · Answer 2 · answered Apr 24 '20 at 16:06

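First, create a row number column (row_number() starts at 1).

from pyspark.sql import Window, functions as F
df = df.withColumn('row_num', F.row_number().over(Window.orderBy('any_column')))

Now loop over the row numbers in steps of 1000 and save each slice.

total = df.count()
for i in range(1, total + 1, 1000):
   # between() is inclusive on both ends, so this selects rows i .. i+999
   records = df.where(F.col("row_num").between(i, i + 999))
   records.toPandas().to_csv("file-{}.csv".format(i))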
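
Because row_number() numbers rows from 1 and between() includes both bounds, the slice boundaries are easy to get off by one. Here is a minimal sketch of the boundary arithmetic in plain Python, with `row_number_ranges` as a hypothetical helper name:

```python
def row_number_ranges(total_rows, chunk_size):
    """Yield inclusive (first, last) row numbers for each chunk, starting at 1."""
    for first in range(1, total_rows + 1, chunk_size):
        # Clamp the last chunk so it never runs past the final row.
        yield first, min(first + chunk_size - 1, total_rows)


ranges = list(row_number_ranges(2500, 1000))
print(ranges)
```

Each pair can be passed straight to between(first, last); the chunks cover every row exactly once, and only the last one may be shorter.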

Prateek Jain

I need to write the records of the dataframe to a JSON file. When I write the dataframe to the file, it is stored like {"a":1} {"b":2}, but I want it written like [{"a":1} ,{"b":2}]. Can you please help me? Thanks in advance. @Prateek Jain – tsuresh97 Apr 25 '20 at 10:56
With pandas we can also store the data in JSON format, which is what you need: just change .to_csv to .to_json. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html for details. – Prateek Jain Apr 25 '20 at 14:05
Thanks @Prateek Jain – tsuresh97 Apr 25 '20 at 17:12
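
The two layouts discussed in these comments are one JSON object per line versus a single JSON array. A minimal sketch of the difference using only the standard-library json module (not Spark or pandas); the sample records are the ones from the comment:

```python
import json

records = [{"a": 1}, {"b": 2}]

# One object per line, the layout the comment describes getting from Spark.
json_lines = "\n".join(json.dumps(r) for r in records)

# A single array holding all objects, the layout asked for.
json_array = json.dumps(records)

print(json_lines)
print(json_array)
```

In pandas, the array layout corresponds to calling to_json with orient="records".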
