-1

I have data stored in Parquet format, and I want to generate delimited text files from Spark with a limit of 100 rows per file. Is this possible to handle from a Spark notebook? I am building an ADF pipeline that triggers this notebook, and the expected output is a text file in a format like the one below. Please suggest possible approaches.

5431732167 899 1011381 1 teststring
5431732163 899 912 teststring
5431932119 899 108808 40 teststring
5432032116 899 1082223 40 teststring

I also need to process batches of these text files and load them into a database; please suggest options for doing this as well.

Thanks in advance.

Thanks, Manoj.

Manoj
  • 61
  • 6
  • Does this answer your question? [How to get 1000 records from dataframe and write into a file using PySpark?](https://stackoverflow.com/questions/61412292/how-to-get-1000-records-from-dataframe-and-write-into-a-file-using-pyspark) – Douglas M May 24 '20 at 22:14

2 Answers

0

This question appears to be a functional duplicate of: How to get 1000 records from dataframe and write into a file using PySpark?

Before running your job to write your CSV files, set `maxRecordsPerFile`; in Spark SQL:

set spark.sql.files.maxRecordsPerFile = 100
Douglas M
  • 1,035
  • 8
  • 17
0

You should be able to use `maxRecordsPerFile` with the CSV output. Note that this does not guarantee each file holds exactly 100 records, only that no file will have more than 100 rows. Spark writes in parallel, so an even split cannot be ensured across tasks.

(df.write
    .option("maxRecordsPerFile", 100)
    .csv(outputPath))
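To see what the cap guarantees, here is the same row-limit logic sketched in plain Python (illustrative only; Spark applies this per write task, not in the driver):

```python
def chunk_rows(rows, max_records_per_file=100):
    """Split rows into file-sized chunks, mirroring maxRecordsPerFile:
    every chunk except possibly the last holds exactly the cap."""
    return [rows[i:i + max_records_per_file]
            for i in range(0, len(rows), max_records_per_file)]

# 250 rows with a 100-row cap -> 3 files of 100, 100, and 50 rows
files = chunk_rows(list(range(250)), 100)
```

Each Spark task does this to its own partition independently, which is why the last file of every partition can fall short of 100 rows.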

If your data is very small, you can coalesce it to 1 partition, which ensures at most one file has fewer than 100 rows, but you then lose the parallel-processing speed advantage (most of your cluster sits idle during the final computation and the write).

For writing to databases, the solution depends on the particular database. One option many databases support is JDBC; Spark can read and write data with it, see: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

adamt06
  • 71
  • 4