
While using `partitionBy()` in PySpark, what approach should I follow to write the CSV files into one single folder rather than multiple folders? Any suggested solution?

Code

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark import SparkConf

appName = "PySpark Teradata Example"
master = "local"
conf = SparkConf()  # create the configuration
conf.set("spark.repl.local.jars", "terajdbc4.jar")
conf.set("spark.executor.extraClassPath", "terajdbc4.jar")
conf.set("spark.driver.extraClassPath", "terajdbc4.jar")

spark = SparkSession.builder \
    .config(conf=conf) \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# input table name
table = "my_table_1"

df = spark.read \
        .format('jdbc') \
        .option('url', 'jdbc:teradata://xxx.xxx.xx.xx') \
        .option('user', 'dbc') \
        .option('password', 'dbc') \
        .option('driver', 'com.teradata.jdbc.TeraDriver') \
        .option('STRICT_NAMES', 'OFF') \
        .option('query',"Select eno, CAST(edata.asJSONText() AS VARCHAR(32000)) as edata from AdventureWorksDW."+table)\
        .load()

df.show()
df = df.withColumn("id_tmp", F.col(df.columns[0]) % 4).orderBy("id_tmp")
df.coalesce(4) \
    .write \
    .option("header", True) \
    .option("sep", "|") \
    .mode("overwrite") \
    .partitionBy("id_tmp") \
    .format("csv") \
    .save("C:\\Data\\" + table + "\\")

It gives multiple folders with multiple CSV files as output. How can I get everything into a single folder? Also, how can I change the name of the file while writing it to the folder?

starlord
  • Why do you `partitionBy` if you want one folder? – blackbishop Jan 31 '22 at 09:25
  • @blackbishop I have updated the code to use `partitionBy`; however, the problem remains the same: multiple folders are created, each with multiple CSV files. How do I keep everything in one single folder? For example, my Account directory has 5 subfolders, with a part file and a CSV file created in each subfolder. Any idea on the best approach to write it to a single folder? – starlord Jan 31 '22 at 10:01
  • When you use `.partitionBy("id_tmp")`, it's normal to get multiple subfolders: each one corresponds to a partition value of the `id_tmp` column. So again, if you want only one folder, do not use `partitionBy` while writing. – blackbishop Jan 31 '22 at 10:03
  • @blackbishop I am using the coalesce and repartition functions, but they seem ineffective. I have updated the code, but the problem persists: without the "id_tmp" column I am not able to partition the file. – starlord Feb 01 '22 at 04:32

1 Answer


`df = df.repartition(1)` will reset the number of partitions to 1, but as Kafels mentioned, it is better to use coalesce: `df = df.coalesce(1)`. Either way, drop the `partitionBy("id_tmp")` call, since that is what creates one subfolder per value.
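Applied to the code from the question, a minimal sketch could look like this (it reuses the `df` and `table` variables from the question, drops `partitionBy`, and coalesces to one partition; Spark still chooses a `part-*.csv` name for the file inside the folder):

# Collapse to a single partition and write without partitionBy,
# so Spark emits one folder containing a single part-*.csv file.
df.coalesce(1) \
    .write \
    .option("header", True) \
    .option("sep", "|") \
    .mode("overwrite") \
    .format("csv") \
    .save("C:\\Data\\" + table + "\\")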

More info:
https://stackoverflow.com/a/31675351
https://stackoverflow.com/a/40983145

Source:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.repartition.html
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.coalesce.html
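As for renaming: Spark does not let you pick the output file name; it always writes `part-*` files. A common workaround, sketched below under the assumption that the output lands on the local filesystem (as the `C:\Data` path in the question suggests), is to rename the single part file after the write:

import glob
import os

# Illustrative paths: out_dir and table match the question's save() call.
out_dir = "C:\\Data\\" + table + "\\"
part_file = glob.glob(os.path.join(out_dir, "part-*.csv"))[0]
os.rename(part_file, os.path.join(out_dir, table + ".csv"))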

AutomatedChaos
  • `repartition` will trigger a shuffle operation; use `coalesce` instead – Kafels Jan 31 '22 at 10:17
  • @Kafels good one! I found an in-depth explanation [here](https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce) – AutomatedChaos Jan 31 '22 at 10:21
  • @Kafels I have updated the code using coalesce but I am not getting the desired output. Without "id_tmp" I am not able to partition the file. – starlord Feb 01 '22 at 04:35
  • @AutomatedChaos I have updated the code using coalesce but I am not getting the desired output. Without "id_tmp" I am not able to partition the file. – starlord Feb 01 '22 at 04:37