
I'm using a Databricks notebook to extract gz-zipped CSV files and load them into DataFrame objects. I'm having trouble with Part 2 below.

Part 1: Loading the zipped files into DataFrames runs fine...

    %python
    # Spark reads gzipped CSVs directly, decompressing .gz transparently
    df1 = spark.read.option("header", True).option("delimiter", "|").csv("dbfs:/model/.../file_1.csv.gz")
    df2 = spark.read.option("header", True).option("delimiter", "|").csv("dbfs:/model/.../file_2.csv.gz")
    

Part 2: Trying to merge the DataFrames...

    %python
    import pandas as pd
    df = pd.concat([df1, df2], ignore_index=True)
    df.show(truncate=False)
    

... returns the following error:

TypeError: cannot concatenate object of type '<class 'pyspark.sql.dataframe.DataFrame'>'; only Series and DataFrame objs are valid

Any suggestions for how to modify the way I'm merging the DataFrames? I will have up to 20 files to merge, all with the same columns.

  • Re-trying now by converting the objects to pandas DataFrames: df1 = df1.select("*").toPandas()... hoping this works, but the objects are large and the cluster is small-ish. TBD soon, I hope. – Isolated Feb 10 '21 at 16:45
  • 2
    concat is supported in pandas but not in spark, you should explore `df1.union(df2)` instead. pandas and pyspark arent the same, `.toPandas()` might work but its not efficient/might not fit to memory when your df size gets big – anky Feb 10 '21 at 16:49
  • 2
    for multiple dfs try reduce `from pyspark.sql import DataFrame` and then `reduce(DataFrame.unionAll, [df1,df2,df3])` – anky Feb 10 '21 at 16:55
  • 1
    related [spark unionAll multiple dataframes](https://stackoverflow.com/questions/37612622/spark-unionall-multiple-dataframes) – anky Feb 10 '21 at 16:57
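
A minimal sketch of the two-DataFrame union suggested in the comments above, assuming both DataFrames share the same schema (note that `union` matches columns by position, not by name):

    %python
    # union() stacks the rows of two DataFrames with the same schema;
    # columns are matched by position, not by name
    merged = df1.union(df2)
    merged.show(truncate=False)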

1 Answer


If the objects are large, I don't think the best approach is to convert from PySpark to pandas, but rather to perform the equivalent of concat in PySpark.

Note that unionAll() has been deprecated since Spark 2.0.0 and replaced with union(): https://sparkbyexamples.com/pyspark/pyspark-union-and-unionall/

I believe you could do:

    from functools import reduce
    from pyspark.sql import DataFrame

    dfs = [df1, df2]                       # extend this list for up to 20 DataFrames
    merged = reduce(DataFrame.union, dfs)  # pairwise union across the whole list

And of course to view it:

    merged.show(truncate=False)  # or display(merged)
  • You were all correct; converting to pandas failed miserably. The solution above works great, thank you. – Isolated Feb 10 '21 at 18:13
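
As a follow-up sketch, not part of the accepted answer: Spark's CSV reader accepts glob patterns in paths, so all ~20 files could also be loaded in a single read, skipping the union step entirely. This assumes the files are pipe-delimited, gzipped, share a schema, and sit in the same directory; the path below is a placeholder mirroring the question's:

    %python
    # read every matching gz-compressed CSV in one pass;
    # Spark combines them into a single DataFrame
    merged = (
        spark.read
        .option("header", True)
        .option("delimiter", "|")
        .csv("dbfs:/model/.../file_*.csv.gz")
    )
    merged.show(truncate=False)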