
I'm using a Databricks notebook to extract gz-zipped CSV files and load them into DataFrame objects. I'm having trouble with Part 2 below.

Part 1: Loading the zipped files into DataFrames runs fine...

    %python
    # Spark reads gzipped CSVs directly, decompressing .gz transparently
    df1 = spark.read.option("header", True).option("delimiter", "|").csv("dbfs:/model/.../file_1.csv.gz")
    df2 = spark.read.option("header", True).option("delimiter", "|").csv("dbfs:/model/.../file_2.csv.gz")
    

Part 2: Trying to merge the DataFrames...

    %python
    import pandas as pd
    df = pd.concat([df1, df2], ignore_index=True)
    df.show(truncate=False)
    

... returns the following error:

TypeError: cannot concatenate object of type '<class 'pyspark.sql.dataframe.DataFrame'>'; only Series and DataFrame objs are valid

Any suggestions for how to modify the way I'm merging the DataFrames? I will have up to 20 files to merge, all with the same columns.

  • Re-trying now by converting the objects to pandas DataFrames: df1 = df1.select("*").toPandas()... hoping this works, but the objects are large and the cluster is small-ish. TBD soon, I hope. – Isolated Feb 10 '21 at 16:45
  • 2
    concat is supported in pandas but not in spark, you should explore `df1.union(df2)` instead. pandas and pyspark arent the same, `.toPandas()` might work but its not efficient/might not fit to memory when your df size gets big – anky Feb 10 '21 at 16:49
  • 2
    for multiple dfs try reduce `from pyspark.sql import DataFrame` and then `reduce(DataFrame.unionAll, [df1,df2,df3])` – anky Feb 10 '21 at 16:55
  • 1
    related [spark unionAll multiple dataframes](https://stackoverflow.com/questions/37612622/spark-unionall-multiple-dataframes) – anky Feb 10 '21 at 16:57
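
A minimal sketch of the two-DataFrame union suggested in the comments above, assuming both DataFrames share the same schema (note that `union` matches columns by position, not by name):

    %python
    # union() stacks the rows of two DataFrames with the same schema;
    # columns are matched by position, not by name
    merged = df1.union(df2)
    merged.show(truncate=False)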

1 Answer


If the objects are large, I don't think the best approach is to convert from PySpark to pandas, but rather to perform the equivalent of concat in PySpark.

Note that unionAll() has been deprecated since Spark 2.0.0 and replaced with union(): https://sparkbyexamples.com/pyspark/pyspark-union-and-unionall/

I believe you could do:

    from functools import reduce
    from pyspark.sql import DataFrame

    dfs = [df1, df2]                       # extend this list for up to 20 DataFrames
    merged = reduce(DataFrame.union, dfs)  # pairwise union across the whole list

And of course to view it:

    merged.show(truncate=False)  # or display(merged)
  • You were all correct; converting to pandas failed miserably. The solution above works great, thank you. – Isolated Feb 10 '21 at 18:13
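
As a follow-up sketch, not part of the accepted answer: Spark's CSV reader accepts glob patterns in paths, so all ~20 files could also be loaded in a single read, skipping the union step entirely. This assumes the files are pipe-delimited, gzipped, share a schema, and sit in the same directory; the path below is a placeholder mirroring the question's:

    %python
    # read every matching gz-compressed CSV in one pass;
    # Spark combines them into a single DataFrame
    merged = (
        spark.read
        .option("header", True)
        .option("delimiter", "|")
        .csv("dbfs:/model/.../file_*.csv.gz")
    )
    merged.show(truncate=False)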