7

I am trying to compare two pandas DataFrames, but I get the error 'DataFrame' object has no attribute 'withColumn'. What could be the issue?

import pandas as pd
import pyspark.sql.functions as F

pd_df=pd.DataFrame(df.dtypes,columns=['column','data_type'])
pd_df1=pd.DataFrame(df1.dtypes,columns=['column','data_type'])

pd.merge(pd_df,pd_df1, on='column', how='outer'
    ).withColumn(
    "result",
    F.when(F.col("data_type_x") == 'NaN','new attribute'.otherwise('old attribute')))
    .select(
    "column",
    "data_type_x",
    "data_type_y",
    "result"
    )

Here, df and df1 are existing DataFrames.

Vadim Kotov
jakrm

3 Answers

4

You mixed up pandas DataFrames and Spark DataFrames.

pd.merge returns a pandas DataFrame, and pandas DataFrames do not have the withColumn method. withColumn, F.when and F.col belong to the Spark DataFrame API.
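A pandas-only version of what the question seems to be after might look like the sketch below. It assumes the goal is to label rows with a missing data_type_x as 'new attribute' and everything else as 'old attribute'. The sample contents of pd_df and pd_df1 are made up for illustration, and numpy.where is used here as a stand-in for Spark's when/otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the (column, data_type) pairs
pd_df = pd.DataFrame([("id", "int"), ("name", "string")],
                     columns=["column", "data_type"])
pd_df1 = pd.DataFrame([("id", "int"), ("name", "string"), ("age", "int")],
                      columns=["column", "data_type"])

# Same outer merge as in the question; pandas adds the _x / _y suffixes
merged = pd.merge(pd_df, pd_df1, on="column", how="outer")

# np.where plays the role of F.when(...).otherwise(...).
# isna() is used because comparing against the string 'NaN' never matches
# a missing value.
merged["result"] = np.where(merged["data_type_x"].isna(),
                            "new attribute", "old attribute")

# Plain column selection replaces Spark's .select(...)
result = merged[["column", "data_type_x", "data_type_y", "result"]]
print(result)
```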

Ani Menon
1

I figured it out. Thanks for the help.

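# Outer merge of the two dtype tables from the question
pd_merge = pd.merge(pd_df, pd_df1, on='column', how='outer')

def res(row):
    # Classify one row of the merged frame by comparing both data types
    if row['data_type_x'] == row['data_type_y']:
        return 'no change'
    elif pd.isnull(row['data_type_x']):
        return 'new attribute'       # column is missing from df
    elif pd.isnull(row['data_type_y']):
        return 'deleted attribute'   # column is missing from df1
    else:
        return 'datatype change'     # both present, but the types differ

# axis=1 passes each row to res as a Series
pd_merge['result'] = pd_merge.apply(res, axis=1)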
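The same classification can probably be done without a row-wise apply. The following is only a sketch using numpy.select, which is not what the answer above uses; the contents of pd_merge here are hypothetical and only mimic the shape of the outer merge.

```python
import numpy as np
import pandas as pd

# Hypothetical merged frame, shaped like the outer merge in the question
pd_merge = pd.DataFrame({
    "column": ["id", "name", "age", "zip"],
    "data_type_x": ["int", "string", None, "int"],
    "data_type_y": ["int", None, "int", "string"],
})

x, y = pd_merge["data_type_x"], pd_merge["data_type_y"]

# Conditions are evaluated in order; the first one that matches wins
conditions = [x.isna(), y.isna(), x == y]
choices = ["new attribute", "deleted attribute", "no change"]

# Rows matching none of the conditions have two different, non-null types
pd_merge["result"] = np.select(conditions, choices, default="datatype change")
print(pd_merge)
```

On large frames a vectorized version like this usually runs faster than apply with axis=1, at the cost of being slightly less readable.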
jakrm
0

This happens because you are creating pandas DataFrames, not Spark DataFrames. To join pandas DataFrames, you would use:

DataFrame_output = DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
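As a rough illustration of that signature, the sketch below joins two small dtype tables. The sample data is hypothetical, and the "_x" / "_y" suffixes are chosen only to mirror what pd.merge produces in the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the (column, data_type) pairs
pd_df = pd.DataFrame([("id", "int"), ("name", "string")],
                     columns=["column", "data_type"])
pd_df1 = pd.DataFrame([("id", "int"), ("name", "string"), ("age", "int")],
                      columns=["column", "data_type"])

# join matches rows on the index, so move the key column into the index first
left = pd_df.set_index("column")
right = pd_df1.set_index("column")

# Both frames have a data_type column, so suffixes must be given explicitly;
# without them join raises a ValueError about overlapping columns
joined = left.join(right, how="outer", lsuffix="_x", rsuffix="_y").reset_index()
print(joined)
```

For a join on a regular column rather than the index, pd.merge with on='column', as in the question, is often the more direct choice.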

Run this to check which kind of DataFrame you have:

type(df)
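A slightly fuller check, as a sketch with a made-up one-row frame, could confirm both the type and the missing method:

```python
import pandas as pd

# Hypothetical one-row frame, only used to demonstrate the checks
pd_df = pd.DataFrame({"column": ["id"], "data_type": ["int"]})

print(type(pd_df))

# True for pandas DataFrames
is_pandas = isinstance(pd_df, pd.DataFrame)

# pandas has no withColumn method, which is what triggers the AttributeError
has_with_column = hasattr(pd_df, "withColumn")

print(is_pandas, has_with_column)
```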

To use withColumn, you need Spark DataFrames. To convert a pandas DataFrame to a Spark DataFrame, use this:

import pyspark
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName('pandasToSparkDF').getOrCreate()
df = spark.createDataFrame(pd_df1)
Rob
  • Thanks for your answer. Is it possible to add a new column (with above logic) to Pandas Dataframe without converting to Spark DataFrame? – jakrm Jul 11 '19 at 12:35
  • Yes, I would look at this https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html and https://stackoverflow.com/questions/26886653/pandas-create-new-column-based-on-values-from-other-columns – Rob Jul 11 '19 at 13:26