
I have the following Spark DataFrames. One is derived from a text file while the other is derived from a Spark table in Databricks:

Despite the data being exactly the same, the following code reports differences. I expect df3 to be empty:

table_df = spark.sql("select * from db.table1")
file_df = spark.read.format("csv").load("my_file.txt", header = False, delimiter = '|')
file_df = file_df.toPandas()
table_df = table_df.toPandas()
df3 = table_df.eq(file_df)
print(df3.shape[0])
  • Do I need to order the data before comparison? If so, how do I do that?
  • I can't see where a join is done in the above. How will it match rows? [ID] and [Account] are the primary keys.
  • Is the above the best way to compare 2 dataframes?
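On the first bullet: pandas `DataFrame.eq` compares positionally (by index), so row order matters; either sort both frames on the keys first, or use a key-based `merge` with `indicator=True`, which avoids ordering altogether. A minimal sketch with made-up data (the column names `ID`, `Account`, and `val` are stand-ins for the question's columns):

```python
import pandas as pd

# Two frames with the same rows in different order; 'ID' and 'Account'
# stand in for the question's primary-key columns.
table_df = pd.DataFrame({'ID': [1, 2], 'Account': ['a', 'b'], 'val': [10, 20]})
file_df  = pd.DataFrame({'ID': [2, 1], 'Account': ['b', 'a'], 'val': [20, 10]})

# Option 1: sort on the keys, reset the index, then compare cell by cell.
sorted_table = table_df.sort_values(['ID', 'Account']).reset_index(drop=True)
sorted_file  = file_df.sort_values(['ID', 'Account']).reset_index(drop=True)
print(sorted_table.equals(sorted_file))  # True

# Option 2: an outer merge with indicator=True labels each row 'both',
# 'left_only', or 'right_only'; non-'both' rows are the differences.
diff = table_df.merge(file_df, how='outer', indicator=True)
print(diff[diff['_merge'] != 'both'].empty)  # True
```

Note that `equals` also requires matching dtypes on both sides, which is exactly the pitfall discussed in the comments below the answer.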

Here is the data - where [ID] and [Account] are primary keys


ibexy

1 Answer


What I usually do to compare two dataframes is an inner join to check whether the counts match, or `subtract` to see if there is any difference between the two:

df1 = spark.createDataFrame(
    data=[
        ('Avery Bradley', 25.0, 7730337.0),
        ('Jae Crowder', 25.0, 6796117.0),
    ],
    schema=['name', 'age', 'salary']
)
df1.show()
+-------------+----+---------+
|         name| age|   salary|
+-------------+----+---------+
|Avery Bradley|25.0|7730337.0| <<< 25
|  Jae Crowder|25.0|6796117.0|
+-------------+----+---------+

df2 = spark.createDataFrame(
    data=[
        ('Avery Bradley', 24.0, 7730337.0),
        ('Jae Crowder', 25.0, 6796117.0),
    ],
    schema=['name', 'age', 'salary']
)
df2.show()
+-------------+----+---------+
|         name| age|   salary|
+-------------+----+---------+
|Avery Bradley|24.0|7730337.0| <<< 24
|  Jae Crowder|25.0|6796117.0|
+-------------+----+---------+

# Solution #1
df1.subtract(df2).show()
+-------------+----+---------+
|         name| age|   salary|
+-------------+----+---------+
|Avery Bradley|25.0|7730337.0|
+-------------+----+---------+

# Solution #2
df1.join(df2, on=df1.columns).count()
# 1 <<< while df1.count() = 2
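One caveat with Solution #1: `subtract` is one-directional, so rows that exist only in `df2` go unreported; running it in both directions gives the full symmetric difference. What the two-way subtract computes can be illustrated with plain Python sets of row tuples (data copied from the example above; sets mirror `subtract`'s distinct semantics):

```python
# Rows from the example dataframes, as tuples.
rows1 = {('Avery Bradley', 25.0, 7730337.0), ('Jae Crowder', 25.0, 6796117.0)}
rows2 = {('Avery Bradley', 24.0, 7730337.0), ('Jae Crowder', 25.0, 6796117.0)}

only_in_df1 = rows1 - rows2       # what df1.subtract(df2) returns
only_in_df2 = rows2 - rows1       # what df2.subtract(df1) returns
diff = only_in_df1 | only_in_df2  # full symmetric difference

print(only_in_df1)  # {('Avery Bradley', 25.0, 7730337.0)}
print(only_in_df2)  # {('Avery Bradley', 24.0, 7730337.0)}
```

In Spark that would be `df1.subtract(df2).union(df2.subtract(df1))`; the comparison is empty-safe only when both directions are checked.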
pltc
  • Thanks for your prompt response. Do you have sample code please? – ibexy May 25 '21 at 18:28
  • I used solution #1. Now I see that the following line in my code converts the ID into decimal and that doesn't match with the whole number coming for the same column in the table: file_df = spark.read.format("csv").load("my_file.txt", header = False, delimiter = '|') – ibexy May 25 '21 at 18:38
  • You can always force the schema when reading from CSV – pltc May 25 '21 at 19:11
  • You can check [this](https://stackoverflow.com/a/51762911/3441510) answer – pltc May 25 '21 at 19:54
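Following up on the schema comment: in Spark you would pass an explicit schema via `.schema(...)` on the reader instead of letting it infer types. Since the question converts both frames to pandas anyway, the analogous pandas fix is the `dtype=` argument of `read_csv`. A sketch with inline pipe-delimited data (column names are assumptions, not from the original file):

```python
import io
import pandas as pd

csv_text = "1|ACC01|100\n2|ACC02|200\n"

# Without dtype, pandas infers a numeric type for ID; in a real file a single
# blank cell would push it to float64 (1 -> 1.0), breaking the comparison
# against the table side.
inferred = pd.read_csv(io.StringIO(csv_text), sep='|', header=None,
                       names=['ID', 'Account', 'Amount'])

# Forcing the dtype keeps ID as a string on both sides of the comparison.
forced = pd.read_csv(io.StringIO(csv_text), sep='|', header=None,
                     names=['ID', 'Account', 'Amount'],
                     dtype={'ID': str})

print(forced['ID'].tolist())  # ['1', '2']
```

Whichever reader is used, the key point is the same: pin the types of the key columns on both sides before comparing.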