I have the following Spark DataFrames. One is derived from a text file, while the other is derived from a Spark table in Databricks.
Although the data is exactly the same, the following code reports differences. I expect df3 to be empty:
table_df = spark.sql("select * from db.table1")
file_df = spark.read.format("csv").load("my_file.txt", header=False, delimiter='|')
file_df = file_df.toPandas()
table_df = table_df.toPandas()
df3 = table_df.eq(file_df)
print(df3.shape[0])
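For context, here is a small pandas-only sketch of what `DataFrame.eq` returns. The toy frames and the `Amount` column are made up for illustration and are not the real data. It only shows that the result is a Boolean frame with one row per input row, so `shape[0]` is a row count rather than a count of differences.

```python
import pandas as pd

# Hypothetical toy data: the same two rows, but in a different order.
table_pdf = pd.DataFrame({"ID": [1, 2], "Account": ["A", "B"], "Amount": [10, 20]})
file_pdf = pd.DataFrame({"ID": [2, 1], "Account": ["B", "A"], "Amount": [20, 10]})

# eq() compares cell by cell after aligning on index and column labels.
# It does not join on key columns, so row order matters.
mask = table_pdf.eq(file_pdf)

# The mask has the same number of rows as the inputs, whatever the result.
rows_in_mask = mask.shape[0]

# Number of rows in which at least one cell differs.
rows_with_diff = int((~mask.all(axis=1)).sum())

print(rows_in_mask, rows_with_diff)
```

Note also that when Spark reads a CSV without a header, it names the columns `_c0`, `_c1`, and so on, and reads every column as a string unless a schema is supplied or `inferSchema` is enabled, so the labels and types of the two frames probably do not line up either.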
- Do I need to order the data before comparison?
- If so, how do I do that?
- I can't see where a join is done in the above. How will it match rows, given that [ID] and [Account] are the primary keys?
- Is the above the best way to compare two DataFrames?
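To make the join question concrete, below is a minimal sketch of one possible order-independent comparison in pandas. It assumes both frames already share the same column names and dtypes (so the file frame would first need its columns renamed and cast), and that `ID` and `Account` together identify a row. The helper name `diff_by_key`, the `Amount` column, and the toy data are hypothetical.

```python
import pandas as pd

KEYS = ["ID", "Account"]  # assumed composite key, as stated in the question


def diff_by_key(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Return the rows that appear in only one of the two frames."""
    # With no `on` argument, merge() joins on all shared columns, so a row
    # only matches if every value is equal. indicator=True adds a `_merge`
    # column holding 'both', 'left_only' or 'right_only'.
    merged = left.merge(right, how="outer", indicator=True)
    only_one_side = merged[merged["_merge"] != "both"]
    return only_one_side.sort_values(KEYS).reset_index(drop=True)


# Hypothetical toy data: same keys, shuffled, with one changed Amount.
table_pdf = pd.DataFrame({"ID": [1, 2, 3], "Account": ["A", "B", "C"], "Amount": [10, 20, 30]})
file_pdf = pd.DataFrame({"ID": [3, 1, 2], "Account": ["C", "A", "B"], "Amount": [30, 10, 25]})

diff = diff_by_key(table_pdf, file_pdf)

# Shuffling the rows of an otherwise identical frame yields no differences.
same = diff_by_key(table_pdf, table_pdf.sample(frac=1, random_state=0))

print(diff)
print(same.empty)
```

Because the merge matches on values rather than on position, no prior sorting is needed here. On the Spark side, `DataFrame.exceptAll` gives a similar row-level difference without converting to pandas.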
Here is the data, where [ID] and [Account] are the primary keys