I have two dataframes, and when I union them I get fewer rows than expected.
col_names = ["city", "name"]
print(df1.select(col_names).count()) # 50
print(df2.select(col_names).count()) # 60
df1.select(col_names).union(df2.select(col_names)).count() # 105
df1.select(col_names).unionByName(df2.select(col_names)).count() # 105
But if I recreate both dataframes, the counts are correct:
col_names = ["city", "name"]
df1_new = spark.createDataFrame(df1.select(col_names).head(50))
df2_new = spark.createDataFrame(df2.select(col_names).head(60))
print(df1_new.select(col_names).count()) # 50
print(df2_new.select(col_names).count()) # 60
df1_new.union(df2_new).count() # 110
df1_new.unionByName(df2_new).count() # 110
I also tried pandas with pd.concat, and that worked too, so this really confuses me.
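For reference, a minimal sketch of what the pandas check might look like. The frames here are small made-up stand-ins, and pdf1 and pdf2 are hypothetical names; in practice they would come from something like df1.select(col_names).toPandas().

```python
import pandas as pd

col_names = ["city", "name"]

# Hypothetical stand-ins for the real data; the rows are made up for illustration.
pdf1 = pd.DataFrame({"city": ["Paris", "Rome", "Oslo"], "name": ["a", "b", "c"]})
pdf2 = pd.DataFrame({"city": ["Lima", "Cairo"], "name": ["d", "e"]})

# pd.concat stacks the frames row-wise, so the result has
# len(pdf1) + len(pdf2) rows.
combined = pd.concat([pdf1[col_names], pdf2[col_names]], ignore_index=True)
print(len(combined))  # 5
```

With pandas, the combined row count is the sum of the two inputs, which is what I expected from the Spark union as well.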
Could anyone shed some light on why this happens?