Why does the union of two dataframes give a lower count?

Question

I have two dataframes, and when I union them I get fewer rows than the sum of their counts.

col_names = ["city", "name"]
print(df1.select(col_names).count()) # 50
print(df2.select(col_names).count()) # 60
df1.select(col_names).union(df2.select(col_names)).count() # 105 
df1.select(col_names).unionByName(df2.select(col_names)).count() # 105

But if I recreate both dataframes, the counts are correct:

col_names = ["city", "name"]
df1_new = spark.createDataFrame(df1.select(col_names).head(50))
df2_new = spark.createDataFrame(df2.select(col_names).head(60))
print(df1_new.select(col_names).count()) # 50
print(df2_new.select(col_names).count()) # 60
df1_new.union(df2_new).count() # 110
df1_new.unionByName(df2_new).count() # 110

I also tried pandas with pd.concat, and that worked too, so I am really confused.
Could anyone shed some light on why this happens?

Hi derek, I agree that this is not the expected behavior, and it probably needs more investigation into the contents of the dataframes. Did you check `df1`, `df2` and their union with `show` and `printSchema`? And the usual question: are you able to strip this observation down to a reproducible example that you can share? — Markus, Apr 25 '22 at 12:40
@markus I've already selected only the columns of interest, i.e. `city` and `name`, so their schemas are consistent. I also displayed the two columns and did not see anything special. As I mentioned in the 2nd example, the issue disappeared when I recreated the DFs, so I really have no idea where to go. — derek, Apr 25 '22 at 16:27
Maybe... can you extract which 5 rows are missing from the unioned df in the above example? Is there anything special about those 5? — Emma, Apr 25 '22 at 17:47
The weird thing is that the union output has elements that do not exist in df1 or df2. I am completely confused. — derek, Apr 26 '22 at 00:39
Perhaps check the content of your selects with `hex` for any special characters? — Markus, Apr 26 '22 at 09:29
Can you show a sample that demonstrates "union output has elements that do not exist in df1 or df2"? Also, what operations do you run before these lines? Any dynamic or random calculation? — Emma, Apr 26 '22 at 14:59
@derek - What version of Spark are you using? Also, maybe you could share the `df1.explain()` and `df2.explain()` results, as well as `.explain()` for the unioned df? — ZygD, Apr 29 '22 at 08:54
Could something like [this](https://stackoverflow.com/a/32191266/11865956) be happening? — BeRT2me, Apr 30 '22 at 04:23
Since recreating works, it means there's something in the history of how `df1` and `df2` were created. This is why we ask for the results of `.explain()`. — ZygD, Apr 30 '22 at 16:01

Matt Andruff · Answer 1 · 2022-04-28T18:53:51.700


Run an explain on

df1_new.union(df2_new).explain()

and

df1.select(col_names).union(df2.select(col_names)).explain()

The output will show you how the query plans of these two unions differ.
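
Alongside comparing the plans, the question raised in the comments (which rows are missing from the union, and which rows appear that are in neither input) can be checked with a multiset comparison. The following is only a minimal sketch, assuming the data is small enough to collect to the driver. `multiset_diff` is a hypothetical helper, not a Spark API, and the sample tuples are made up for illustration.

```python
from collections import Counter


def multiset_diff(expected_rows, actual_rows):
    """Compare two row collections as multisets.

    Returns (missing, unexpected):
      missing    - rows in expected_rows that actual_rows lacks (with multiplicity)
      unexpected - rows in actual_rows that expected_rows lacks (with multiplicity)
    """
    expected = Counter(expected_rows)
    actual = Counter(actual_rows)
    # Counter subtraction keeps only positive counts.
    return expected - actual, actual - expected


# Made-up sample data standing in for collected rows. With Spark, the lists
# could be built (assumption: the data fits on the driver) as, for example:
#   rows1 = [tuple(r) for r in df1.select(col_names).collect()]
rows1 = [("Paris", "Ann"), ("Rome", "Bob")]
rows2 = [("Oslo", "Cy"), ("Rome", "Bob")]
union_rows = [("Paris", "Ann"), ("Rome", "Bob"), ("Oslo", "Cy"), ("Lima", "Dee")]

missing, unexpected = multiset_diff(rows1 + rows2, union_rows)
print("missing from union:", dict(missing))
print("unexpected in union:", dict(unexpected))
```

A union keeps duplicates, so the expected result is simply the rows of both inputs concatenated. Any non-empty `missing` or `unexpected` result points at concrete rows to inspect further.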

edited Apr 28 '22 at 18:53

answered Apr 28 '22 at 17:32

