I have two DataFrames, and when I union them I get fewer rows than expected.

col_names = ["city", "name"]
print(df1.select(col_names).count()) # 50
print(df2.select(col_names).count()) # 60
df1.select(col_names).union(df2.select(col_names)).count() # 105 
df1.select(col_names).unionByName(df2.select(col_names)).count() # 105

But if I recreate both DataFrames from their collected rows, the counts are correct:

col_names = ["city", "name"]
df1_new = spark.createDataFrame(df1.select(col_names).head(50))
df2_new = spark.createDataFrame(df2.select(col_names).head(60))
print(df1_new.select(col_names).count()) # 50
print(df2_new.select(col_names).count()) # 60
df1_new.union(df2_new).count() # 110
df1_new.unionByName(df2_new).count() # 110

I also tried pandas with pd.concat, and that worked as well, so I am really confused.
Could anyone shed some light on why this happens?

derek
  • Hi derek, I agree that this is not the expected behavior and it probably needs more investigation on the content of the dataframes. Did you check `df1`, `df2` and their union with `show` and `printSchema`? And the usual question: Are you able to strip this observation down to a reproducible example that you can share? – Markus Apr 25 '22 at 12:40
  • @markus I've already filtered the columns of interest, i.e. `city` and `name`. So their schema is consistent and I also show the two columns and I do not see anything special. As I mentioned in the 2nd example, the issue disappeared when I recreated the DF. So I really have no idea where to go. – derek Apr 25 '22 at 16:27
  • Maybe... can you extract which 5's are missing from the unioned df in the above example? anything special about the 5? – Emma Apr 25 '22 at 17:47
  • the weird thing is the union output has elements that does not exist on df1 or df2. I am completely confused. – derek Apr 26 '22 at 00:39
  • Perhaps check the content of your selects with `hex` for any special characters? – Markus Apr 26 '22 at 09:29
  • Can you show the sample that you can demonstrate "union output has elements that does not exist on df1 or df2"? Also, what operation that you have before these lines, any dynamic random calculation? – Emma Apr 26 '22 at 14:59
  • Are you reusing df1 as a variable anywhere else? – Matt Andruff Apr 28 '22 at 17:30
  • @derek - What version of Spark are you using? Also, maybe you could share the `df1.explain()` and `df2.explain()` results, plus the `explain()` output of the unioned DataFrame? – ZygD Apr 29 '22 at 08:54
  • Could something like [this](https://stackoverflow.com/a/32191266/11865956) be happening? – BeRT2me Apr 30 '22 at 04:23
  • Since recreating works, it means, there's something in the history how `df1` and `df2` were created. This is why we ask for results of `.explain()` – ZygD Apr 30 '22 at 16:01
  • do you have NaN values ? – Charfeddine Mohamed Ali May 02 '22 at 04:36
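
The checks asked for in the comments above (which rows are missing from the union, and which rows appear in it without being in either input) can be sketched in plain Python once the rows are collected. This is only a minimal sketch, assuming the frames are small enough to `collect()` (50 and 60 rows here); `compare_union` is a hypothetical helper, not part of PySpark. If the lineage of `df1` or `df2` is non-deterministic, each `collect()` may re-run it, so the result could vary between runs.

```python
from collections import Counter


def compare_union(rows1, rows2, union_rows):
    """Compare a union result against its inputs as multisets.

    Returns (missing, unexpected): rows the union should contain but
    does not, and rows it contains that neither input supplied.
    """
    expected = Counter(rows1) + Counter(rows2)
    actual = Counter(union_rows)
    missing = expected - actual      # in the inputs, absent from the union
    unexpected = actual - expected   # in the union, absent from the inputs
    return missing, unexpected


# Hypothetical usage with the question's DataFrames. PySpark Row objects
# are tuples, so they can be counted directly:
#
# rows1 = df1.select(col_names).collect()
# rows2 = df2.select(col_names).collect()
# union_rows = df1.select(col_names).union(df2.select(col_names)).collect()
# missing, unexpected = compare_union(rows1, rows2, union_rows)
# print(sum(missing.values()), "missing;", sum(unexpected.values()), "unexpected")
```

Counting rows as multisets rather than sets keeps duplicates visible, which matters because `union` does not deduplicate.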

1 Answer

Run an explain on

df1_new.union(df2_new).explain()

and

df1.select(col_names).union(df2.select(col_names)).explain()

Comparing the two outputs will show you how the query plans behind these two unions differ.
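
Reading two long plans side by side is error-prone, so one option is to capture both outputs as text and diff them. The following is a minimal sketch, assuming `explain()` prints through Python's standard output (which is how PySpark's `DataFrame.explain` behaves as far as I know); `capture_explain` and `diff_plans` are hypothetical helper names, not Spark APIs.

```python
import difflib
import io
from contextlib import redirect_stdout


def capture_explain(df, extended=True):
    """Return the text that df.explain(extended) prints to stdout."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        df.explain(extended)
    return buf.getvalue()


def diff_plans(plan_a, plan_b):
    """Return a unified diff of two plan strings as a list of lines."""
    return list(
        difflib.unified_diff(
            plan_a.splitlines(),
            plan_b.splitlines(),
            fromfile="original",
            tofile="recreated",
            lineterm="",
        )
    )


# Hypothetical usage with the DataFrames from the question:
#
# original = capture_explain(df1.select(col_names).union(df2.select(col_names)))
# recreated = capture_explain(df1_new.union(df2_new))
# print("\n".join(diff_plans(original, recreated)))
```

Expression IDs in the plans (such as `#123`) will differ between any two DataFrames, so expect some noise in the diff; the interesting lines are the operators that appear in only one plan.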

Matt Andruff