-1

I am having issues putting together a query that filters an existing dataframe to show the count of names that are the same for both male and females.

We assigned a name to be female if the number of women and men were equal. Write a filter based on the df_ssa5 DataFrame to count and print out how many times this occurs and how many names there are in total.

With df_ssa5 being a given dataframe; 
df_ssa5 = df_ssa4.groupBy("name").sum("F","M").withColumnRenamed("sum(F)","women").withColumnRenamed("sum(M)","men")

df_ssa5.show()

If anyone could help that'd be great.

The desired output would be something like this, however with the same names and the count of time the names shows up that are both for men and women:

enter image description here

Prasad Khode
  • 6,602
  • 11
  • 44
  • 59
Nat
  • 11
  • 1
  • 4
  • can you add dataframe value in the question? – Kishore Sep 11 '18 at 06:28
  • @Kishore yes you can add dataframe value. – Nat Sep 11 '18 at 06:28
  • I am asking you to add input and desired output in the questions – Kishore Sep 11 '18 at 06:30
  • I'm new to spark, trying to figure out how to write a comparison/equal for both genders by name – Nat Sep 11 '18 at 06:30
  • I'm sorry I'm not understanding what're you saying – Nat Sep 11 '18 at 06:30
  • add sample input and expected output along with the code what you have tried so far... – Prasad Khode Sep 11 '18 at 06:32
  • Added a pic of the sample output – Nat Sep 11 '18 at 06:38
  • So you want the count of the rows where the value in the women column equals men? So for the added data the result should be 0? – Shaido Sep 11 '18 at 09:23
  • Welcome to [tag:apache-spark] on Stack Overflow. Please read carefully [How to make good reproducible Apache Spark Dataframe examples](https://stackoverflow.com/q/48427185), [edit] your question accordingly, and follow the guidelines in the future. Also don't forget about general suggestion [on how to ask](https://stackoverflow.com/q/48427185) or how to provide [mcve]. Additionally please avoid images of data / code / exceptions and such... – zero323 Sep 11 '18 at 10:32

1 Answers1

0

you need to use groupBy operations along with spark's aggregations something like below:

import org.apache.spark.sql.functions._
import sparkSession.sqlContext.implicits._

val df = Seq(("Soni", "F"),("Lorne", "F"),("Lorne", "F"), ("Bonnye", "F"),("Lorne", "M"),("Nimrit", "M")).toDF("name","gender")

val result = df.groupBy("name")
  .agg(
    sum(when($"gender" === "M", lit(1))).as("men_count"),
    sum(when($"gender" === "F", lit(1))).as("women_count")
  )

result.show(false)

result.filter($"men_count".isNotNull && $"women_count".isNotNull).show(false)

output:

+------+---------+-----------+
|name  |men_count|women_count|
+------+---------+-----------+
|Soni  |null     |1          |
|Lorne |1        |2          |
|Bonnye|null     |1          |
|Nimrit|1        |null       |
+------+---------+-----------+

+-----+---------+-----------+
|name |men_count|women_count|
+-----+---------+-----------+
|Lorne|1        |2          |
+-----+---------+-----------+
Prasad Khode
  • 6,602
  • 11
  • 44
  • 59
  • Would there be a way where it uses a filter? – Nat Sep 11 '18 at 07:17
  • yes you can use `filter` to filter out the records based on some conditions, check the updated answer – Prasad Khode Sep 11 '18 at 07:32
  • This is still giving me an error. The dataset of names is given to us, you don't have to create it. The output is from the code the I ran before. I would like to know how many times it shows up where a man and a woman have the same name, not the count of the names. If that makes sense. – Nat Sep 11 '18 at 07:42