How to filter a dataframe in JSON to show the count of names that are both male and female?

Question

I am having issues putting together a query that filters an existing dataframe to show the count of names that are the same for both male and females.

We assigned a name to be female if the number of women and men were equal. Write a filter based on the df_ssa5 DataFrame to count and print out how many times this occurs and how many names there are in total.

With df_ssa5 being a given dataframe; 
df_ssa5 = df_ssa4.groupBy("name").sum("F","M").withColumnRenamed("sum(F)","women").withColumnRenamed("sum(M)","men")

df_ssa5.show()

If anyone could help that'd be great.

The desired output would be something like this, however with the same names and the count of time the names shows up that are both for men and women:

I am asking you to add input and desired output in the questions — Kishore, Sep 11 '18 at 06:30
I'm new to spark, trying to figure out how to write a comparison/equal for both genders by name — Nat, Sep 11 '18 at 06:30
add sample input and expected output along with the code what you have tried so far... — Prasad Khode, Sep 11 '18 at 06:32
So you want the count of the rows where the value in the women column equals men? So for the added data the result should be 0? — Shaido, Sep 11 '18 at 09:23
Welcome to [tag:apache-spark] on Stack Overflow. Please read carefully [How to make good reproducible Apache Spark Dataframe examples](https://stackoverflow.com/q/48427185), [edit] your question accordingly, and follow the guidelines in the future. Also don't forget about general suggestion [on how to ask](https://stackoverflow.com/q/48427185) or how to provide [mcve]. Additionally please avoid images of data / code / exceptions and such... — zero323, Sep 11 '18 at 10:32

Prasad Khode · Answer 1 · 2018-09-11T07:32:08.207

0

you need to use groupBy operations along with spark's aggregations something like below:

import org.apache.spark.sql.functions._
import sparkSession.sqlContext.implicits._

val df = Seq(("Soni", "F"),("Lorne", "F"),("Lorne", "F"), ("Bonnye", "F"),("Lorne", "M"),("Nimrit", "M")).toDF("name","gender")

val result = df.groupBy("name")
  .agg(
    sum(when($"gender" === "M", lit(1))).as("men_count"),
    sum(when($"gender" === "F", lit(1))).as("women_count")
  )

result.show(false)

result.filter($"men_count".isNotNull && $"women_count".isNotNull).show(false)

output:

+------+---------+-----------+
|name  |men_count|women_count|
+------+---------+-----------+
|Soni  |null     |1          |
|Lorne |1        |2          |
|Bonnye|null     |1          |
|Nimrit|1        |null       |
+------+---------+-----------+

+-----+---------+-----------+
|name |men_count|women_count|
+-----+---------+-----------+
|Lorne|1        |2          |
+-----+---------+-----------+

edited Sep 11 '18 at 07:32

answered Sep 11 '18 at 06:59

Prasad Khode

6,602
11
44
59

Would there be a way where it uses a filter? – Nat Sep 11 '18 at 07:17
yes you can use `filter` to filter out the records based on some conditions, check the updated answer – Prasad Khode Sep 11 '18 at 07:32
This is still giving me an error. The dataset of names is given to us, you don't have to create it. The output is from the code the I ran before. I would like to know how many times it shows up where a man and a woman have the same name, not the count of the names. If that makes sense. – Nat Sep 11 '18 at 07:42

How to filter a dataframe in JSON to show the count of names that are both male and female?

1 Answers1