I am new to Spark and I am trying to apply a groupby-and-count to my DataFrame df on the users column.
import pandas as pd

comments = [(1, "Hi I heard about Spark"),
            (1, "Spark is awesome"),
            (2, None),
            (2, "And I don't know why..."),
            (3, "Blah blah")]
df = pd.DataFrame(comments, columns=["users", "comments"])
which looks like this in pandas:
   users                 comments
0      1   Hi I heard about Spark
1      1         Spark is awesome
2      2                     None
3      2  And I don't know why...
4      3                Blah blah
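For reference, this is how I load the same data in PySpark (assuming a local SparkSession; sdf is what I call my Spark DataFrame):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(comments, ["users", "comments"])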
I want to find the PySpark equivalent of the following pandas code:
df.groupby(['users'])['users'].transform('count')
The output looks like this:
0 2
1 2
2 2
3 2
4 1
dtype: int64
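My guess is that a window function partitioned by users could reproduce this per-row count, something like the sketch below, but I am not sure it is correct or idiomatic (the result column name "count" is my own choice):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# count the rows in each users partition and attach the result to every row
w = Window.partitionBy("users")
sdf = sdf.withColumn("count", F.count("users").over(w))
sdf.show()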
Could you help me implement this in PySpark?