
I am new to Spark, and I am trying to apply a groupby-and-count to my dataframe `df` on the `users` attribute.

import pandas as pd

comments = [ (1, "Hi I heard about Spark"),
  (1, "Spark is awesome"),
  (2, None),
  (2, "And I don't know why..."),
  (3, "Blah blah")]

df = pd.DataFrame(comments)
df.columns = ["users", "comments"]

Which looks like this in pandas:

   users                 comments
0      1   Hi I heard about Spark
1      1         Spark is awesome
2      2                     None
3      2  And I don't know why...
4      3                Blah blah

I want to find the equivalent of the following pandas code for pyspark

df.groupby(['users'])['users'].transform('count') 

The output looks like this:

0    2
1    2
2    2
3    2
4    1
dtype: int64
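For reference, the full pandas snippet that produces this column, assembled from the pieces above, looks like this (`transform('count')` returns a Series aligned with the original index, so it can be assigned directly as a new column):

```python
import pandas as pd

comments = [(1, "Hi I heard about Spark"),
            (1, "Spark is awesome"),
            (2, None),
            (2, "And I don't know why..."),
            (3, "Blah blah")]

df = pd.DataFrame(comments, columns=["users", "comments"])

# transform('count') keeps one output row per input row,
# unlike groupby(...).count(), which collapses to one row per group
df["count"] = df.groupby("users")["users"].transform("count")
print(df)
```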

Could you help me how I can implement this in PySpark?


1 Answer

This should work in PySpark: `df.groupby('users').count()`. In PySpark, `groupby()` is an alias for `groupBy()`. The PySpark docs are pretty easy reading, with some good examples.

UPDATE:

Now that I understand the request a little better: it doesn't appear that PySpark has in-place transform support yet. See this answer.

But you can do it via a join.

# count rows per user, then join the counts back onto the original rows
df2 = df.groupby('users').count()
df.join(df2, df.users == df2.users, "left") \
    .drop(df2.users).drop(df.comments).show()

+-----+-----+
|users|count|
+-----+-----+
|    1|    2|
|    1|    2|
|    3|    1|
|    2|    2|
|    2|    2|
+-----+-----+
  • Thanks @data_steve. I think my issue is in the `transform` part. I'd like to insert the counted values as a new column (or the `user` column) in the same dataframe. Anyway easy ways to do this? – MomoPP Feb 07 '17 at 19:54
  • @MomoPP usually you'd give a small data example to illustrate what you mean, both from where you are starting and what you want to output to look like. I'm a little confused by this wording in your post `replace it by the count values`. What does it refer to: the user column or the dataframe? – data_steve Feb 07 '17 at 19:57
  • 1
    Thank you very much, Steve, for going out of your way and helping me out here. Sorry for not providing enough details on this problem before. It totally makes sense now. Excellent work. – MomoPP Feb 07 '17 at 21:47
  • 1
    @momopp We're all new at this. I found this resource helpful on how to construct better questions. But pyspark community here is thin compared to other groups. http://stackoverflow.com/help/how-to-ask – data_steve Feb 07 '17 at 22:01