Accessing count column in PySpark

Question

code:

mydf = testDF.groupBy(testDF.word).count()
mydf.show()

output:

+-----------+-----+
|       word|count|
+-----------+-----+
|        she| 2208|
|    mothers|   93|
|       poet|   59|
|     moving|   18|
|     active|    6|
|       foot|  169|

I wanted to order this data frame based on word count in descending order.

code:

countDF = mydf.orderBy(mydf.count.desc())
countDF.show()

Error:

AttributeError: 'function' object has no attribute 'desc'

Please let me know where I am going wrong.

Check this http://stackoverflow.com/questions/30332619/how-to-sort-by-column-in-descending-order-in-spark-sql — κροκς, Jul 14 '16 at 17:30
@kgiou It is not a duplicate. Problem here is Python specific. — zero323, Jul 14 '16 at 22:15

zero323 · Accepted Answer · 2016-07-14T17:44:38.420


Dot notation is not the most reliable way to access columns. Although DataFrame provides a column-aware __getattr__, you can run into conflicts like this one, where the name resolves to a method (here DataFrame.count) instead of the column. It is safer to use bracket notation:

mydf.orderBy(mydf["count"].desc())

or the col function:

from pyspark.sql.functions import col

mydf.orderBy(col("count").desc())

to reference columns.
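The conflict comes from how Python resolves attributes: __getattr__ is called only when normal lookup fails, so a method defined on the class always wins over a column of the same name. Here is a minimal sketch of that mechanism in plain Python. ToyFrame and its string "columns" are hypothetical stand-ins for illustration, not PySpark's actual implementation:

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame; not part of PySpark."""

    def __init__(self, columns):
        self._columns = list(columns)

    def count(self):
        # A regular method, analogous to DataFrame.count.
        return 0

    def __getattr__(self, name):
        # Python calls __getattr__ only after normal attribute lookup fails,
        # so it never runs for names such as "count" that the class defines.
        if name in self._columns:
            return "column:" + name
        raise AttributeError(name)

    def __getitem__(self, name):
        # Bracket access does not go through attribute lookup at all.
        if name in self._columns:
            return "column:" + name
        raise KeyError(name)


frame = ToyFrame(["word", "count"])

word_attr = frame.word       # no such attribute, falls through to __getattr__
count_attr = frame.count     # resolves to the bound method, not the column
count_item = frame["count"]  # bracket notation reaches the column
```

In this sketch frame.count is a bound method, which has no desc attribute, while frame["count"] returns the column stand-in. The same reasoning suggests why mydf["count"] and col("count") avoid the error.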

edited Jul 14 '16 at 17:44

answered Jul 14 '16 at 17:35


There is another option, `mydf.sort(-col("count"))` – Alberto Bonsanto Jul 14 '16 at 17:55

@AlbertoBonsanto `desc("count")` as well. The `desc` methods are slightly more generic because they don't require a type that supports `-`. Still, I think it is more about `getattr` mechanics than sorting itself. – zero323 Jul 14 '16 at 17:57
