I want to calculate the minimum of each row in a PySpark DataFrame.

In NumPy, this can be written as:

df.min(axis=1)

but I don't know how to do the same thing with a PySpark DataFrame.

e.g. I create a DataFrame (my real data is approximately 1,000,000 rows × 1,000 columns):

df = sqlContext.createDataFrame([(10, 10, 1), (200, 2, 20), (3, 30, 300), (400, 40, 4)], ("c1", "c2", "c3"))


+---+---+---+
| c1| c2| c3|
+---+---+---+
| 10| 10|  1|
|200|  2| 20|
|  3| 30|300|
|400| 40|  4|
+---+---+---+

and I want the output below:

+---+---+---+---+
| c1| c2| c3|min|
+---+---+---+---+
| 10| 10|  1|  1|
|200|  2| 20|  2|
|  3| 30|300|  3|
|400| 40|  4|  4|
+---+---+---+---+
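
For illustration, a minimal sketch of the kind of expression I am looking for, assuming pyspark.sql.functions.least (which, as far as I know, returns the per-row minimum of the columns passed to it):

from pyspark.sql import functions as F

# least() takes two or more columns and returns the per-row minimum;
# unpacking df.columns keeps this workable for ~1,000 columns
df.withColumn("min", F.least(*[F.col(c) for c in df.columns])).show()

Is this the right approach, and will it scale to data of this size?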