
Suppose I have this DataFrame:

from pyspark.sql.types import StructType, StructField, IntegerType

TEST_schema = StructType([StructField("col1", IntegerType(), True),
                          StructField("col2", IntegerType(), True)])
TEST_data = [(5,-1),(4,-1),(3,3),(2,2),(1,-1),(0,-1),(0,-1),(0,2),(0,-1)]
TEST_df = sqlContext.createDataFrame(TEST_data, TEST_schema)
TEST_df.show()

+----+----+
|col1|col2|
+----+----+
|   5|  -1|
|   4|  -1|
|   3|   3|
|   2|   2|
|   1|  -1|
|   0|  -1|
|   0|  -1|
|   0|   2|
|   0|  -1|
+----+----+

What I want to do is count the number of -1 values in col2, starting from the row where col1 == 1.

So counting from that row to the end of the DataFrame should return 4.
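Stated concretely, the intended logic can be sketched in plain Python (without Spark; this only illustrates the expected count, inclusive of the col1 == 1 row):

```python
# Plain-Python sketch of the desired count; the actual question is about PySpark.
data = [(5, -1), (4, -1), (3, 3), (2, 2), (1, -1),
        (0, -1), (0, -1), (0, 2), (0, -1)]

# Find the first row where col1 == 1, then count col2 == -1 from that row onward.
start = next(i for i, (c1, _) in enumerate(data) if c1 == 1)
count = sum(1 for _, c2 in data[start:] if c2 == -1)
print(count)  # 4
```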

SCouto
hellotherebj
  • please update the input and the expected output so that I can investigate; at this moment I have added the count based on col2 == -1 over the remaining rows. – sathya Aug 06 '20 at 02:27

1 Answer


This code might be helpful to you:

from pyspark.sql.types import StructType, StructField, IntegerType

test_schema = StructType([StructField("col1", IntegerType(), True),
                          StructField("col2", IntegerType(), True)])
test_data = [(5,-1),(4,-1),(3,3),(2,2),(1,-1),(0,-1),(0,-1),(0,2),(0,-1)]
df = sqlContext.createDataFrame(test_data, test_schema)
df.show()


from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Assign a stable row number; ordering by a constant literal gives an
# arbitrary but consistent numbering when no natural sort key exists.
w = Window().orderBy(F.lit('A'))
df = df.withColumn("row_num", F.row_number().over(w))

# For each row, count col2 == -1 from the current row to the end.
w1 = Window.orderBy('row_num').rowsBetween(Window.currentRow, Window.unboundedFollowing)

df.withColumn('count', F.count(F.when(df.col2 == -1, 1)).over(w1)).show()
'''
+----+----+-------+-----+
|col1|col2|row_num|count|
+----+----+-------+-----+
|   5|  -1|      1|    6|
|   4|  -1|      2|    5|
|   3|   3|      3|    4|
|   2|   2|      4|    4|
|   1|  -1|      5|    4|
|   0|  -1|      6|    3|
|   0|  -1|      7|    2|
|   0|   2|      8|    1|
|   0|  -1|      9|    1|
+----+----+-------+-----+
'''
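The reverse running count that the unboundedFollowing window produces can be mimicked in plain Python as a suffix count (a sketch of the same logic, without Spark):

```python
# col2 values in row order, as in the DataFrame above.
col2 = [-1, -1, 3, 2, -1, -1, -1, 2, -1]

# Count -1 from each position to the end by scanning backwards once,
# which is what the currentRow..unboundedFollowing window computes per row.
counts = []
running = 0
for v in reversed(col2):
    if v == -1:
        running += 1
    counts.append(running)
counts.reverse()
print(counts)  # [6, 5, 4, 4, 4, 3, 2, 1, 1] -- matches the 'count' column
```

The value at the row where col1 == 1 (row_num 5, index 4) is the 4 the question asks for.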
sathya
  • Thank you, what if I want to get that count value of 4 directly, so that print(x) would return the literal integer 4? – hellotherebj Aug 06 '20 at 05:25
  • And also, what if I want to count only the first consecutive -1s starting from col1 == 1? So in our case it would return a count of 3. – hellotherebj Aug 06 '20 at 16:16
  • Please raise a new question with the proper input and the output you expect. Without that it is a bit difficult to understand the problem. – sathya Aug 06 '20 at 16:19
  • https://stackoverflow.com/questions/63288316/pyspark-count-the-consecutive-cell-in-the-column-with-condition – hellotherebj Aug 06 '20 at 16:51
  • I created another question, maybe you can help me there :) Thank you. https://stackoverflow.com/questions/63290611/pyspark-how-to-code-complicated-dataframe-calculation – hellotherebj Aug 06 '20 at 19:36