0

Issue trying to get a value from a DataFrame

I'm using pyspark in Databricks, I'm trying to get the amount of rows as a value to do some calculation

What I did:

f=sqlContext.sql('SELECT COUNT(*) AS COUNTF FROM SOOMLA')
#It´s a dataframe

Now I want to get this number into the DataFrame as a value to make calculation, I tried already:

f['COUNTF'].iloc[0]

but I got:

Out[158]: Column<b'COUNTF[iloc][0]'>

How can I get this 439016392 as a value?

pault
  • 41,343
  • 15
  • 107
  • 149
  • 2
    I think you want to do `val = sqlContext.sql('SELECT * FROM SOOMLA').count()` and then val should be your integer with the number of rows – Ben.T May 29 '19 at 15:55
  • 2
    Possible duplicate of [Store aggregate value of a PySpark dataframe column into a variable](https://stackoverflow.com/questions/36987454/store-aggregate-value-of-a-pyspark-dataframe-column-into-a-variable) and also maybe [this post](https://stackoverflow.com/questions/55594176/how-to-calculate-the-number-of-rows-of-a-dataframe-efficiently). – pault May 29 '19 at 16:10
  • 1
    @Alejandro you *could* also [convert to a pandas DataFrame](https://stackoverflow.com/questions/41826553/convert-between-spark-sql-dataframe-and-pandas-dataframe?noredirect=1&lq=1): `f.toPandas()['COUNTF'].iloc[0]` – pault May 29 '19 at 16:17

2 Answers2

1

I guess you're mixing Pandas df with Spark df when using iloc.

type(f['COUNTF'])

pyspark.sql.column.Column

You can do it, as Ben T suggested by:

f=sqlContext.sql('SELECT * FROM SOOMLA').count()
print(f)

or convert the df to pandas first and then:

f=sqlContext.sql('SELECT COUNT(*) AS COUNTF FROM SOOMLA').toPandas()['COUNTF'].iloc[0]
print(f)
michalrudko
  • 1,432
  • 2
  • 16
  • 30
0

You can run collect() and extract the first item into a variables. Here is example:

f = sqlContext.sql('SELECT COUNT(*) AS COUNTF FROM SOOMLA').collect()[0][0]

print(f)
#3

type(f)
#int

collect() returns list of Rows. So collect()[0] will return first row from list and collect()[0][0] will return first element of first row from list.

Here we are running COUNT(*) and it will return one record/row. So in this case collect() is returning one list with one row only and we need to extract first element from first row so collect()[0][0] will work.

I hope this explanation helps!

Shantanu Sharma
  • 3,661
  • 1
  • 18
  • 39