4

I'm issuing the following SQL statement against a tempview

cloudantdata.createOrReplaceTempView("washingflat")

sqlDF = spark.sql("SELECT temperature FROM washingflat")
sqlDF.rdd.map(lambda row : row.temperature).collect()

I'm just interested in the plain (unwrapped) integer values. All I've tried so far with the dataframe API always returned Row-objects wrapping the values I'm interested in.

Is there a way to get the scalar contents without using the RDD api?

Romeo Kienzler
  • 3,373
  • 3
  • 36
  • 58

4 Answers4

2

You could manually put them in a list after collecting them as below

temps = []
rows = sqlDF.collect()
for r in rows:
    temps.append(r['temperature'])
David
  • 11,245
  • 3
  • 41
  • 46
  • I'm really missing the map function on dataframes – Romeo Kienzler Nov 01 '16 at 20:07
  • `map` function converted `DataFrame` to `RDD`. There is no difference between old `df.map(...)` and new `df.rdd.map(...)`. `collect` converts to `RDD` first too. –  Nov 01 '16 at 20:40
  • @RomeoKienzler Yea, Lost is correct. Check out http://stackoverflow.com/questions/39535447/attributeerror-dataframe-object-has-no-attribute-map/39536218#39536218 – David Nov 01 '16 at 20:57
1

Try:

>>> from itertools import chain
>>>
>>> chain.from_iterable(sqlDF.collect())
1

So given an input DataFrame

import numpy as np
import pandas as pd
test_df = pd.DataFrame({'Age': np.random.uniform(0,100, size = (100,)), 'City': 'LA'})
sqlContext.createDataFrame(test_df).registerTempTable('AgeTable')

There are two (primary) ways for extracting a value without using the Row abstraction. The first is to use the .toPandas() method of a DataFrame / SQL Query

print(sqlContext.sql("SELECT Age FROM AgeTable").toPandas()['Age'])

This returns a Pandas DataFrame / Series.

The second is to actually group the data inside of SQL and then extract it from a single Row object

al_qry = sqlContext.sql("SELECT City, COLLECT_SET(Age) as AgeList FROM AgeTable GROUP BY City")
al_qry.first()[0].AgeList

This returns a raw python list.

The more efficient way is with the toPandas method and this approach will likely be improved more in the future.

kmader
  • 1,319
  • 1
  • 10
  • 13
0
temp_list = [str(i.temperature) for i in sqlDF.select("temperatue").collect()]
braj
  • 2,545
  • 2
  • 29
  • 40