6

I have a Spark DataFrame. Here it is:

I would like to fetch the values of a column one by one and assign each one to a variable. How can this be done in PySpark? Sorry, I am a newbie to Spark as well as Stack Overflow. Please forgive the lack of clarity in the question.

Fasty
  • For which column you want to do this? – pvy4917 Nov 13 '18 at 15:36
  • There are some fundamental misunderstandings here about how Spark DataFrames work. Don't think about iterating through values one by one; instead, think about operating on all the values at the same time (after all, it's a parallel, distributed architecture). This seems like an [XY problem](http://xyproblem.info/). Please explain, in detail, what you are trying to do and try to [edit] your question to provide a [reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples). – pault Nov 13 '18 at 15:38
  • Also, [don't post pictures of or links to code/data](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question). – pault Nov 13 '18 at 15:41

2 Answers

4
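# collect() brings the selected column to the driver as a list of Row objects
col1 = df.select(df.column_of_df).collect()
# Each Row has a single field; take it and convert it to a string
list1 = [str(row[0]) for row in col1]
# After this, we can iterate through the list (list1 in this case)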
Avinash
  • 1
    What if there are many more rows? The `collect()` operation will be costly, right? – pvy4917 Nov 13 '18 at 16:40
  • Repartition the DataFrame to match the number of nodes the instance is running on before using collect(), to reduce time and memory costs. – Avinash Nov 13 '18 at 17:58
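On the cost concern raised in the comments above: one option worth considering is `DataFrame.toLocalIterator()`, which brings rows to the driver one partition at a time instead of all at once, so the driver only needs enough memory for the largest partition. The following is a minimal sketch, not a definitive solution; the helper name `iter_column_values` is hypothetical, and it assumes `df` is an existing PySpark DataFrame that has the column `column_of_df` from the answer above.

```python
from typing import Any, Iterator


def iter_column_values(df: Any, column: str) -> Iterator[Any]:
    """Yield the values of one column, one at a time.

    `df` is assumed to be a pyspark.sql.DataFrame. toLocalIterator()
    fetches rows partition by partition rather than collecting the
    whole column onto the driver at once.
    """
    for row in df.select(column).toLocalIterator():
        # Each item is a Row with a single field; index 0 is its value.
        yield row[0]


# Usage sketch (assumes an existing DataFrame `df`):
#
#     for value in iter_column_values(df, "column_of_df"):
#         current = value  # one value per iteration
```

This is slower than a single `collect()` because partitions are fetched sequentially, so it only pays off when the column is too large to hold on the driver in one piece.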
0

I don't understand exactly what you are asking, but if you want to store the values in a variable outside of the DataFrames that Spark offers, the best option is to select the column you want and store it as a pandas Series (provided there are not too many values, because your memory is limited).

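from pyspark.sql import functions as F

# toPandas() returns a pandas DataFrame; indexing by the column name gives a Series
var = df.select(F.col('column_you_want')).toPandas()['column_you_want']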

Then you can iterate on it like a normal pandas series.

Manrique
  • 1
    No, I need to access one value in each iteration and store it in a variable. I don't want to use toPandas as it consumes more memory! – Fasty Nov 13 '18 at 15:16