
I have a DataFrame df with a column named column, and I would like to convert this column into a vector (e.g. a DenseVector) so that I can use it in vector and matrix products.

Beware: I don't need a column of vectors; I need a vector object.

How can I do this?

I found the VectorAssembler transformer (link), but it doesn't help me: it combines some DataFrame columns into a single vector column, which is still a DataFrame column; my desired output is instead a standalone vector.
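
For illustration, a minimal sketch of what VectorAssembler produces (the input column names "x" and "y" are made up): it yields a new column whose entries are vectors, which is still a DataFrame column and not a standalone vector object.

from pyspark.ml.feature import VectorAssembler

# Combine two numeric columns into a single vector column called "features".
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
df_with_features = assembler.transform(df)  # "features" is a column of vectors, not a vector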


About the goal of this question: why am I trying to convert a DF column into a vector? Assume I have a DF with a numerical column, and I need to compute the product between a matrix and this column. How can I achieve this? (The same would apply to a numerical DF row.) Any alternative approach is welcome.

Vanni Rovera
  • please provide some sample data, along with the desired output – desertnaut Nov 05 '17 at 10:13
  • Possible duplicate of https://stackoverflow.com/questions/42138482/pyspark-how-do-i-convert-an-array-i-e-list-column-to-vector – MaFF Nov 05 '17 at 12:26
  • I don't think this is a duplicate. If I understand correctly, that other post is about converting the type of a DF column; I instead want to extract a column from the DF and turn it into a vector, so that it is no longer a column of any DF. – Vanni Rovera Nov 07 '17 at 08:01

1 Answer


How:

from pyspark.ml.linalg import DenseVector

DenseVector(df.select("column_name").rdd.map(lambda x: x[0]).collect())

but it doesn't make sense in any practical scenario.

Spark Vectors are not distributed; they are applicable only if the data fits in the memory of a single (driver) node. If that is the case, you wouldn't use a Spark DataFrame for processing in the first place.
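
For the matrix-product goal mentioned in the question, a minimal sketch of how the collected vector could then be used locally (assuming the data fits in driver memory; the matrix M, its shape, and the column name "column_name" are illustrative, and NumPy is used for the actual product):

import numpy as np
from pyspark.ml.linalg import DenseVector

# Collect the column into a local (non-distributed) DenseVector (only safe for small data).
v = DenseVector(df.select("column_name").rdd.map(lambda x: x[0]).collect())
arr = v.toArray()  # plain NumPy array backing the vector

# Illustrative matrix whose number of columns matches the vector's length.
M = np.random.rand(4, arr.shape[0])

# Matrix-vector product computed locally on the driver.
result = M.dot(arr)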

  • Well, so assume I have a DF with a numerical column and I need to compute a product between a matrix and this column. How can I achieve this? (The same could hold for a DF numerical row.) – Vanni Rovera Nov 07 '17 at 08:05