
I have an org.apache.spark.sql.DataFrame that I would like to convert into a column (org.apache.spark.sql.Column). This is my DataFrame: val filled_column2 = x.select(first("col1", ignoreNulls = true).over(window)), and I want to convert it into a Spark SQL Column. Could anyone help with that? Thank you.
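For context, a minimal sketch of the setup (the window spec here is an assumption reconstructed from the output schema quoted in the comments further down); note that first(...).over(window) is itself already a Column, and it only becomes a DataFrame once it is wrapped in select:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first

// Hypothetical window matching "PARTITION BY x1 ORDER BY x2" from the comments below.
val window = Window.partitionBy("x1").orderBy("x2")

// This expression is an org.apache.spark.sql.Column...
val filledCol = first("col1", ignoreNulls = true).over(window)

// ...and select(...) is what turns it into a DataFrame.
val filled_column2 = x.select(filledCol)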


@Jaime Caffarel: this is exactly what I am trying to do; the screenshots below should give you more visibility. You may also check the error message in the second screenshot.

(screenshot: the filled_column2 DataFrame)

(screenshot: the error message)


1 Answer


From the documentation of the class org.apache.spark.sql.Column:

A column that will be computed based on the data in a DataFrame. A new column is constructed based on the input columns present in a DataFrame:

df("columnName") // On a specific DataFrame.
col("columnName") // A generic column no yet associcated with a DataFrame. col("columnName.field") // Extracting a struct field col("a.column.with.dots") // Escape . in column names. $"columnName" // Scala short hand for a named column. expr("a + 1") // A column that is constructed from a parsed SQL Expression. lit("abc") // A column that produces a literal (constant) value.

If filled_column2 is a DataFrame, you could do:

filled_column2("col1")

******** EDITED AFTER CLARIFICATION ************

OK, it seems to me that what you are trying to do is a JOIN operation. Assuming that product_id is a unique key for each row, I would do something like this:

val filled_column = df.select(df("product_id"), last("last_prev_week_nopromo", ignoreNulls = true).over(window))

This way, you are also selecting the product_id that you will use as the key. Then, you can do the following:

val promo_txn_cnt_seas_df2 = promo_txn_cnt_seas_df1
  .join(filled_column, promo_txn_cnt_seas_df1("product_id") === filled_column("product_id"), "inner")
  // .orderBy("product_id", "week") ... (the rest of the operations)
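For completeness, the window referenced above is never defined in the post; a hypothetical spec consistent with the orderBy("product_id", "week") hint would be:

import org.apache.spark.sql.expressions.Window

// Assumption: one partition per product, ordered chronologically by week.
val window = Window.partitionBy("product_id").orderBy("week")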

Is this what you are trying to achieve?

  • 'filled_column2: org.apache.spark.sql.DataFrame = [first(col1) OVER (PARTITION BY x1 ORDER BY x2 ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string]'; this is filled_column2 – Olfa2 Oct 11 '22 at 15:14
  • What's the type of x on `val x = filled_column2("col1")`? – Jaime Caffarel Oct 11 '22 at 15:44