0

I have 2 dataframes

df1= 
+--------------+
|questions     |
+--------------+
|[Q1, Q2]      |
|[Q4, Q6, Q7]  |
|...           |
+---+----------+

df2 = 
+--------------------+---+---+---+---+
| Q1| Q2| Q3| Q4| Q6| Q7 | ...   |Q25|
+--------------------+---+---+---+---+
|  1|  0|  1|  0|  0| 1  | ...   |  1|
+--------------------+---+---+---+---+

I'd like to add in the first dataframe a new colum with the value of all columns defined into df1.questions.

Expected result

df1 = 
+--------------++--------------+
|questions     |values
+--------------+---------------+
|[Q1, Q2]      |[1, 0]         |
|[Q4, Q6, Q7]  |[0, 0, 1]      |
|...           |               |
+---+----------++--------------+
 

When I do

 cols_to_link = ['Q1', 'Q2']
 df2= df2.select([col for col in cols_to_link])\
 df2 = df2.withColumn('value', F.concat_ws(", ", *df2.columns))

the additionnal column is what I want, but I can't do it by mixing dataframes

It also works when I'm with df2

 df2 = df2.select([col for col in df1.select('questions').collect()[0][0]])\
 df2 = df2.withColumn('value', F.concat_ws(", ", *df2.columns))

But not when I want to go from df1

 df1= df1\
    .withColumn('value', F.concat_ws(", ", *df2.select([col for col in df1.select('questions').collect()])))

Where I'm wrong?

ItsMe
  • 395
  • 2
  • 13
Turvy
  • 882
  • 1
  • 8
  • 23

1 Answers1

1

From my example dataframes,

# df1
+------------+
|   questions|
+------------+
|    [Q1, Q2]|
|[Q4, Q6, Q7]|
+------------+

# df2
+---+---+---+---+---+---+
| Q1| Q2| Q3| Q4| Q6| Q7|
+---+---+---+---+---+---+
|  1|  0|  1|  0|  0|  1|
+---+---+---+---+---+---+

I have create the vertical dataframe and to join. You cannot refer the columns from the other dataframe in general.

cols = df2.columns
df = df2.rdd.flatMap(lambda row: [[cols[i], row[i]] for i in range(0, len(row))]).toDF(['id', 'values'])
df.show()

+---+------+
| id|values|
+---+------+
| Q1|     1|
| Q2|     0|
| Q3|     1|
| Q4|     0|
| Q6|     0|
| Q7|     1|
+---+------+

df1.join(df, f.expr('array_contains(questions, id)'), 'left') \
  .groupBy('questions').agg(f.collect_list('values').alias('values')) \
  .show()

+------------+---------+
|   questions|   values|
+------------+---------+
|    [Q1, Q2]|   [1, 0]|
|[Q4, Q6, Q7]|[0, 0, 1]|
+------------+---------+
Lamanus
  • 12,898
  • 4
  • 21
  • 47
  • Thank you. I had to change your way to create the vertical dataframe because I need more columns, so I followed this: https://stackoverflow.com/a/60212279/1568148, but I follow your stategy for complete my dataframe. – Turvy Nov 30 '21 at 16:17