How can I combine multiple dataframes horizontally, based on 1 column, in Pyspark?

Question

I want to combine these 3 dataframes, based on their ID columns, and get the below output. I am after a short way that I can use it for combining many more number of dataframes later.

Input:

+---+---+---+
| ID|  a|  b|
+---+---+---+
|  A|  1|  1|
|  B|  2|  2|
+---+---+---+
+---+---+---+
| ID|  c|  d|
+---+---+---+
|  A|  3|333|
|  B|  4|444|
+---+---+---+
+---+---+---+
| ID|  e|  f|
+---+---+---+
|  A|555|  5|
|  B|666|  6|
+---+---+---+

Output:

+---+---+---+---+---+---+---+
| ID|  a|  b|  c|  d|  e|  f|
+---+---+---+---+---+---+---+
|  A|  1|  1|  3|333|555|  5|
|  B|  2|  2|  4|444|666|  6|
+---+---+---+---+---+---+---+

Answer: For anyone who might find it useful later!

# create list of dataframes
list_df = [df1, df2, df3]

# merge all at once
df_all = reduce(lambda x, y: x.join(y, on="ID"), list_df)

This question may have already been answered here (unfortunately it looks like it's not possible, or difficult): https://stackoverflow.com/questions/49763009/stack-spark-dataframes-horizontally-equivalent-to-pandas-concat-or-r-cbind — Stephen, Oct 15 '22 at 03:00
Did I miss something? Is that just df1 inner join df2 inner join df3 based on id column? — Jonathan Lam, Oct 15 '22 at 03:14
@JonathanLam it is what you said, but I am after a shorter way that I don't have to bring all one after another. So, that I can use it later for more number of dataframes. — Zrb, Oct 15 '22 at 06:13

score 0 · Answer 1 · answered Oct 15 '22 at 03:16

0

The 3 dataframes can be jioned according to the ID column.

df = df1.join(df2, 'ID').join(df3, 'ID')

answered Oct 15 '22 at 03:16

过过招

3,722
2
4
11

How can I combine multiple dataframes horizontally, based on 1 column, in Pyspark?

1 Answers1