How can I add a column from one Spark DataFrame to another Spark DataFrame (using PySpark)?

Question

I have two Spark DataFrames, and I want to add a column from one of them to the other.

My code is:

new = df.withColumn("prob", tr_df.prob)

Here I want to add column result2, which is in tr_df, to my DataFrame df under the name prob. I searched for a solution, but nothing worked for me, and I am getting this error:

AnalysisException: u'resolved attribute(s) prob#579 missing from q1_n_words#388L,prediction#510,res1#390,q2_n_words#389L,tfidf_word_match#384,Average#379,prob#385,probability#485,Cosine#381,word_m#383,rawPrediction#461,features#438,res2#391,question1#373,Jaccard#382,test_id#372L,raw_pred#377,question2#374,q2len#376,Common#378L,result2#387,q1len#375,result1#386,Percentage#380 in operator !Project [test_id#372L, question1#373, question2#374, q1len#375, q2len#376, raw_pred#377, Common#378L, Average#379, Percentage#380, Cosine#381, Jaccard#382, word_m#383, tfidf_word_match#384, prob#579 AS prob#634, result1#386, result2#387, q1_n_words#388L, q2_n_words#389L, res1#390, res2#391, features#438, rawPrediction#461, probability#485, prediction#510];;\n!Project [test_id#372L, question1#373, question2#374, q1len#375, q2len#376, raw_pred#377, Common#378L, Average#379, Percentage#380, Cosine#381, Jaccard#382, word_m#383, tfidf_word_match#384, prob#579 AS prob#634, result1#386, result2#387, q1_n_words#388L, q2_n_words#389L, res1#390, res2#391, features#438, rawPrediction#461, probability#485, prediction#510]\n+- Project [test_id#372L, question1#373, question2#374, q1len#375, q2len#376, raw_pred#377, Common#378L, Average#379, Percentage#380, Cosine#381, Jaccard#382, word_m#383, tfidf_word_match#384, prob#385, result1#386, result2#387, q1_n_words#388L, q2_n_words#389L, res1#390, res2#391, features#438, rawPrediction#461, probability#485, UDF(rawPrediction#461) AS prediction#510]\n   +- Project [test_id#372L, question1#373, question2#374, q1len#375, q2len#376, raw_pred#377, Common#378L, Average#379, Percentage#380, Cosine#381, Jaccard#382, word_m#383, tfidf_word_match#384, prob#385, result1#386, result2#387, q1_n_words#388L, q2_n_words#389L, res1#390, res2#391, features#438, rawPrediction#461, UDF(rawPrediction#461) AS probability#485]\n      +- Project [test_id#372L, question1#373, question2#374, q1len#375, q2len#376, raw_pred#377, Common#378L, Average#379, Percentage#380, 
Cosine#381, Jaccard#382, word_m#383, tfidf_word_match#384, prob#385, result1#386, result2#387, q1_n_words#388L, q2_n_words#389L, res1#390, res2#391, features#438, UDF(features#438) AS rawPrediction#461]\n         +- Project [test_id#372L, question1#373, question2#374, q1len#375, q2len#376, raw_pred#377, Common#378L, Average#379, Percentage#380, Cosine#381, Jaccard#382, word_m#383, tfidf_word_match#384, prob#385, result1#386, result2#387, q1_n_words#388L, q2_n_words#389L, res1#390, res2#391, UDF(struct(q1len#375, q2len#376, cast(q1_n_words#388L as double) AS q1_n_words_double_VectorAssembler_4158baa8e5b4f3aced2b#435, cast(q2_n_words#389L as double) AS q2_n_words_double_VectorAssembler_4158baa8e5b4f3aced2b#436, cast(Common#378L as double) AS Common_double_VectorAssembler_4158baa8e5b4f3aced2b#437, Average#379, Percentage#380, Cosine#381, Jaccard#382, word_m#383, prob#385, raw_pred#377, res1#390, res2#391)) AS features#438]\n            +- LogicalRDD [test_id#372L, question1#373, question2#374, q1len#375, q2len#376, raw_pred#377, Common#378L, Average#379, Percentage#380, Cosine#381, Jaccard#382, word_m#383, tfidf_word_match#384, prob#385, result1#386, result2#387, q1_n_words#388L, q2_n_words#389L, res1#390, res2#391]\n'

tr_df Schema --

tr_df.printSchema()
root
 |-- prob: float (nullable = true)

df Schema --

df.printSchema()
root
 |-- test_id: long (nullable = true)

Please Help! Thanks in advance.

Do you want to add the same value to every row in `df`? Or can you join `df` and `tr_df` on some condition? – iurii_n, May 05 '17 at 09:32
No, every row will contain a different value. I do not want any condition applied. – vishakha deshmukh, May 05 '17 at 09:35
Well, if every row has a different value, then you have to join these DataFrames and select the columns you need. Can you provide the schema of both DataFrames? – iurii_n, May 05 '17 at 09:36
Please see my edited question. I tried to join these two DataFrames, but when I then tried to write the result to CSV I got the error **AnalysisException: u'Cartesian joins could be prohibitively expensive and are disabled by default. To explicitly enable them, please set spark.sql.crossJoin.enabled = true;'**. I searched for solutions and called **spark.conf.set("spark.sql.crossJoin.enabled", "true")** before joining the DataFrames, but the error stayed the same. – vishakha deshmukh, May 05 '17 at 09:44
Which columns did you use for the join? I don't see any related columns in your schemas that you could join on. Or do you want to add the values randomly? – iurii_n, May 05 '17 at 10:29

score 0 · Answer 1 · edited May 23 '17 at 12:26

As the error message states, you need to set spark.sql.crossJoin.enabled = true in your Spark configuration.

You can set it like this:

import org.apache.spark.SparkConf
val sparkConf = new SparkConf().setAppName("Test")
sparkConf.set("spark.sql.crossJoin.enabled", "true")

Then get or create the SparkSession by passing it this SparkConf:

import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

Then do your join...

Source: How to enable Cartesian join in Spark 2.0?

edited May 23 '17 at 12:26 · Community
answered May 05 '17 at 10:00 · Sanchit Grover

@Sanchit Can you please provide this solution in PySpark? I tried it like this: `spark.conf.set("spark.sql.crossJoin.enabled", "true")` followed by `n = df.join(tr_df)`, but it didn't work for me. – vishakha deshmukh May 05 '17 at 10:09

Hegde · Answer 2 · 2020-09-08T07:19:40.973

In PySpark you can do it as shown below. I hope it is useful.

>>> spark.conf.set("spark.sql.crossJoin.enabled", True)
>>> df1.show()
+----+
|col1|
+----+
|  23|
|  56|
|  78|
|  31|
+----+

>>> df2.show()
+----+
|col2|
+----+
|  87|
|  45|
|  23|
|  11|
+----+

>>> final = df1.crossJoin(df2)
>>> final.withColumnRenamed('col2', 'result').show()
+----+------+
|col1|result|
+----+------+
|  23|    87|
|  23|    45|
|  23|    23|
|  23|    11|
|  56|    87|
|  56|    45|
|  56|    23|
|  56|    11|
|  78|    87|
|  78|    45|
|  78|    23|
|  78|    11|
|  31|    87|
|  31|    45|
|  31|    23|
|  31|    11|
+----+------+

Please copy/paste your code and the output, instead of images — AlexisG, Sep 08 '20 at 07:12
