I got an error in PySpark:
AnalysisException: u'Resolved attribute(s) week#5230 missing from
longitude#4976,address#4982,minute#4986,azimuth#4977,province#4979,
action_type#4972,user_id#4969,week#2548,month#4989,postcode#4983,location#4981
in operator !Aggregate [user_id#4969, week#5230], [user_id#4969,
week#5230, count(distinct day#4987) AS days_per_week#3605L].
Attribute(s) with the same name appear in the operation: week.
Please check if the right attribute(s) are used
This seems to come from a snippet of code where the agg
function is used:
from pyspark.sql.functions import countDistinct, count

df_rs = (df_n.groupBy('user_id', 'week')
             .agg(countDistinct('day').alias('days_per_week'))
             .where('days_per_week >= 1')
             .groupBy('user_id')
             .agg(count('week').alias('weeks_per_user'))
             .where('weeks_per_user >= 5')
             .cache())
However, I do not see the issue here, and I have run this exact code on the same data many times before.
EDIT: I have been looking through the code, and this type of error seems to come from joins of this sort:
df = df1.join(df2, 'user_id', 'inner')
df3 = df4.join(df1, 'user_id', 'left_anti')
but I still have not solved the problem.
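One thing I plan to try, on the assumption that the joins leave the plan holding stale references to the 'week' column, is to re-project the joined DataFrame so every column gets a fresh reference before the aggregation. This is only a minimal sketch with toy data; the real df1/df2 have far more columns than shown here:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for df1 and df2; the real schemas are much wider.
df1 = spark.createDataFrame(
    [(1, 1, 1), (1, 2, 2), (2, 1, 3)],
    ['user_id', 'week', 'day'])
df2 = spark.createDataFrame([(1,), (2,)], ['user_id'])

df_n = df1.join(df2, 'user_id', 'inner')

# Re-project every column so the joined DataFrame carries fresh column
# references before the groupBy/agg that previously failed.
df_n = df_n.select([F.col(c).alias(c) for c in df_n.columns])

df_rs = (df_n.groupBy('user_id', 'week')
             .agg(F.countDistinct('day').alias('days_per_week'))
             .where('days_per_week >= 1')
             .groupBy('user_id')
             .agg(F.count('week').alias('weeks_per_user'))
             .where('weeks_per_user >= 5'))

I have also seen spark.createDataFrame(df_n.rdd, df_n.schema) suggested as a heavier-handed way of discarding the old plan entirely, but I have not yet confirmed that either approach fixes my case.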
EDIT2: Unfortunately the suggested question is not similar to mine: this is not a question of column-name ambiguity but of a missing attribute, and that attribute does not appear to be missing when I inspect the actual DataFrames.
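For what it is worth, this is roughly how I am inspecting the DataFrames (using the same toy df_n as in the sketch above). As far as I understand, the #numbers in the error (week#5230, week#2548) are Spark's internal expression IDs, which show up in the plans printed by explain(True):

# Quick checks: 'week' is present in both the column list and the schema.
print(df_n.columns)
df_n.printSchema()

# The analyzed/optimized plans show each column with its expression ID,
# which is where the conflicting week#... references would be visible.
df_n.explain(True)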