I have a table in Hive called test with columns id and name. I have another Hive table called mysql with columns id, name, and city. Now I want to compare the schemas of the two tables and add any missing columns to the Hive table test.
>>> hive_df = sqlContext.table("testing.test")
>>> mysql_df = sqlContext.table("testing.mysql")
>>> hive_df.dtypes
[('id', 'int'), ('name', 'string')]
>>> mysql_df.dtypes
[('id', 'int'), ('name', 'string'), ('city', 'string')]
>>> hive_dtypes = hive_df.dtypes
>>> mysql_dtypes = mysql_df.dtypes
>>> diff = set(mysql_dtypes) ^ set(hive_dtypes)
>>> diff
set([('city', 'string')])
>>> for col_name, col_type in diff:
...     sqlContext.sql("ALTER TABLE testing.test ADD COLUMNS ({0} {1})".format(col_name, col_type))
...
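One note on the loop above: ^ is the symmetric difference, so if test ever had a column that mysql lacked, or the same column name with a different type, those tuples would be picked up too and the ALTER would fail on a duplicate column. A one-sided difference avoids that; a minimal sketch reusing the variables from the session above:

# Columns present in mysql but missing from test (one-sided difference).
missing = set(mysql_dtypes) - set(hive_dtypes)

for col_name, col_type in missing:
    # Assumes col_name and col_type need no quoting or type mapping;
    # a name/type pair that differs only in type would still collide here.
    sqlContext.sql(
        "ALTER TABLE testing.test ADD COLUMNS ({0} {1})".format(col_name, col_type)
    )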
After doing all this, the Hive table test has the new column city, populated with null values as expected.
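Within the same session this can be confirmed with a quick select; since ALTER TABLE ... ADD COLUMNS is a metadata-only change in Hive, existing rows simply read back as NULL for the new column:

sqlContext.sql("SELECT id, name, city FROM testing.test").show()
# city is NULL for pre-existing rows: no data files are rewritten,
# the new column only exists in the table metadata.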
Now when I close the Spark session, open a new one, and run

hive_df = sqlContext.table("testing.test")
hive_df

I expect to get

DataFrame[id: int, name: string, city: string]

but instead I get

DataFrame[id: int, name: string]
Yet when I describe the table in the Hive CLI, the new column is there:
hive> desc test;
OK
id int
name string
city string
Why is the schema change not reflected in the PySpark DataFrame after the underlying Hive table is altered?

FYI, I am using Spark 1.6.
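In case it helps narrow things down, my working assumption is that the HiveContext is serving the table's schema from cached metastore metadata rather than re-reading it. A hedged sketch of forcing a re-read (refreshTable exists on HiveContext in the 1.x line; whether it fixes this exact case is part of the question):

# Assumption: sqlContext here is a HiveContext, which exposes refreshTable.
sqlContext.refreshTable("testing.test")  # invalidate cached metadata for the table
hive_df = sqlContext.table("testing.test")
hive_df
# If the stale schema was just cached metadata, this should now show:
# DataFrame[id: int, name: string, city: string]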