I have two tables with same column names, same data, same number of rows but ordering of rows might differ. Now I select column A from table_1 and column A from table_2 and compare the values. How can i achieve this using PySpark SQL can I do sha2/md5 checksum and compare?
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.types import *
from pyspark.sql import Row
import pyspark.sql.functions as f
app_name="test"
table1="DB1.department"
table2="DB2.department"
conf = SparkConf().setAppName(app_name)
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
query1="select * from %s" %(table1)
df1 = sqlContext.sql(query1)
query2="select * from %s" %(table2)
df2 = sqlContext.sql(query2)
df3=sqlContext.sql(SELECT DB1.departmentid FROM DB1.department a FULL JOIN
DB2.department b ON a.departmentid = b.departmentid WHERE a.departmentid
IS NULL OR b.departmentid IS NULL)
df5=sqlContext.sql("select md5(departmentid) from department1")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/context.py", line 580, in sql
return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
813, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 51, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve 'md5(departmentid)'
due to data type mismatch: argument 1 requires binary type, however,
'departmentid' is of bigint type.; line 1 pos 11"
when tried with md5 checksum it is saying it expects binarytype but department id is bigint
Table1:
departmentid departmentname departmentaddress
1 A Newyork
2 B Newjersey
3 C SanJose
4 D WashingtonDC
5 E Mexico
6 F Delhi
7 G Pune
8 H chennai
Table2:
departmentid departmentname departmentaddress
7 G Pune
8 H chennai
1 A Newyork
2 B Newjersey
3 C SanJose
4 D WashingtonDC
5 E Mexico
6 F Delhi
Here in table two order of rows has just changed but still data remained so, now technically these two tables are identical. Until and unless new row gets added or values modified the two tables are identical (Tables are taken for example and explanation, in real we deal with Bigdata)