What is an efficient way to verify the schemas of two data frames when the field order in the schema doesn't matter and I only need the data frames to have the same type for each field name they have in common? The schemas might also be nested, so for example a StructField's type might itself be a StructType.
One very simple and direct approach would be the following:
- Serialize the schemas of the two data frames as JSON strings
- Read these JSON strings as objects and then compare them.
This approach will work for all nested schemas; the one caveat is Map columns, which are effectively schema-less in the sense that their keys are not enumerated in the schema.
Since you haven't mentioned the programming language, here's a short Python example:
from pyspark.sql.types import *
import json

def ordered(obj):
    # Recursively sort dict entries and list elements so that field order
    # doesn't affect the comparison
    if isinstance(obj, dict):
        return sorted((k, ordered(v)) for k, v in obj.items())
    if isinstance(obj, list):
        return sorted(ordered(x) for x in obj)
    return obj

def compare_schemas(schema1: StructType, schema2: StructType) -> bool:
    # Serialize both schemas to JSON, then compare their order-normalized forms
    j1 = json.loads(schema1.json())
    j2 = json.loads(schema2.json())
    return ordered(j1) == ordered(j2)

if __name__ == '__main__':
    # Same fields, declared in different orders at both the top and nested level
    nested_section1 = StructType().add("x1", StringType(), True).add("x2", IntegerType(), True)
    nested_section2 = StructType().add("x2", IntegerType(), True).add("x1", StringType(), True)
    s1 = StructType().add("c1", StringType(), True).add("c2", nested_section1, True)
    s2 = StructType().add("c2", nested_section2, True).add("c1", StringType(), True)
    print(compare_schemas(s1, s2))  # True
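If you are starting from actual data frames rather than hand-built schemas, you would pass their schema attributes instead; a minimal sketch, assuming df1 and df2 already exist in your session:

# Assuming df1 and df2 are existing DataFrames; field order is ignored
print(compare_schemas(df1.schema, df2.schema))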
Special case: if s2 is only a subset of s1, then instead of a straight equality comparison you will have to take the difference of the two JSON objects and check that, as sketched below.
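One way to do that difference check, reusing the ordered helper from above, is to verify field by field that every field of the smaller schema exists in the larger one with the same type. The function name is_schema_subset below is just my own label for this sketch, not anything from the Spark API:

def is_schema_subset(small: StructType, big: StructType) -> bool:
    # True if every field of `small` appears in `big` with the same type.
    # Struct fields are compared recursively; all other types are compared
    # by their order-normalized JSON representation.
    big_fields = {f.name: f for f in big.fields}
    for f in small.fields:
        other = big_fields.get(f.name)
        if other is None:
            return False
        if isinstance(f.dataType, StructType) and isinstance(other.dataType, StructType):
            if not is_schema_subset(f.dataType, other.dataType):
                return False
        elif ordered(json.loads(f.dataType.json())) != ordered(json.loads(other.dataType.json())):
            return False
    return True

print(is_schema_subset(s2, s1))  # True when every field of s2 exists in s1 with the same type

Note that this sketch compares only field names and types; nullability and metadata are ignored, which matches the "same type for each field name they have in common" requirement from the question.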
PS: I took some comparison code from here to quickly answer this.

Chitral Verma