
What is an efficient way to verify that two data frames have matching schemas when the field order doesn't matter? I just want the data frames to have the same type for each field name they have in common. The schemas may also be nested, so, for example, a StructField's type might itself be a StructType.

1 Answer


Well, one very simple and direct approach would be the following:

  1. Serialize the schemas of the two data frames as JSON strings
  2. Read these JSON strings as objects and then compare them.

This approach will work for all nested schemas, except for Map columns, which are schema-less.

Since you haven't mentioned the programming language, here's a short Python example:

from pyspark.sql.types import StructType, StringType, IntegerType
import json

def ordered(obj):
    # Recursively normalize a parsed JSON object: dicts become sorted lists of
    # (key, value) pairs and lists are sorted, so ordering differences vanish.
    if isinstance(obj, dict):
        return sorted((k, ordered(v)) for k, v in obj.items())
    elif isinstance(obj, list):
        return sorted(ordered(x) for x in obj)
    else:
        return obj


def compare_schemas(schema1: StructType, schema2: StructType) -> bool:
    # Serialize both schemas to JSON, normalize the ordering and compare.
    j1 = json.loads(schema1.json())
    j2 = json.loads(schema2.json())

    return ordered(j1) == ordered(j2)


if __name__ == '__main__':
    # Two nested struct schemas with the same fields declared in different orders.
    nested_section1 = StructType().add("x1", StringType(), True).add("x2", IntegerType(), True)
    nested_section2 = StructType().add("x2", IntegerType(), True).add("x1", StringType(), True)

    s1 = StructType().add("c1", StringType(), True).add("c2", nested_section1, True)
    s2 = StructType().add("c2", nested_section2, True).add("c1", StringType(), True)

    # Prints True: the schemas match once field order is ignored.
    print(compare_schemas(s1, s2))
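
If you are comparing two existing DataFrames rather than hand-built StructTypes, the same function applies directly to their schema attributes; df1 and df2 below are hypothetical DataFrames:

# df1 and df2 are assumed to already exist; DataFrame.schema is a StructType,
# so it can be passed straight to compare_schemas.
print(compare_schemas(df1.schema, df2.schema))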

Special case: if s2 is a subset of s1, then instead of a straight equality comparison you will have to take the difference of the two JSON objects and check that, as sketched below.
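
As a rough sketch of that subset case (is_subschema is just an illustrative name, and it reuses the ordered helper above; a field of the smaller schema counts as present only if the larger schema has a field with the same name and an identical, order-normalized definition, including any nested struct):

def is_subschema(smaller: StructType, larger: StructType) -> bool:
    # Every field of the smaller schema must appear in the larger schema
    # with an identical, order-normalized JSON definition.
    larger_fields = {f.name: ordered(json.loads(f.json())) for f in larger.fields}
    return all(
        f.name in larger_fields
        and ordered(json.loads(f.json())) == larger_fields[f.name]
        for f in smaller.fields
    )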

PS: I took some comparison code from here to quickly answer this.

Chitral Verma