I have a nested schema that is 6 levels deep, and I am having trouble traversing each element of the schema to modify a column. I have a list of column names that need to be modified (hashed/anonymized).
My initial thought is to traverse each element in the schema, compare each column name with the list items, and modify the column when there is a match. But I do not know how to do it.
List values: ['type', 'name', 'work', 'email']
Sample schema:
|-- abc: struct (nullable = true)
| |-- xyz: struct (nullable = true)
| | |-- abc123: string (nullable = true)
| | |-- services: struct (nullable = true)
| | | |-- service: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- type: string (nullable = true)
| | | | | |-- subtype: string (nullable = true)
| | | |-- name: string (nullable = true)
| | |-- details: struct (nullable = true)
| | | |-- work: (nullable = true)
Note: If I flatten the schema, it creates 600+ columns, so I am looking for a solution that modifies the columns dynamically (no hardcoding).
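To make the traversal idea concrete without flattening, here is a minimal sketch that walks the schema and collects the dotted paths of the fields whose names are in the list. It works on the plain-dict form of the schema (the shape returned by `df.schema.jsonValue()`), so it runs without a Spark session. The function name `find_target_paths` and the helpers `field`, `struct` and `array` are hypothetical, and the type of `work` is assumed to be string because the schema above does not show it.

```python
def find_target_paths(dtype, targets, parent=""):
    """Return dotted paths of fields whose own name is in `targets`."""
    paths = []
    # Primitive types appear as plain strings ("string", "long", ...).
    kind = dtype.get("type") if isinstance(dtype, dict) else None
    if kind == "struct":
        for f in dtype["fields"]:
            path = f"{parent}.{f['name']}" if parent else f["name"]
            if f["name"] in targets:
                paths.append(path)
            else:
                paths.extend(find_target_paths(f["type"], targets, path))
    elif kind == "array":
        # An array has no name of its own, so the parent path is reused.
        paths.extend(find_target_paths(dtype["elementType"], targets, parent))
    return paths


# Hypothetical helpers that only keep the sample schema short.
def field(name, dtype):
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}

def struct(*fields):
    return {"type": "struct", "fields": list(fields)}

def array(element):
    return {"type": "array", "elementType": element, "containsNull": True}


sample = struct(
    field("abc", struct(
        field("xyz", struct(
            field("abc123", "string"),
            field("services", struct(
                field("service", array(struct(
                    field("type", "string"),
                    field("subtype", "string"),
                ))),
                field("name", "string"),
            )),
            field("details", struct(
                field("work", "string"),  # assumed type; not shown in the schema
            )),
        )),
    )),
)

targets = {"type", "name", "work", "email"}
paths = find_target_paths(sample, targets)
print(paths)
# ['abc.xyz.services.service.type', 'abc.xyz.services.name', 'abc.xyz.details.work']
```

Note that a path which passes through an array (such as `abc.xyz.services.service.type`) cannot be updated with a simple dotted column reference; the array elements have to be rebuilt one by one.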
EDIT: In case it helps, here is the code where I am trying to modify the value, but it is broken:
from pyspark.sql.types import ArrayType, StructField, StructType

def change_nested_field_value(schema, new_df, fields_to_change, parent=""):
    new_schema = []
    # Leaf types (e.g. the elements of an array of strings) have no fields to walk.
    if not isinstance(schema, StructType):
        return schema
    for field in schema:
        full_field_name = field.name
        short_name = full_field_name.split('.')
        if parent:
            full_field_name = parent + "." + full_field_name
        # print(full_field_name)
        if short_name[-1] not in fields_to_change:
            if isinstance(field.dataType, StructType):
                inner_schema = change_nested_field_value(field.dataType, new_df, fields_to_change, full_field_name)
                new_schema.append(StructField(field.name, inner_schema))
            elif isinstance(field.dataType, ArrayType):
                inner_schema = change_nested_field_value(field.dataType.elementType, new_df, fields_to_change, full_field_name)
                new_schema.append(StructField(field.name, ArrayType(inner_schema)))
            else:
                new_schema.append(StructField(field.name, field.dataType))
        # else:
            # This is where I have access to the nested element. I need to modify the value here.
            # print(StructField(field.name, field.dataType))
    return StructType(new_schema)
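The function above only rebuilds a schema, and a schema carries no data, so there is no value to modify at the commented-out `else`. One possible direction, sketched here under assumptions rather than as a tested solution, is to use the same recursion to build a Spark SQL expression that rebuilds each column and hashes the matching fields. The names `quote` and `masked_expr` are hypothetical. Using `sha2` with 256 bits as the anonymization, and casting to string first, are also assumptions. The sketch does not handle map types, field names containing a single quote, or null structs (a null struct would likely come back as a struct of nulls).

```python
def quote(name):
    """Backtick-quote a field name for use in a Spark SQL expression."""
    return "`" + name.replace("`", "``") + "`"


def masked_expr(dtype, ref, targets, depth=0):
    """Build a SQL expression that rebuilds `ref` (whose type is `dtype`,
    in the dict form of df.schema.jsonValue()) with every field named in
    `targets` replaced by its SHA-256 hash."""
    kind = dtype.get("type") if isinstance(dtype, dict) else None
    if kind == "struct":
        parts = []
        for f in dtype["fields"]:
            child = f"{ref}.{quote(f['name'])}"
            if f["name"] in targets:
                value = f"sha2(cast({child} as string), 256)"
            else:
                value = masked_expr(f["type"], child, targets, depth)
            parts.append(f"'{f['name']}', {value}")
        return "named_struct(" + ", ".join(parts) + ")"
    if kind == "array":
        # Each nesting level gets its own lambda variable (x0, x1, ...).
        var = f"x{depth}"
        inner = masked_expr(dtype["elementType"], var, targets, depth + 1)
        return f"transform({ref}, {var} -> {inner})"
    # Primitive or unsupported type: pass the value through unchanged.
    return ref


# The array part of the sample schema, in dict form.
elem = {"type": "struct", "fields": [
    {"name": "type", "type": "string"},
    {"name": "subtype", "type": "string"},
]}
service = {"type": "array", "elementType": elem, "containsNull": True}

expr = masked_expr(service, quote("service"), {"type"})
print(expr)
```

With a real DataFrame `df`, the expressions could then be applied per top-level column, for example `df.selectExpr(*[f"{masked_expr(f['type'], quote(f['name']), targets)} as {quote(f['name'])}" for f in df.schema.jsonValue()["fields"]])`. In that usage a top-level column whose own name is in the list is not hashed; only nested fields are.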