
I have a file that stores the column names of each table, and this file is present in blob storage:

table1=['column1','column2','column3']
table2=['column4','column5','column6']

I want to read this txt file in Databricks and use these columns to create a schema for each table, like below. All fields will be StringType.

from pyspark.sql.types import StructType, StructField, StringType

table1_schema = StructType([
    StructField("column1", StringType(), True),
    StructField("column2", StringType(), True),
    StructField("column3", StringType(), True),
])

table2_schema = StructType([
    StructField("column4", StringType(), True),
    StructField("column5", StringType(), True),
    StructField("column6", StringType(), True),
])

How can I achieve this using PySpark?

Swati B
  • are the column names stored as arrays, and each table's column names as separate column of the table in blob? or just a simple text file that has that line as a string row? – samkart Aug 19 '22 at 06:40
  • just a simple text file that has that line as a string row – Swati B Aug 19 '22 at 06:44

1 Answer


I'm not versed in blob storage, but assuming you're able to read the text file from that storage (see this for help), you can create a dict of the schema strings, one per table.
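For completeness, here's a minimal sketch of reading the file directly in Databricks (where spark is the preconfigured session). The wasbs:// account, container, and file names are hypothetical, and setting up access (account keys or a mount) is out of scope:

# Hypothetical blob storage path; configure access (account key or mount) first.
raw = spark.read.text("wasbs://mycontainer@myaccount.blob.core.windows.net/schemas/columns.txt")
lines = [row.value for row in raw.collect()]  # one string per line of the file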

FYI: schemas can also be passed as a DDL string like 'column1 string, column2 string, column3 string'. I'll share an example of that as well.
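To show that such a string really is accepted wherever a schema is expected, here's a tiny check (the sample row is made up):

df = spark.createDataFrame(
    [("a", "b", "c")],
    schema="column1 string, column2 string, column3 string",
)
df.printSchema()
# root
#  |-- column1: string (nullable = true)
#  |-- column2: string (nullable = true)
#  |-- column3: string (nullable = true)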

For this example, I've stored a .txt file on an accessible drive containing two lines with the following strings.

"table1=['column1','column2','column3']"
"table2=['column4','column5','column6']"

The idea is to split each line on "=", use the first part as the dict key (the table name), and build a schema string from the second part (which can be evaluated as a Python list).

schema_dict = {}

with open('./drive/MyDrive/blahblah.txt', 'r') as col_txt:
    for linerow in col_txt:
        # split "table1=['column1',...]" into the table name and the list literal
        stuff = linerow.split('=')
        # evaluate the list literal and join the names into a DDL schema string
        schema_dict[stuff[0].strip()] = ', '.join([k + ' string' for k in eval(stuff[1])])

schema_dict
# {'table1': 'column1 string, column2 string, column3 string',
#  'table2': 'column4 string, column5 string, column6 string'}

schema_dict['table2']
# column4 string, column5 string, column6 string
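These strings can be passed straight to a reader. A hypothetical usage, assuming each table's data sits in a CSV named after it (the paths are made up for illustration):

dfs = {
    table: spark.read.csv(f"/mnt/data/{table}.csv", schema=ddl, header=True)
    for table, ddl in schema_dict.items()
}
dfs["table1"].printSchema()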

Alternatively, building a StructType of StructFields is just as easy.

from pyspark.sql.types import StructField, StringType, StructType

schema_dict = {}

with open('./drive/MyDrive/blahblah.txt', 'r') as col_txt:
    for linerow in col_txt:
        stuff = linerow.split('=')
        # one nullable StringType field per column name in the list
        schema_dict[stuff[0].strip()] = StructType([StructField(k, StringType(), True) for k in eval(stuff[1])])

schema_dict
# {'table1': StructType(List(StructField(column1,StringType,true),StructField(column2,StringType,true),StructField(column3,StringType,true))),
#  'table2': StructType(List(StructField(column4,StringType,true),StructField(column5,StringType,true),StructField(column6,StringType,true)))}

schema_dict['table1']
# StructType(List(StructField(column1,StringType,true),StructField(column2,StringType,true),StructField(column3,StringType,true)))

type(schema_dict['table1'])
# pyspark.sql.types.StructType
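One caveat: eval will run any code that ends up in the file. If the file isn't fully trusted, ast.literal_eval from the standard library parses the list literal without executing anything; a drop-in variant of the loop above:

import ast
from pyspark.sql.types import StructField, StringType, StructType

schema_dict = {}

with open('./drive/MyDrive/blahblah.txt', 'r') as col_txt:
    for linerow in col_txt:
        name, _, cols = linerow.partition('=')
        # literal_eval only accepts Python literals, so it can't execute code
        schema_dict[name.strip()] = StructType([StructField(c, StringType(), True) for c in ast.literal_eval(cols)])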
samkart