Currently I have file structure defined so dataframe schema is like below(Sample data used from different source)
schema = StructType([
StructField('id', StringType(), True),
StructField('dept', StringType(), True),
StructField('salary', IntegerType(), True),
StructField('location', StringType(), True)
])
df = spark.createDataFrame([(36636,'Finance' ,3000, 'USA'),
(40288,'Finance' , 5000, 'IND'),
(42114,'Sales', 3900,'USA'),
(39192,'Marketing',2500, 'CAN'),
(34534,'Sales', 6500, 'USA')],
schema=schema)
I do below operation to create map type for 2 columns like below
df = df.withColumn("propertiesMap",create_map(
lit("salary"),col("salary"),
lit("location"),col("location")
)).drop("salary","location")
And my dataframe looks like this
+-----+---------+---------------------------------+
|id |dept |propertiesMap |
+-----+---------+---------------------------------+
|36636|Finance |[salary -> 3000, location -> USA]|
|40288|Finance |[salary -> 5000, location -> IND]|
|42114|Sales |[salary -> 3900, location -> USA]|
|39192|Marketing|[salary -> 2500, location -> CAN]|
|34534|Sales |[salary -> 6500, location -> USA]|
+-----+---------+---------------------------------+
Going forward the input file may have dynamic columns like
file1.csv
id,dept,Salary,location
file2.csv
id,dept,salary
file3.csv
id,dept
file4.csv
id,dept,firstname,lastname,middlename,address
In all the cases id,dept
do not change. But all other columns are dynamic. I process file by file
For Example take file1.csv from above
fixed_columns = [id,dept]
all_columns= df.columns
dynamic_col = list(set(all_columns) - set(fixed_columns))
Gives
dynamic_col = [salary, location]
And I want something like and to use append?. Not sure
for i in dynamic_col;
df = df.withColumn('propertiesMap', create_map( lit(i), col(i)))
Once all files got processed and appended into final dataframe it has to looklike
+-----+---------+------------------------------------------------------+
|id |dept |propertiesMap |
+-----+---------+------------------------------------------------------+
|36636|Finance |[salary -> 3000, location -> USA] |
|40288|Finance |[salary -> 5000, location -> IND] |
|42114|Sales |[salary -> 3900, location -> USA] |
|39192|Marketing|[salary -> 2500, location -> CAN] |
|34534|Sales |[salary -> 6500, location -> USA] |
|36636|Finance |[firstname -> kevin, lastname -> Miller] |
|40288|Finance |[firstname -> aaron, lastname -> sahn] |
|42114|Sales |[firstname -> daron, lastname -> ket] |
|39192|Marketing|[] |
|34534|Sales |[firstname -> dev, lastname -> dis, middlename -> Sam]|
+-----+---------+------------------------------------------------------+
I don't use pandas.