Pyspark add new record to each Row

Question

I'm using Spark 2.3.1. I'm reading data from a JSON file, and it contains five records of type <class 'pyspark.sql.types.Row'>, like this:

Row(age=24, payloadId=1, salary=2900)

I want to add a new value to all five records. The new value is a dictionary like this:

{'age_condition':True,'salary_condition':True}

so the new Row should look like this:

Row(age=24, payloadId=1, salary=2900, Result={'age_condition':True,'salary_condition':True})

score 0 · Answer 1 · answered Aug 18 '20 at 10:39

How about this? Be aware that the Result column is stored as a string, not as a dict.

import pyspark.sql.functions as f
from pyspark.sql.types import Row

row_list = [Row(age=24, payloadId=1, salary=2900)]
row_add = {'age_condition':True,'salary_condition':True}

spark.createDataFrame(row_list) \
  .withColumn('Result', f.lit(str(row_add))) \
  .collect()

[Row(age=24, payloadId=1, salary=2900, Result="{'age_condition': True, 'salary_condition': True}")]
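A note on the string caveat: str() on a dict produces Python's repr (single quotes, capitalized True), which is not valid JSON. Below is a minimal plain-Python sketch, assuming you are free to choose how the dict is serialized before it goes into the column. It only uses the standard library and does not touch Spark; the variable names other than row_add are made up for illustration.

```python
import ast
import json

row_add = {'age_condition': True, 'salary_condition': True}

# str() gives the Python repr: single quotes and capitalized True
as_repr = str(row_add)

# json.dumps gives valid JSON: double quotes and lowercase true
as_json = json.dumps(row_add)

# The repr string is not valid JSON, so json.loads rejects it
try:
    json.loads(as_repr)
    repr_is_json = True
except ValueError:
    repr_is_json = False

# The repr can only be read back by evaluating it as a Python literal
from_repr = ast.literal_eval(as_repr)

# The JSON string round-trips through any JSON parser
from_json_str = json.loads(as_json)
```

If the column has to be a string, passing json.dumps(row_add) instead of str(row_add) to f.lit should make it easier to parse the value again later, for example with a JSON parser on the consuming side.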

score 0 · Answer 2 · answered Aug 18 '20 at 10:53

I don't know why you want to complicate things by putting a dictionary into a DataFrame column; you should add two new boolean columns, age_condition and salary_condition, instead.

This should do what you want...

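from pyspark.sql.types import BooleanType, StructField, StructType

# One struct column named "dict" that holds the two boolean flags
schema = StructType([
    StructField("dict", StructType([
        StructField("age_condition", BooleanType(), True),
        StructField("salary_condition", BooleanType(), True),
    ]), True),
])

# The record must be keyed by the column name "dict"; otherwise the column comes out null
newDf = spark.createDataFrame([{'dict': {'age_condition': True, 'salary_condition': True}}], schema=schema)

df = spark.read.json("/whatever/json/path")

# newDf has exactly one row, so the cross join keeps the row count of df
result = df.crossJoin(newDf)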
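
To see why the cross join leaves the row count unchanged, here is a plain-Python sketch with no Spark involved. A cross join pairs every row on the left with every row on the right, so five rows times one row is still five rows. The records and the names records, extra, and joined are made up for illustration; only the first record matches the one in the question.

```python
from itertools import product

# Hypothetical stand-ins for the five records read from the JSON file
records = [
    {'age': 24 + i, 'payloadId': i + 1, 'salary': 2900 + 100 * i}
    for i in range(5)
]

# Stand-in for the single-row DataFrame that carries the new value
extra = [{'Result': {'age_condition': True, 'salary_condition': True}}]

# Cross join: every (left, right) pair, merged into one record
joined = [{**left, **right} for left, right in product(records, extra)]
```

If extra had two rows, joined would have ten, so this trick only works when the DataFrame being joined has exactly one row.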
