2

I have a dataframe df with the following schema:

root
 |-- city_name: string (nullable = true)
 |-- person: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)

What I want to do is add a nested column, say car_brand to my person structure. How would I do it?

The expected final schema would look like this:

root
 |-- city_name: string (nullable = true)
 |-- person: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- car_brand: string (nullable = true)
zero323
  • 322,348
  • 103
  • 959
  • 935
belka
  • 1,480
  • 1
  • 18
  • 31

3 Answers3

4

You can unpack the struct and add it to a new one, including the new column at the same time. For example, adding "bmw" to all persons in the dataframe be done like this:

df.withColumn("person", struct($"person.*", lit("bmw").as("car_brand")))
Shaido
  • 27,497
  • 23
  • 70
  • 73
  • This solution does not work with three level nested fields. How can we add `nested_col` field here? root |-- city_name: string (nullable = true) |-- person: struct (nullable = true) | |-- age: long (nullable = true) | |-- name: string (nullable = true) |-- nested_col: string (nullable = true) – Sindhu May 01 '18 at 00:19
  • @Sindhu If I understand correctly you want to add the `nested_col` to the person struct? You can do this by using the method in the answer, just change`lit("bmw")` to `$"nested_col"`. – Shaido May 01 '18 at 01:19
  • No, Actually I want to add a column when there is one more level of nesting per say `name` here, assuming `name` is struct here just like `person`. – Sindhu May 01 '18 at 01:36
  • @Sindhu If there isn't too many columns then you could do something like follows: `withColumn("person", struct($"person.age", struct($"person.name", $"nested_col").as("name)))`. However, this will change a bit depending on the type of the column (here your name column is a string, not a struct). If you want a more detailed answer for your use case, I would recommend asking a new question on the site with more information (you can leave the link in a comment here). – Shaido May 01 '18 at 01:51
2

Adding a new nested column within person:

df = df.withColumn(
        "person",
        struct(
            $"person.*",
            struct(
                lit("value_1").as("person_field_1"),
                lit("value_2").as("person_field_2"),
            ).as("nested_column_within_person")
       )
    )

Final schema :

root
 |-- city_name: string (nullable = true)
 |-- person: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- nested_column_within_person: struct (nullable = true)
 |    |    |-- person_field_1: string (nullable = true)
 |    |    |-- person_field_2: string (nullable = true)
Vijayant
  • 612
  • 2
  • 6
  • 17
  • Is there any syntax changed in the recent versions? i'm getting SyntaxError: invalid syntax for cdc_df.withColumn("AuditPayload", struct( lit("value_1").as("person_field_1"),lit("value_2").as("person_field_2"),).as("nested_column_within_person"))) – Sads Mar 18 '20 at 02:18
1
import pyspark.sql.functions as func
dF = dF.withColumn(
        "person",
   func.struct(
            "person.age",
                func.struct(
                            "person.name",
                            func.lit(None).alias("NestedCol_Name")
                    ).alias("name")
       )
       )
O/P Schema:-
root
 |-- city_name: string (nullable = true)
 |-- person: struct (nullable = false)
 |    |-- age: string (nullable = true)
 |    |-- name: struct (nullable = false)
 |    |    |-- name: string (nullable = true)
 |    |    |-- NestedCol_Name: null (nullable = true)
  • this solution will help to add the nested column at more than one level and as per above example it will add a nested column i.e. ('NestedCol_Name') inside name column which we already have. Hope it answers @Sindhu's question – user2310605 Dec 11 '20 at 11:44