Flatten Nested Spark Dataframe

Question

Is there a way to flatten an arbitrarily nested Spark Dataframe? Most of the work I'm seeing is written for a specific schema, and I'd like to be able to generically flatten a Dataframe with different nested types (e.g. StructType, ArrayType, MapType, etc.).

Say I have a schema like:

StructType(List(StructField(field1,...), StructField(field2,...), ArrayType(StructType(List(StructField(nested_field1,...), StructField(nested_field2,...)),nested_array,...)))

Looking to adapt this into a flat table with a structure like:

field1
field2
nested_array.nested_field1
nested_array.nested_field2

FYI, looking for suggestions for Pyspark, but other flavors of Spark are also appreciated.

What do you mean by "flatten a Dataframe with different nested types (e.g. StructType, ArrayType, MapType)"? I am not so naive as to expect code, but at least a good description of the problem with examples would be useful. – zero323, Dec 14 '15 at 16:13

score 24 · Answer 1 · edited May 08 '19 at 18:31


This issue might be a bit old, but for anyone still looking for a solution: you can flatten struct columns inline by selecting them with a "column_name.*" expression.

First, let's create the nested dataframe:

from pyspark.sql import HiveContext
hc = HiveContext(sc)
nested_df = hc.read.json(sc.parallelize(["""
{
  "field1": 1, 
  "field2": 2, 
  "nested_array":{
     "nested_field1": 3,
     "nested_field2": 4
  }
}
"""]))

Now flatten it:

flat_df = nested_df.select("field1", "field2", "nested_array.*")

You'll find useful examples here: https://docs.databricks.com/delta/data-transformation/complex-types.html

If you have many nested struct columns, you can pick them out of the dtypes and expand them generically:

flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
flat_df = nested_df.select(*flat_cols, *[c + ".*" for c in nested_cols])
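To see what the two list comprehensions pick up, here is a small sketch that applies the same prefix test to a hand-written stand-in for nested_df.dtypes, so it runs without a Spark session. The type strings are what I would expect Spark to infer for the JSON above, so treat them as an assumption, and build_select_exprs is just an illustrative name:

```python
# Hand-written stand-in for nested_df.dtypes: a list of (name, type string) pairs.
# The exact type strings are an assumption about what Spark infers for the JSON above.
sample_dtypes = [
    ("field1", "bigint"),
    ("field2", "bigint"),
    ("nested_array", "struct<nested_field1:bigint,nested_field2:bigint>"),
]

def build_select_exprs(dtypes):
    """Return select expressions that expand every top-level struct column."""
    flat_cols = [name for name, dtype in dtypes if not dtype.startswith("struct")]
    nested_cols = [name for name, dtype in dtypes if dtype.startswith("struct")]
    return flat_cols + [name + ".*" for name in nested_cols]

select_exprs = build_select_exprs(sample_dtypes)
print(select_exprs)  # ['field1', 'field2', 'nested_array.*']
```

With a real DataFrame this would be used as nested_df.select(*build_select_exprs(nested_df.dtypes)). It expands only one level, so structs nested inside structs need another pass, and array columns are passed through untouched.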

edited May 08 '19 at 18:31

Even Mien


answered Aug 15 '17 at 14:18

MaFF


Hi @MaFF, your solution is really helpful. I have a question: suppose that in the example you provided nested_array is an array of structs. How can I get nested_field1 and nested_field2 into separate columns? I can have multiple structs with the same key fields and different values, and I couldn't solve it. It would be really helpful if you could provide some insight. – Saikat May 22 '20 at 18:27
Is there a possibility to do this recursively, where structs are even more nested? Any tips? – Sip Jun 21 '22 at 09:43
This doesn't work for nested json. Have to solve it using recursion. – Bharath Ram Jul 16 '22 at 17:17

Narahari B M · Answer 2 · 2022-07-26T10:31:47.420

This flattens a nested DataFrame that contains both struct and array columns, which typically helps when the data is read in from JSON. It improves on this answer: https://stackoverflow.com/a/56533459/7131019

from pyspark.sql import functions as f

def flatten_structs(nested_df):
    stack = [((), nested_df)]
    columns = []

    while len(stack) > 0:
        
        parents, df = stack.pop()
        
        # Non-struct columns (arrays included) are kept as-is; the alias joins
        # the parent names with "_" so the flattened names stay unique.
        flat_cols = [
            f.col(".".join(parents + (c[0],))).alias("_".join(parents + (c[0],)))
            for c in df.dtypes
            if c[1][:6] != "struct"
        ]

        nested_cols = [
            c[0]
            for c in df.dtypes
            if c[1][:6] == "struct"
        ]
        
        columns.extend(flat_cols)

        for nested_col in nested_cols:
            projected_df = df.select(nested_col + ".*")
            stack.append((parents + (nested_col,), projected_df))
        
    return nested_df.select(columns)

def flatten_array_struct_df(df):
    
    array_cols = [
            c[0]
            for c in df.dtypes
            if c[1][:5] == "array"
        ]
    
    while len(array_cols) > 0:
        
        for array_col in array_cols:
            # explode() yields one row per array element; rows whose array
            # is null or empty are dropped.
            df = df.withColumn(array_col, f.explode(f.col(array_col)))
            
        df = flatten_structs(df)
        
        array_cols = [
            c[0]
            for c in df.dtypes
            if c[1][:5] == "array"
        ]
    return df

flat_df = flatten_array_struct_df(df)
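To make the effect of the explode-then-flatten loop concrete, here is a plain-Python model of the same idea that works on a list of dicts instead of a DataFrame. It is only an illustration under stated assumptions: explode_and_flatten is a made-up name, lists stand in for arrays and dicts for structs, it flattens one struct level per pass rather than all at once, and like f.explode it drops records whose array is empty.

```python
def explode_and_flatten(records):
    """Repeat until nothing is nested: flatten dict fields, explode list fields."""
    changed = True
    while changed:
        changed = False
        next_records = []
        for rec in records:
            # Flatten one level of "structs": {"a": {"b": 1}} -> {"a_b": 1}
            flat = {}
            for key, value in rec.items():
                if isinstance(value, dict):
                    changed = True
                    for sub_key, sub_value in value.items():
                        flat[f"{key}_{sub_key}"] = sub_value
                else:
                    flat[key] = value
            # Explode the first "array" found: one output record per element
            list_key = next((k for k, v in flat.items() if isinstance(v, list)), None)
            if list_key is None:
                next_records.append(flat)
            else:
                changed = True
                for element in flat[list_key]:
                    next_records.append({**flat, list_key: element})
        records = next_records
    return records

rows = explode_and_flatten([
    {"id": 1, "items": [{"sku": "a", "qty": 2}, {"sku": "b", "qty": 1}]},
])
print(rows)
# [{'id': 1, 'items_sku': 'a', 'items_qty': 2}, {'id': 1, 'items_sku': 'b', 'items_qty': 1}]
```

Each array element becomes its own row and the struct fields become prefixed columns, which is the shape asked about in the comments above for an array of structs. If rows with null or empty arrays must be kept in the Spark version, f.explode_outer can be used in place of f.explode.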


score 2 · Answer 3 · answered Jan 06 '16 at 13:48

Here's my final approach:

1) Map the rows in the dataframe to an rdd of dict. Find suitable python code online for flattening dict.

flat_rdd = nested_df.rdd.map(flatten)

where

def flatten(x):
  x_dict = x.asDict()
  ...some flattening code...
  return x_dict

2) Convert the RDD[dict] back to a dataframe

flat_df = sqlContext.createDataFrame(flat_rdd)
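The flattening step itself is left open above. As a minimal sketch of what such a helper could look like (flatten_dict and the "_" separator are my own choices, not part of the answer), the function below recursively flattens nested dicts. It assumes the row was converted with x.asDict(recursive=True), since a plain asDict() leaves nested Rows as Row objects rather than dicts:

```python
def flatten_dict(nested, parent_key="", sep="_"):
    """Recursively flatten nested dicts: {"a": {"b": 1}} -> {"a_b": 1}."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_dict(value, full_key, sep))
        else:
            flat[full_key] = value  # lists and scalars are left as they are
    return flat

example = {
    "field1": 1,
    "field2": 2,
    "nested_array": {"nested_field1": 3, "nested_field2": 4},
}
print(flatten_dict(example))
# {'field1': 1, 'field2': 2, 'nested_array_nested_field1': 3, 'nested_array_nested_field2': 4}
```

Inside flatten(x) this would be called as flatten_dict(x.asDict(recursive=True)). Lists are not exploded here, so array columns would still need separate handling.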

Igor Tavares · Answer 4 · 2020-12-31T13:05:19.947


I've developed a recursive approach to flatten any nested DataFrame.

The implementation is on the AWS Data Wrangler code base on GitHub.

P.S. Spark support has been deprecated in the package, but the code base is still useful.

edited Dec 31 '20 at 13:05

answered Oct 18 '19 at 01:35

Igor Tavares



Page not found!! – Mohammad Rijwan Dec 30 '20 at 06:48

It is very recommended to put the code directly here instead of a link. – Itération 122442 Oct 11 '22 at 12:31

score 1 · Answer 5 · answered Mar 15 '23 at 14:34

I wrote it the following way:

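def to_flatten(df):
    # needConversion() is True for struct columns (and for some other types)
    for field in df.schema:
        if field.needConversion():
            # Replace <<your_inner_column>> with the nested field to pull up
            df = df.withColumn(f"{field.name}.<<your_inner_column>>",
                               df[f"{field.name}.<<your_inner_column>>"])
    return df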

The solution does not drop existing columns.

The nested columns are of type StructType and for StructType the needConversion() method returns True.

(Beware that for some other types needConversion() returns True, however they were not part of my dataframe)

For me it yields a result similar to the stack-based solution: https://stackoverflow.com/a/65256632/21404451.

bhavin tandel · Answer 6 · 2019-03-15T09:56:02.377

The following gist flattens the structure of nested JSON data:

import typing as T

import cytoolz.curried as tz
import pyspark


def schema_to_columns(schema: pyspark.sql.types.StructType) -> T.List[T.List[str]]:
    """
    Produce a flat list of column specs from a possibly nested DataFrame schema
    """

    columns = list()

    def helper(schm: pyspark.sql.types.StructType, prefix: list = None):

        if prefix is None:
            prefix = list()

        for item in schm.fields:
            if isinstance(item.dataType, pyspark.sql.types.StructType):
                helper(item.dataType, prefix + [item.name])
            else:
                columns.append(prefix + [item.name])

    helper(schema)

    return columns

def flatten_frame(frame: pyspark.sql.DataFrame) -> pyspark.sql.DataFrame:

    aliased_columns = list()

    for col_spec in schema_to_columns(frame.schema):
        c = tz.get_in(col_spec, frame)
        if len(col_spec) == 1:
            aliased_columns.append(c)
        else:
            aliased_columns.append(c.alias(':'.join(col_spec)))

    return frame.select(aliased_columns)

You can then flatten the nested data as

flatten_data = flatten_frame(nested_df)

This will give you the flattened DataFrame.

The gist was taken from https://gist.github.com/DGrady/b7e7ff3a80d7ee16b168eb84603f5599
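To illustrate the recursion in schema_to_columns without a Spark session, here is a sketch that walks a plain dict shaped like the output of frame.schema.jsonValue(). The sample schema is hand-written and leaf_paths is an illustrative name, so take both as assumptions rather than part of the gist:

```python
# Hand-written stand-in for frame.schema.jsonValue(); the shape is assumed.
sample_schema = {
    "type": "struct",
    "fields": [
        {"name": "field1", "type": "long", "nullable": True, "metadata": {}},
        {"name": "outer", "nullable": True, "metadata": {}, "type": {
            "type": "struct",
            "fields": [
                {"name": "inner", "nullable": True, "metadata": {}, "type": {
                    "type": "struct",
                    "fields": [
                        {"name": "leaf", "type": "string", "nullable": True, "metadata": {}},
                    ],
                }},
                {"name": "value", "type": "double", "nullable": True, "metadata": {}},
            ],
        }},
    ],
}

def leaf_paths(struct, prefix=()):
    """Yield the path of every non-struct field, depth first."""
    for field in struct["fields"]:
        path = prefix + (field["name"],)
        dtype = field["type"]
        if isinstance(dtype, dict) and dtype.get("type") == "struct":
            yield from leaf_paths(dtype, path)  # descend into the nested struct
        else:
            yield path

paths = list(leaf_paths(sample_schema))
aliases = [":".join(p) for p in paths]
print(aliases)  # ['field1', 'outer:inner:leaf', 'outer:value']
```

The aliases follow the gist's convention of joining the path with ":". Structs at any depth are handled, but arrays and maps are treated as leaves, so arrays of structs are not expanded.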
