
I just discovered the json_normalize function, which works great for taking a JSON object and giving me a pandas DataFrame. Now I want the reverse operation, which takes that same DataFrame and gives me JSON (or a JSON-like dictionary which I can easily turn into JSON) with the same structure as the original JSON.

Here's an example: https://hackersandslackers.com/json-into-pandas-dataframes/.

They take a JSON object (or JSON-like Python dictionary) and turn it into a DataFrame, but I now want to take that DataFrame and turn it back into a JSON-like dictionary (to later dump to a JSON file).
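To make the forward direction concrete, here is a minimal sketch; the sample record is invented for illustration:

```python
# Minimal sketch of the forward direction with pandas.json_normalize;
# the sample record is made up for illustration.
import pandas as pd

record = {"album": {"artist": "Miles Davis", "href": "https://example.com/kob"},
          "plays": 42}

df = pd.json_normalize(record)
# Nested keys become dot-separated column names such as "album.artist",
# which is exactly the structure I want to reverse.
print(sorted(df.columns))
```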

Parsa T
  • This does not work in this case: d={'a': 1, 'd': [1, '#d_i1', 3], 'c_a': '#a_val', 'c_b_x': '#x_value', 'c_b_y': '#y'}; the output is: {'a': 1, 'd': 1, 'c': {'a': '#a_val', 'b': {'x': '#x_value', 'y': '#y'}}} – alihejrati Dec 02 '22 at 12:13

8 Answers


I implemented it with a couple of functions:

def set_for_keys(my_dict, key_arr, val):
    """
    Set val at the path in my_dict defined by the array of keys key_arr.
    """
    current = my_dict
    for i, key in enumerate(key_arr):
        if key not in current:
            if i == len(key_arr) - 1:
                current[key] = val
            else:
                current[key] = {}
        elif not isinstance(current[key], dict):
            raise ValueError(
                "Given dictionary is not compatible with the key structure "
                f"requested: key {key!r} is already occupied by a non-dict value"
            )
        current = current[key]

    return my_dict

def to_formatted_json(df, sep="."):
    result = []
    for _, row in df.iterrows():
        parsed_row = {}
        for col, val in row.items():  # iteritems() was removed in pandas 2.0
            keys = col.split(sep)
            parsed_row = set_for_keys(parsed_row, keys, val)

        result.append(parsed_row)
    return result


# where df was parsed from a JSON dict using json_normalize
to_formatted_json(df, sep=".")
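A quick round-trip sanity check. The two helpers are repeated here in compact form so the snippet runs on its own; the sample data is made up:

```python
# Round-trip check: flatten with json_normalize, then rebuild with the
# helpers above (repeated compactly so the snippet is self-contained).
import pandas as pd

def set_for_keys(my_dict, key_arr, val):
    current = my_dict
    for i, key in enumerate(key_arr):
        if key not in current:
            current[key] = val if i == len(key_arr) - 1 else {}
        elif not isinstance(current[key], dict):
            raise ValueError("Dictionary key already occupied")
        current = current[key]
    return my_dict

def to_formatted_json(df, sep="."):
    result = []
    for _, row in df.iterrows():
        parsed_row = {}
        for col, val in row.items():
            set_for_keys(parsed_row, col.split(sep), val)
        result.append(parsed_row)
    return result

original = {"a": 1, "b": {"c": 2, "d": 3}}
df = pd.json_normalize(original)
rebuilt = to_formatted_json(df)
assert rebuilt == [original]  # one dict per DataFrame row
```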
Parsa T

A simpler approach that uses only one function:

def df_to_formatted_json(df, sep="."):
    """
    The opposite of json_normalize.
    """
    result = []
    for _, row in df.iterrows():
        parsed_row = {}
        for col_label, v in row.items():
            keys = col_label.split(sep)

            current = parsed_row
            for i, k in enumerate(keys):
                if i == len(keys) - 1:
                    current[k] = v
                else:
                    if k not in current:
                        current[k] = {}
                    current = current[k]
        # save
        result.append(parsed_row)
    return result
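A usage sketch with invented sample data; the function is repeated in lightly condensed form so the snippet runs standalone:

```python
# Each DataFrame row becomes one nested dict.
import pandas as pd

def df_to_formatted_json(df, sep="."):
    result = []
    for _, row in df.iterrows():
        parsed_row = {}
        for col_label, v in row.items():
            keys = col_label.split(sep)
            current = parsed_row
            for i, k in enumerate(keys):
                if i == len(keys) - 1:
                    current[k] = v
                else:
                    current = current.setdefault(k, {})
        result.append(parsed_row)
    return result

df = pd.DataFrame({"id": [1, 2],
                   "user.name": ["ann", "bob"],
                   "user.age": [30, 40]})
out = df_to_formatted_json(df)
# out[0] nests the user.* columns under a "user" key
```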
Yaakov Bressler
df.to_json(path)

or

df.to_dict()
Dallas Lindauer
  • Except I want to retrieve the original JSON structure. `json_normalize` creates DataFrame columns with names like album.artist and album.href if artist and href are properties of the album object. This gives me a dictionary with album.artist and album.href as keys, where I need an album key containing artist and href keys – Parsa T Feb 20 '19 at 00:32
  • Can you provide some more specific information about the data? Also, take a look at the orient parameter for the to_json method. It might have what you're looking for. – Dallas Lindauer Feb 20 '19 at 00:40
    The data is almost identical to the example posted in the link, specifically the first dataframe shown. And I've seen the orient parameter but it doesn't infer json structure from column names. If there's no existing function which does this, I'll probably write it myself and post it up at some point – Parsa T Feb 20 '19 at 01:10
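A small sketch of the limitation discussed in the comments: to_dict() keeps the flattened column names as-is, so the nesting is not restored. The sample data is made up:

```python
# to_dict() does not infer nesting from dot-separated column names.
import pandas as pd

df = pd.json_normalize({"album": {"artist": "X", "href": "Y"}})
print(df.to_dict(orient="records"))
# the dotted keys survive: [{'album.artist': 'X', 'album.href': 'Y'}]
```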

I just implemented this using two functions.

  1. Get a full list of fields from the DataFrame that are part of a nested field — only the parent, i.e. if location.city.code fits the criteria, we only care about location.city. Sort it by the deepest level of nesting, i.e. location.city is nested deeper than location.
  2. Starting with the most deeply nested parent field, find all child fields by searching the column names. Create a field in the DataFrame for the parent field, which is a combination of all child fields (renamed so that they lose the nesting structure, e.g. location.city.code becomes code), converted to JSON and then loaded into a dictionary value. Finally, drop all of the child fields.
import json
from typing import List

import pandas as pd


def _get_nested_fields(df: pd.DataFrame) -> List[str]:
    """Return a list of nested fields, sorted by the deepest level of nesting first."""
    nested_fields = [*{field.rsplit(".", 1)[0] for field in df.columns if "." in field}]
    nested_fields.sort(key=lambda record: len(record.split(".")), reverse=True)
    return nested_fields


def df_denormalize(df: pd.DataFrame) -> pd.DataFrame:
    """
    Convert a normalised DataFrame into a nested structure.

    Fields separated by '.' are considered part of a nested structure.
    """
    nested_fields = _get_nested_fields(df)
    for field in nested_fields:
        # match only true children, not columns that merely contain the name
        list_of_children = [
            column for column in df.columns if column.startswith(f"{field}.")
        ]
        rename = {
            field_name: field_name.rsplit(".", 1)[1] for field_name in list_of_children
        }
        renamed_fields = df[list_of_children].rename(columns=rename)
        df[field] = json.loads(renamed_fields.to_json(orient="records"))
        df.drop(list_of_children, axis=1, inplace=True)
    return df
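A usage sketch with invented sample data; the helpers are repeated so the snippet runs standalone (the child lookup uses a startswith check to avoid accidental substring matches):

```python
import json
from typing import List

import pandas as pd

def _get_nested_fields(df: pd.DataFrame) -> List[str]:
    nested = [*{field.rsplit(".", 1)[0] for field in df.columns if "." in field}]
    nested.sort(key=lambda record: len(record.split(".")), reverse=True)
    return nested

def df_denormalize(df: pd.DataFrame) -> pd.DataFrame:
    for field in _get_nested_fields(df):
        children = [c for c in df.columns if c.startswith(f"{field}.")]
        rename = {c: c.rsplit(".", 1)[1] for c in children}
        # serialise the child columns row-wise, then parse back into dicts
        df[field] = json.loads(df[children].rename(columns=rename).to_json(orient="records"))
        df.drop(children, axis=1, inplace=True)
    return df

df = pd.json_normalize({"id": 1, "location": {"city": "London", "code": "LDN"}})
out = df_denormalize(df)
# the 'location' column now holds a nested dict per row
```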

Let me throw in my two cents.

After converting back, you might need to drop empty columns from the generated JSONs, so I check whether val != np.nan. You can't do that directly; instead, you need to check whether val == val, because np.nan is not equal to itself. My version:

def to_formatted_json(df, sep="."):
    # relies on set_for_keys() from the accepted answer
    result = []
    for _, row in df.iterrows():
        parsed_row = {}
        for col, val in row.items():  # iteritems() was removed in pandas 2.0
            if val == val:  # False only for NaN
                keys = col.split(sep)
                parsed_row = set_for_keys(parsed_row, keys, val)

        result.append(parsed_row)
    return result
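A quick illustration of the NaN trick used above: NaN is the only common value that is not equal to itself, which is why `val == val` works as a NaN filter.

```python
import numpy as np

# NaN fails every comparison with itself
print(np.nan == np.nan)  # False
print(np.nan != np.nan)  # True
```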
slav

This solution seems to work for me. It is designed to operate on a DataFrame with one row, but it can easily be looped over larger DataFrames.

import numpy as np
import pandas as pd


class JsonRecreate:

    def __init__(self, df):
        self.df = df

    def pandas_to_json(self):
        df = self.df
        # determine the number of nesting levels
        number_levels = np.max([len(i.split('.')) for i in df.columns])
        # put all the nesting levels in a list
        levels = []
        for level_idx in np.arange(number_levels):
            levels.append(np.array([i.split('.')[level_idx] if len(i.split('.')) > level_idx else ''
                                    for i in df.columns.tolist()]))
        self.levels = levels
        return self.create_dict(upper_bound=self.levels[0].shape[0])

    def create_dict(self, level_idx=0, lower_bound=0, upper_bound=100):
        ''' Create the dictionary starting from a pandas dataframe generated by json_normalize '''
        levels = self.levels
        dict_ = {}
        # current nesting level
        level = levels[level_idx]
        # loop over all the relevant elements of the level (relevant w.r.t. its parent)
        for key in [i for i in np.unique(level[lower_bound: upper_bound]) if i != '']:
            # find where a particular key occurs in the level
            correspondence = np.where(level[lower_bound: upper_bound] == key)[0] + lower_bound
            # check whether the value corresponding to the key appears once or multiple times
            if correspondence.shape[0] == 1:
                # if the occurrence is unique, store the value in the dictionary
                dict_[key] = self.df.values[0][correspondence[0]]
            else:
                # otherwise, redefine the relevant bounds and recurse one level deeper
                lower_bound_, upper_bound_ = correspondence.min(), correspondence.max() + 1
                dict_[key] = self.create_dict(level_idx + 1, lower_bound_, upper_bound_)
        return dict_

I tested it with a simple dataframe such as:

df = pd.DataFrame({'a.b': [1], 'a.c.d': [2], 'a.c.e': [3], 'a.z.h1': [-1], 'a.z.h2': [-2], 'f': [4], 'g.h': [5], 'g.i.l': [6], 'g.i.m': [7], 'g.z.h1': [-3], 'g.z.h2': [-4]})

The key order of the original JSON is not exactly preserved in the result, but that can easily be handled if needed.
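For a quick check, the class can be exercised like this on a smaller made-up DataFrame (the class body is repeated so the snippet runs standalone):

```python
import numpy as np
import pandas as pd

class JsonRecreate:

    def __init__(self, df):
        self.df = df

    def pandas_to_json(self):
        df = self.df
        # one entry per nesting level, each holding the key at that depth per column
        number_levels = np.max([len(i.split('.')) for i in df.columns])
        levels = []
        for level_idx in np.arange(number_levels):
            levels.append(np.array([i.split('.')[level_idx] if len(i.split('.')) > level_idx else ''
                                    for i in df.columns.tolist()]))
        self.levels = levels
        return self.create_dict(upper_bound=self.levels[0].shape[0])

    def create_dict(self, level_idx=0, lower_bound=0, upper_bound=100):
        levels = self.levels
        dict_ = {}
        level = levels[level_idx]
        for key in [i for i in np.unique(level[lower_bound: upper_bound]) if i != '']:
            correspondence = np.where(level[lower_bound: upper_bound] == key)[0] + lower_bound
            if correspondence.shape[0] == 1:
                dict_[key] = self.df.values[0][correspondence[0]]
            else:
                lower_bound_, upper_bound_ = correspondence.min(), correspondence.max() + 1
                dict_[key] = self.create_dict(level_idx + 1, lower_bound_, upper_bound_)
        return dict_

df = pd.DataFrame({'a.b': [1], 'a.c': [2], 'f': [3]})
result = JsonRecreate(df).pandas_to_json()
# result == {'a': {'b': 1, 'c': 2}, 'f': 3}
```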


Seems to work just fine.

df.to_json(orient='records')


This is another example of a reverse pd.json_normalize(). It uses deepmerge to handle multiple levels of nested data. It takes a pandas DataFrame and returns a nested dict that can easily be turned into JSON using json.dumps.

import pandas as pd
from deepmerge import always_merger  # third-party: pip install deepmerge


def df_denormalize(df: pd.DataFrame, sep: str = '.') -> list:
    # this is the opposite of pd.json_normalize

    denormalized = []
    for _, row in df.iterrows():
        nested_row = {}
        for col, val in row.items():
            if val == val:  # skip NaNs
                # wrap the value from the innermost key outwards
                tmp = val
                for key in col.split(sep)[::-1]:
                    tmp = {key: tmp}
                nested_row = always_merger.merge(nested_row, tmp)
        denormalized.append(nested_row)

    return denormalized
mik