How to log the size of each created dataframe in pandas

Question

I am looking for a cleaner way to write the code below, possibly with syntax similar to a context manager or a decorator wrapped around a code block.

Currently, whenever a new DataFrame or DataFrame view is created, I log its shape so that logic errors that drop data can be traced. This is useful in any automated processing script, because it shows at which step data disappears, if it does.

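import pandas as pd

def process_data(frame):
    # Record the shape of every intermediate frame so data loss can be traced.
    shape = {}
    shape['original'] = frame.shape

    # Rows whose SHIFT value is longer than two characters are treated as errors.
    errors = frame[frame['SHIFT'].str.len() > 2]
    shape['errors'] = errors.shape

    # Rows whose SHIFT value has at most two characters are fine.
    ok = frame[frame['SHIFT'].str.len() < 3]
    shape['ok'] = ok.shape

    merge_list = [v for v in (errors, ok) if v is not None]
    healed = pd.concat(merge_list)
    shape['healed'] = healed.shape

    # Shapes are (rows, columns) tuples, so they can be compared directly.
    if shape['healed'] != shape['original']:
        raise ValueError(f"Some data loss \n{shape}")
    return healed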

I would prefer to write the process with syntax similar to this:

def process_data(frame):

    with shape_info:
        frame = frame
        errors = frame[frame['SHIFT'].str.len() > 2]
        ok = frame[frame['SHIFT'].str.len() < 3]
        merge_list = [v for v in (errors, ok) if v is not None]
        healed = pd.concat(merge_list)

    if shape_info.first() != shape_info.last():
        raise ValueError(f"Some data loss \n{shape_info}")
    return healed
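
For comparison, here is a minimal sketch of the closest working version I can picture. A `with` block has no built-in way to notice newly assigned local variables, so this assumes every frame is registered explicitly. `ShapeLog` and its `track`, `first` and `last` methods are hypothetical names of my own, not part of pandas.

```python
import pandas as pd


class ShapeLog:
    """Hypothetical helper (not part of pandas): records frame shapes by name."""

    def __init__(self):
        # Dicts keep insertion order, so the first and last entries are well defined.
        self.shapes = {}

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Compare shapes only when the block itself finished without an exception.
        if exc_type is None and self.shapes and self.first() != self.last():
            raise ValueError(f"Some data loss \n{self.shapes}")
        return False  # never suppress exceptions raised inside the block

    def track(self, name, frame):
        """Record the shape of `frame` under `name` and return the frame unchanged."""
        self.shapes[name] = frame.shape
        return frame

    def first(self):
        return next(iter(self.shapes.values()))

    def last(self):
        return list(self.shapes.values())[-1]


def process_data(frame):
    with ShapeLog() as log:
        log.track("original", frame)
        errors = log.track("errors", frame[frame["SHIFT"].str.len() > 2])
        ok = log.track("ok", frame[frame["SHIFT"].str.len() < 3])
        healed = log.track("healed", pd.concat([errors, ok]))
    return healed
```

With this sketch the shape check moves into `__exit__`, so `process_data` no longer needs its own `if` statement, but each frame still has to be passed through `log.track(...)` by hand.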

Is a context manager a good way to track this?
