2

I'm trying to manipulate some CSV data, and normally I would use pandas when I have complex changes. However I have no idea how to deal with nested key values inside one or more CSV fields.

So in essence I have data like this,

+------+------+-------------------+------+------+
| col1 | col2 | col3              | col4 | col5 |
+------+------+-------------------+------+------+
| v    | v    | ncol1=nv,ncol2=nv | v    | v    |
+------+------+-------------------+------+------+
| v    | v    | ncol3=nv          | v    | v    |
+------+------+-------------------+------+------+
| v    | v    |                   | v    | v    |
+------+------+-------------------+------+------+

And I'm trying to get something like,

+------+------+-------+-------+-------+------+------+
| col1 | col2 | ncol1 | ncol2 | ncol3 | col4 | col5 |
+------+------+-------+-------+-------+------+------+
| v    | v    | nv    | nv    |       | v    | v    |
+------+------+-------+-------+-------+------+------+
| v    | v    |       |       | nv    | v    | v    |
+------+------+-------+-------+-------+------+------+
| v    | v    |       |       |       | v    | v    |
+------+------+-------+-------+-------+------+------+
  • check out `explode`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html – jsmart Aug 06 '20 at 18:00
  • Yea, I see the low level ways to get the data into a usable data frame, but I don't know how to actually read the data in. This looks similar to json_normalize, but with a csv. – Cyberpunkspike Aug 06 '20 at 18:05
  • Can you post the CSV file? This link might be useful: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – jsmart Aug 06 '20 at 18:08
  • @jsmart, this example pretty much shows exactly what you are suggesting, and what I was trying to do. https://binal.pub/2018/09/dealing-with-nested-data-in-pandas/ – Cyberpunkspike Aug 06 '20 at 18:13
  • @jsmart, I don't have two nicely separated data structures to start with though, I have the csv, which is exactly as shown in my example. How do I get that nested field into a structure, from which I could use explode. – Cyberpunkspike Aug 06 '20 at 18:14
  • @jsmart, BTW, I want to get concated columns, so explode is actually useless for me, it appears. – Cyberpunkspike Aug 06 '20 at 18:30
  • Is the value in `col3` a list or a string? – Anant Kumar Aug 06 '20 at 19:01

1 Answers1

0

Assuming that the DataFrame Values in Column C is a comma separated string, the code does the following

  1. Creates a dictionary from the comma separated string
  2. Removes all null valued rows/ empty rows present in column C so that the dictionary object previously created can be expanded
  3. Dynamically creates new columns based on the dictionary keys
  4. Expands the Dictionary
  5. Merges the null valued Dataframe and newly created Dataframe
import pandas as pd
import numpy as np
df=pd.DataFrame({"A":['a','b','c',],"B":['e','f','d'],"C":['D=nv,E=nv',np.nan,"D=nv"],})
#Converts string to dictionary of key-value pairs
df.loc[:,"C"]=df.loc[:,"C"].apply(lambda x: dict(map(lambda z: z.split('='),x.split(","))) if type(x)==str else np.nan)
#Drop all null values present in Column so that the dataframe can be expanded
#Separate the null and actual rows containing values into 2 separate dataframes
df_act=df.dropna(subset=["C"])
df_null=df[~df.index.isin(df_act.index)]
#Expand the Column and store in a temporary DataFrame
df_temp=df_act['C'].apply(pd.Series)
for cols in df_temp.columns:
    df_act.loc[:,cols]=np.nan
    df_null.loc[:,cols]=np.nan

#Save Contents in the actual DataFrame
df_act[df_temp.columns]=df_temp
#Drop C Column to match with Sample Output
df_act.drop("C", axis=1, inplace=True)
df_null.drop("C", axis=1, inplace=True)
#Concatenate the DataFrames
final_df=pd.concat([df_act, df_null])

Please note that the removal of C column is only done so that output matches with the sample output provided.

Anant Kumar
  • 611
  • 5
  • 20