I have a movie dataset saved for revenue prediction. However, the genres column of this dataset holds dictionaries, and some rows hold a list of two or more dictionaries. The DataFrame below is not the actual data, but it has a similar structure:

import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, [{'c':4},{'d':3}], [{'c':5, 'd':6},{'c':7, 'd':8}]]})

This is the output:

    a   b
0   1   {'c': 1}
1   2   [{'c': 4}, {'d': 3}]
2   3   [{'c': 5, 'd': 6}, {'c': 7, 'd': 8}]

I need to split this column into separate columns.

How can I do that? I tried the apply(pd.Series) method, and this is the output I get:

    0                   1                   c
0   NaN                 NaN                 1.0
1   {'c': 4}            {'d': 3}            NaN
2   {'c': 5, 'd': 6}    {'c': 5, 'd': 6}    NaN

But I would like something like this, if possible:

    a   c      d
0   1   1      NaN
1   2   4      3
2   3   5,7    6,8 

chirag prajapati

1 Answer

I do not know whether you can achieve what you want with apply(pd.Series), because your 'b' column has mixed types: some cells hold a dictionary and others hold a list of dictionaries. Maybe it is possible, but I am not sure.
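To illustrate why the mixed types matter, here is a minimal sketch (the names from_dict and from_list are only illustrative). A pd.Series built from a dictionary uses the keys as its index, while one built from a list uses positions 0, 1, and so on. That is why apply(pd.Series) produces the columns 0, 1 and c in the question instead of c and d.

```python
import pandas as pd

# A dict cell: the dictionary keys become the index labels.
from_dict = pd.Series({'c': 1})

# A list cell: the positions become the index labels, and each
# element stays an unexpanded dictionary.
from_list = pd.Series([{'c': 4}, {'d': 3}])

print(list(from_dict.index))
print(list(from_list.index))
```

Since apply(pd.Series) aligns all rows on the union of these index labels, dict rows fill only the key columns and list rows fill only the positional ones.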

However, this is how I would do it.
First, loop over your column to build a set of all the new column names, that is, the keys of the dictionaries.
Then use apply with a custom function to extract the value for each new column.
Note that values taken from a list are returned as strings, because values from several dictionaries have to be joined with a comma, as in your row #2.

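import numpy as np

# Collect every key that appears in column 'b'; these become the new columns.
newcols = set()
for el in df['b']:
    if isinstance(el, dict):
        newcols.update(el.keys())
    elif isinstance(el, list):
        for i in el:
            newcols.update(i.keys())

def extractvalues(x, col):
    # Single dict: return the value as-is, or NaN if the key is absent.
    if isinstance(x['b'], dict):
        return x['b'].get(col, np.nan)
    # List of dicts: join the values that are present into one string.
    elif isinstance(x['b'], list):
        vals = [str(i[col]) for i in x['b'] if col in i]
        return ','.join(vals) if vals else np.nan
    # Anything else (for example a missing value) yields NaN.
    return np.nan

# sorted() keeps the order of the new columns deterministic.
for nc in sorted(newcols):
    df[nc] = df.apply(lambda r: extractvalues(r, nc), axis=1)

df.drop('b', axis=1, inplace=True)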

Your dataframe is now:

   a    c    d
0  1    1  NaN
1  2    4    3
2  3  5,7  6,8
Valentino
  • Hey, your code worked just as I wanted, but it does not work when I save the same DataFrame to a CSV file and then try the same thing. Can you show me how to do this when the data comes from a CSV file? – chirag prajapati Jun 12 '19 at 07:14
  • Sorry, I don't understand your question. Once you save a DataFrame to a CSV, you get a file. You can't use pandas directly on a file; you need to read its content back into pandas. Maybe something went wrong while writing or reading the file. I suggest you open a new question and explain in detail what you were doing, why it does not work, and what errors you get. Feel free to link this answer if it helps to explain the problem. – Valentino Jun 12 '19 at 09:53
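One possible cause of the CSV problem, offered here as an assumption since the commenter's actual file and code are not shown: a CSV stores every cell as text, so after reading the file back the 'b' column holds strings such as "{'c': 1}" rather than dictionaries and lists, and the isinstance checks in the answer no longer match. A minimal sketch of converting the strings back with ast.literal_eval from the standard library follows. The in-memory buffer stands in for a real file, and the names raw, restored and types_after_read are illustrative.

```python
import ast
import io

import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [{'c': 1}, [{'c': 4}, {'d': 3}], [{'c': 5, 'd': 6}, {'c': 7, 'd': 8}]],
})

# Round-trip through CSV text; a StringIO buffer stands in for a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
raw = pd.read_csv(buffer)

# After reading, every cell of 'b' is a plain string.
types_after_read = raw['b'].map(type).unique().tolist()

# literal_eval parses Python literals (dicts, lists, numbers, strings)
# without executing arbitrary code, unlike eval.
restored = raw.copy()
restored['b'] = restored['b'].apply(ast.literal_eval)
```

Once the column has been restored this way, the code from the answer can be applied to restored unchanged. Cells that are empty or are not valid Python literals would need separate handling before calling literal_eval.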