0

I have a dataframe df. Most of the columns are json strings while some are list of jsons. The preview of sample rows is shown below:

   id     movie     genres  
    1     John      [{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 878, 'name': 'Science Fiction'}]
    2     Mike      [{'id': 28, 'name': 'Action'}]
    3     Jerry      []

As visible above, genres column has different quantity of items.

I want to extract only values from the keys called 'name' and put them into separate columns. So, for example, if there are three 'name' keys then there will need to be 3 separate columns to store the respective values (the 'name' is the genre). So the new columns could be called 'genre1', 'genre2' etc.

I only need 4 columns max for 4 genres only!

I tried this code:

pd.concat([df['genres'].apply(pd.Series)], axis=1)

it gave me the output I didn't need.

The output should be:

    id    movie     genre1     genre2          genre3
    1     John      Action      Adventure     Science Fiction
    2     Mike      Action
    3     Jerry      None
  • 2
    Can you provide some expected output please? Also, it will help having the whole structure of the dataframe provided. – Danail Petrov Jan 02 '21 at 22:56
  • Edited! Hope it helps! – codemunchkin Jan 05 '21 at 01:51
  • Can you provide an example of entries that are 'json strings'? To improve your question and make the example in it reproducible you can take a look here: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – PieCot Jan 06 '21 at 14:45

2 Answers2

0

This code;

import pandas as pd
a = {'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 878, 'name': 'Science Fiction'}, {'id': 10749, 'name': 'Romance'}
b = {'id': 12, 'name': 'Adventure'}, {'id': 878, 'name': 'Science Fiction'}, {'id': 5, 'name': 'XXX'}
df = pd.DataFrame([{'x':1, 'genres':a}, {'x':2, 'genres':b}])

def mini_explode(r):
    genres = [e['name'] for e in r['genres']]
    for i, g in enumerate(genres):
        r[f'genre_{i}'] = g
    del r['genres']
    return r
    
print(df.apply(mini_explode, axis=1))

will result:


     genre_0          genre_1          genre_2  genre_3  x
0     Action        Adventure  Science Fiction  Romance  1
1  Adventure  Science Fiction              XXX      NaN  2

but sure it will be costly and very slow on big data frames. Please consider redesigning your data structure.

armamut
  • 1,087
  • 6
  • 14
0

I assume that the column 'genres' in your input DataFrame is a string with a serialized json, containing a list with a variable number of genres for each row.

In this case, you have deserialize each entry before analyzing it:

genres_deserialized = df['genres'].map(json.loads)

Then you have to add the columns parsing the deserialized data:

for i in range(n_genres):
    df[f'genres{i + 1}'] = genres_deserialized.map(lambda x: x[i]['name'] if i < len(x) else None)

Here an example:

import pandas as pd
import json

# Assume that the column 'genres' contains strings that are serialized json
df = pd.DataFrame(
    data=[
        [1, 'John', json.dumps([{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 878, 'name': 'Science Fiction'}])],
        [2, 'Mike', json.dumps([{'id': 28, 'name': 'Action'}])],
        [3, 'Jerry', json.dumps([])],
    ],
    columns=['id', 'movie', 'genres']
)

# Deserialize json
genres_deserialized = df['genres'].map(json.loads)

# If you want to compute the maximum number of available genres per movie at runtime
# n_genres = genres_deserialized.map(lambda x: len(x)).max()

# Otherwise, you can set the number of genres.
n_genres = 4
# In this case, you should choose which genres must be kept in case they are more than you need. Otherwise, the first n_genres will be selected.
# For example, choose the first n_genres by name in alphabetical order.
# genres_deserialized = genres_deserialized.map(lambda x: sorted(x, key=lambda k: k['name']))

for i in range(n_genres):
    df[f'genres{i + 1}'] = genres_deserialized.map(lambda x: x[i]['name'] if i < len(x) else None)

The previous code produces this output:

   id  movie genres1    genres2          genres3 genres4
0   1   John  Action  Adventure  Science Fiction    None
1   2   Mike  Action       None             None    None
2   3  Jerry    None       None             None    None
PieCot
  • 3,564
  • 1
  • 12
  • 20
  • This code gives an error which says, "string indices must be integers." I have updated the question as well for more clarity. – codemunchkin Jan 06 '21 at 14:29
  • I've updated the answer, trying to guess the types and the format of your input DataFrame. – PieCot Jan 06 '21 at 15:21