How to get varying string splits into columns python pandas?

Question

I am trying to split delimiter-separated strings into separate columns. My data source is ml-latest-small.zip (size: 1 MB) from https://grouplens.org/datasets/movielens/

import pandas as pd

movie_df = pd.read_csv('movies.csv')
movie_df.head(10)

Reading in the file this way gives me the raw dataframe.

I then tried:

movies_df = pd.read_csv('movies.csv', sep='|', encoding='latin-1',
names=['movie_id', 'movie_title','unknown', 'action','adventure', 'animation', 'childrens', 'comedy', 'crime', 'documentary', 'drama', 'fantasy','film_noir', 'horror', 'musical', 'mystery', 'romance', 'sci_fi', 'thriller', 'war', 'western'])
movies_df.head(10)

but this squishes everything before the first separator into the first column, and the first genre of each split also ends up in the first column. Otherwise, it is what I need.

How do I get all my genres, which vary in number from movie to movie, into their own columns after movieId and title? I want each genre to be a column, with NaN where a movie does not have that genre, to set up for creating dummy variables later.

Edit: I did movies_df.head(10).to_dict() and the output was:

 'title': {0: 'Toy Story (1995)',
  1: 'Jumanji (1995)',
  2: 'Grumpier Old Men (1995)',
  3: 'Waiting to Exhale (1995)',
  4: 'Father of the Bride Part II (1995)',
  5: 'Heat (1995)',
  6: 'Sabrina (1995)',
  7: 'Tom and Huck (1995)',
  8: 'Sudden Death (1995)',
  9: 'GoldenEye (1995)'},
 'genres': {0: 'Adventure|Animation|Children|Comedy|Fantasy',
  1: 'Adventure|Children|Fantasy',
  2: 'Comedy|Romance',
  3: 'Comedy|Drama|Romance',
  4: 'Comedy',
  5: 'Action|Crime|Thriller',
  6: 'Comedy|Romance',
  7: 'Adventure|Children',
  8: 'Action',
  9: 'Action|Adventure|Thriller'}}

Could you add the result of `movie_df.head(10).to_dict()` to your question so that we don't have to download the zip file? — Ben Grossmann, Sep 29 '22 at 22:33
I appreciate it, but I meant the raw dataframe when you read it in normally — Ben Grossmann, Sep 29 '22 at 22:45
@BenGrossmann Okay I tried again, I wrote it using the code indicators just so it listed nicely even though it's output — A Mere Pigeon, Sep 29 '22 at 22:52

Ben Grossmann · Accepted Answer · 2022-09-30T01:48:20.533

The following seems to work. That said, ideally it should be changed to avoid iterating through the genres column to build the list of genres, since looping over the rows of a column in Python is slow.

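import numpy as np
import pandas as pd

movie_df = pd.read_csv('movies.csv')
# Collect every distinct genre found in the pipe-separated column.
genre_set = set()
for lst in movie_df['genres'].str.split('|'):
    genre_set.update(lst)
# One column per genre: the genre name where it applies, NaN elsewhere.
for g in genre_set:
    # object dtype so string labels can be assigned without a dtype conflict
    movie_df[g] = pd.Series(np.nan, index=movie_df.index, dtype=object)
    # regex=False matches the genre literally, not as a regular expression
    movie_df.loc[movie_df['genres'].str.contains(g, regex=False), g] = g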

The first for loop can be replaced with the single line genre_set.update(*movie_df['genres'].str.split('|')); I don't believe this changes performance.
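
One possible way to avoid the explicit loop entirely, offered as a sketch and not as a benchmarked improvement: split the column, flatten the resulting lists with Series.explode, and take the unique values. The small frame here is a stand-in built from rows quoted in the question, not the full movies.csv.

```python
import pandas as pd

# Stand-in for movies.csv, using rows quoted in the question.
movie_df = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Heat (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Action|Crime|Thriller'],
})

# Split each string into a list, give every list element its own row,
# then keep only the distinct values.
genre_set = set(movie_df['genres'].str.split('|').explode().unique())
print(sorted(genre_set))
```

The resulting genre_set can be used with the second loop unchanged.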

The resulting frame movie_df looks like this:

                                title  \
0                    Toy Story (1995)   
1                      Jumanji (1995)   
2             Grumpier Old Men (1995)   
3            Waiting to Exhale (1995)   
4  Father of the Bride Part II (1995)   
5                         Heat (1995)   
6                      Sabrina (1995)   
7                 Tom and Huck (1995)   
8                 Sudden Death (1995)   
9                    GoldenEye (1995)   

                                        genres  Action  Romance  Thriller  \
0  Adventure|Animation|Children|Comedy|Fantasy     NaN      NaN       NaN   
1                   Adventure|Children|Fantasy     NaN      NaN       NaN   
2                               Comedy|Romance     NaN  Romance       NaN   
3                         Comedy|Drama|Romance     NaN  Romance       NaN   
4                                       Comedy     NaN      NaN       NaN   
5                        Action|Crime|Thriller  Action      NaN  Thriller   
6                               Comedy|Romance     NaN  Romance       NaN   
7                           Adventure|Children     NaN      NaN       NaN   
8                                       Action  Action      NaN       NaN   
9                    Action|Adventure|Thriller  Action      NaN  Thriller   

   Adventure  Crime  Children  Comedy  Drama  Animation  Fantasy  
0  Adventure    NaN  Children  Comedy    NaN  Animation  Fantasy  
1  Adventure    NaN  Children     NaN    NaN        NaN  Fantasy  
2        NaN    NaN       NaN  Comedy    NaN        NaN      NaN  
3        NaN    NaN       NaN  Comedy  Drama        NaN      NaN  
4        NaN    NaN       NaN  Comedy    NaN        NaN      NaN  
5        NaN  Crime       NaN     NaN    NaN        NaN      NaN  
6        NaN    NaN       NaN  Comedy    NaN        NaN      NaN  
7  Adventure    NaN  Children     NaN    NaN        NaN      NaN  
8        NaN    NaN       NaN     NaN    NaN        NaN      NaN  
9  Adventure    NaN       NaN     NaN    NaN        NaN      NaN
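
If the end goal is dummy variables, it may be simpler to skip the NaN-filled intermediate columns. A minimal sketch, assuming the same pipe-separated genres column: Series.str.get_dummies splits each string on a separator and returns one 0/1 indicator column per distinct value. The sample frame is a stand-in made from rows in the question.

```python
import pandas as pd

# Stand-in for movies.csv, using rows quoted in the question.
movie_df = pd.DataFrame({
    'title': ['Grumpier Old Men (1995)', 'Heat (1995)', 'Sudden Death (1995)'],
    'genres': ['Comedy|Romance', 'Action|Crime|Thriller', 'Action'],
})

# One indicator column per genre: 1 if the movie has it, 0 otherwise.
dummies = movie_df['genres'].str.get_dummies(sep='|')

# Attach the indicator columns next to the original title and genres.
result = movie_df.join(dummies)
print(result)
```

This produces 0/1 indicators in place of genre names and NaN, so it fits only if the indicators themselves are what is needed.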

Is df supposed to be df or is it supposed to be movie_df? When I use it strictly as df, it gives me the "can't find 'df'" error. But when I replace it with movie_df, I get the error "A value is trying to be set on a copy of a slice from a DataFrame" — A Mere Pigeon, Sep 30 '22 at 00:39
Yes, df was supposed to be `movie_df`; should be fixed now. The "error" that you get in the second case should actually be a *warning* rather than an error; in spite of this message, you should find that `movie_df` has the correct form in the end — Ben Grossmann, Sep 30 '22 at 01:40
See [this post](https://stackoverflow.com/q/20625582/2476977) regarding the warning — Ben Grossmann, Sep 30 '22 at 01:42
I've updated the code so that the warning no longer appears. — Ben Grossmann, Sep 30 '22 at 01:48
