I am trying to split pipe-separated strings into separate columns. My data source is ml-latest-small.zip (size: 1 MB) from https://grouplens.org/datasets/movielens/
import pandas as pd
movie_df = pd.read_csv('movies.csv')
movie_df.head(10)
Reading in the file gives me the raw dataframe.
I then tried:
movies_df = pd.read_csv('movies.csv', sep='|', encoding='latin-1',
names=['movie_id', 'movie_title','unknown', 'action','adventure', 'animation', 'childrens', 'comedy', 'crime', 'documentary', 'drama', 'fantasy','film_noir', 'horror', 'musical', 'mystery', 'romance', 'sci_fi', 'thriller', 'war', 'western'])
movies_df.head(10)
but this squashes everything before the first | into the first column, so the movie ID, the title and the first genre all end up there. Otherwise, it is what I need.
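To make the symptom concrete, here is a minimal sketch that reads a few lines in the movies.csv layout with sep='|'. The inline sample text and the column names c0 to c4 are made up for illustration; only the layout (movieId, title, genres separated by commas) is taken from the file.

```python
import io
import pandas as pd

# Hypothetical inline sample mimicking the layout of movies.csv
raw = (
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
)

# With sep='|' the commas are ordinary characters, so the parser never
# separates the id and title from the first genre.
piped = pd.read_csv(
    io.StringIO(raw),
    sep="|",
    names=["c0", "c1", "c2", "c3", "c4"],  # illustrative names only
)

print(piped["c0"].tolist())
```

Because names is passed, the header line is also read as a data row, and the first field of every movie row holds the id, the title and the first genre together.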
How do I turn the genres, which vary in number from movie to movie, into separate columns after movieId and title? I want one column per genre, with NaN where a movie does not have that genre, to set up for creating dummy variables later.
Edit: I ran movies_df.head(10).to_dict() and the output was:
'title': {0: 'Toy Story (1995)',
1: 'Jumanji (1995)',
2: 'Grumpier Old Men (1995)',
3: 'Waiting to Exhale (1995)',
4: 'Father of the Bride Part II (1995)',
5: 'Heat (1995)',
6: 'Sabrina (1995)',
7: 'Tom and Huck (1995)',
8: 'Sudden Death (1995)',
9: 'GoldenEye (1995)'},
'genres': {0: 'Adventure|Animation|Children|Comedy|Fantasy',
1: 'Adventure|Children|Fantasy',
2: 'Comedy|Romance',
3: 'Comedy|Drama|Romance',
4: 'Comedy',
5: 'Action|Crime|Thriller',
6: 'Comedy|Romance',
7: 'Adventure|Children',
8: 'Action',
9: 'Action|Adventure|Thriller'}}
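For reference, a minimal sketch of the target layout, assuming the genres column always holds pipe-separated strings like the ones in the output above. It relies on pandas' Series.str.get_dummies, which is one possible approach and not necessarily the only one. The three-row frame and the names genre_dummies, genre_nan and result are illustrative.

```python
import pandas as pd

# Three rows copied from the to_dict() output above
movie_df = pd.DataFrame({
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One 0/1 indicator column per distinct genre, split on the pipe
genre_dummies = movie_df["genres"].str.get_dummies(sep="|")

# Keep the 1s and turn every 0 into NaN, as asked for in the question
genre_nan = genre_dummies.where(genre_dummies == 1)

# Put the genre columns after the remaining original columns
result = movie_df.drop(columns="genres").join(genre_nan)
print(result)
```

If the end goal is dummy variables anyway, genre_dummies is already in 0/1 form, so the NaN step could be skipped.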