1

Here's the dataset I'm currently trying to work with:

enter image description here

What I'm trying to do is separate all the 'genres' values by the comma, and then create new columns called 'genres_Comedy', 'genres_Drama', 'genres_Family' etc where each row will have the value of 0 or 1 depending on if it is that genre. I'm fairly certain it's possible, I know how to separate columns by deliminator, but I don't know how to generate the necessary columns based on the split strings, or how I would then add each rows correct value (0 or 1) with the newly generated columns.

I've tried to look for solutions, but my problem is a bit specific and I can't find any applicable solutions, though maybe I'm looking wrong. Does anyone know how I can go about accomplishing this? Please let me know if there's any other info I can provide that could be helpful, thanks for reading.

Marcus
  • 261
  • 1
  • 4
  • 14
  • 1
    You're more likely to get an answer if you can provided a sample dataframe in your question instead of just an image of it. – Marcelo Paco Mar 18 '23 at 14:12
  • Does this answer your question? [Splitting a pandas dataframe column by delimiter](https://stackoverflow.com/questions/37333299/splitting-a-pandas-dataframe-column-by-delimiter) – ti7 Mar 18 '23 at 14:13
  • It doesn't unfortunately, I had checked that one out before even, but it only does a part of what I needed. Alan's answer below answers it though and the problem is now solved. Thank you for checking and helping though, I appreciate it. – Marcus Mar 18 '23 at 15:13

1 Answers1

1

Sure, this is very doable, but takes a couple steps.

Step 1

First of all, convert your string column into list of strings. df["genres"] = df.genres.apply(lambda x: x.split(","), axis=1). (Note, if it's possible that genres ends with a , then you'll need to remove that or end up with "" empty string for a genre)

Now the general strategy is to make a new dataframe out of it this list column (with same number of rows, and with columns like Genre_Comedy etc), and concat that back on to the original.

Step 2

Next: create the new frame from the genres column via genres = df.genres.apply(pd.Series).stack(). The apply(pd.Series) converts the list into pd.Series, and then stack() breaks that into a multi-index Series.

Step 3

Next we use get_dummies; this is the usual way of creating multiple columns out of one column. If your column is animal and one row has cat and another has dog, then you'll end up with two new columns; one called animal_cat and the other animal_dog. Same thing here, except now we are working on a multi-index Series. Use it like this: genres = pd.get_dummies(genres). Now you'll have a multi-index dataframe with all your genre columns (Horror, Comedy, etc). Finally just collapse the multi-index via genres = genres.sum(level=0).

Step 4

Finally concat it back on via df = df.concat(genres, axis=1)

Done!

I highly recommend printing the output after each line of code to see how the shape of your data evolves. Very hard to understand without visualising it.

Alan
  • 1,746
  • 7
  • 21
  • There is a method `Series.str.get_dummies` which should be able to do steps one to three at once. – Michael Butscher Mar 18 '23 at 14:51
  • Thank you! This works, though I did have to drop the ````, axis =1```` from step 1, as it was returning: ````TypeError: () got an unexpected keyword argument 'axis'```` and also reformat Step 4 to ````testing_df = pd.concat([testing_df, genres], axis=1)```` as it was otherwise returning: ````AttributeError: 'DataFrame' object has no attribute 'concat'````. But really thank you, I understand the process a lot better now. Also thank you Michael, that's good to know and I'm looking into it now. – Marcus Mar 18 '23 at 15:10