I have a DataFrame that I'm formatting for a scikit-learn PCA. It looks something like this:
datetime | mood | activities | notes
8/27/2017 | "good" | ["friends", "party", "gaming"] | NaN
8/28/2017 | "meh" | ["work", "friends", "good food"] | "Stuff stuff"
8/29/2017 | "bad" | ["work", "travel"] | "Fell off my bike"
...and so on
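For anyone who wants to reproduce this, the sample rows above can be built roughly as follows. This is only a sketch: the dtypes are an assumption (dates parsed to timestamps, activities stored as plain Python lists), and it includes just the three rows shown.

```python
import numpy as np
import pandas as pd

# Three sample rows; the real data has more rows and 26 distinct activities.
df = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["8/27/2017", "8/28/2017", "8/29/2017"], format="%m/%d/%Y"
    ),
    "mood": ["good", "meh", "bad"],
    # Each cell holds a list of that day's activities (assumed storage format).
    "activities": [
        ["friends", "party", "gaming"],
        ["work", "friends", "good food"],
        ["work", "travel"],
    ],
    "notes": [np.nan, "Stuff stuff", "Fell off my bike"],
})

print(df)
```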
I'd like to transform it to this, which I think will be better for ML work:
datetime | mood | friends | party | gaming | work | good food | travel | notes
8/27/2017 | "good" | True | True | True | False | False | False | NaN
8/28/2017 | "meh" | True | False | False | True | True | False | "Stuff stuff"
8/29/2017 | "bad" | False | False | False | True | False | True | "Fell off my bike"
I've already tried the method outlined here, which just gives me a left-justified matrix of all the activities. The columns have no meaning. If I try to pass columns to the DataFrame constructor, I get the error "26 columns passed, passed data had 9 columns". I believe that's because even though I have 26 distinct activities, the most I've ever done in a single day is 9. Is there a way to have it fill with 0/False if an activity isn't found in a particular row? Thanks.
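Here is a minimal reproduction of what I tried, using only the three sample rows (so 6 distinct activities and at most 3 per row, rather than 26 and 9). The variable names are just for illustration. The exception type appears to depend on the pandas version, so the sketch catches both `ValueError` and `AssertionError`.

```python
import pandas as pd

# Activity lists from the three sample rows.
activities = pd.Series([
    ["friends", "party", "gaming"],
    ["work", "friends", "good food"],
    ["work", "travel"],
])

# Expanding the lists gives a left-justified matrix: one column per list
# position (0, 1, 2), not one column per activity.
expanded = pd.DataFrame(activities.tolist())
print(expanded)

# All distinct activities, in order of first appearance.
all_activities = list(dict.fromkeys(a for row in activities for a in row))

# Passing the activity names as columns fails, because the data is only as
# wide as the longest list (3), not the number of distinct activities (6).
error_message = None
try:
    pd.DataFrame(activities.tolist(), columns=all_activities)
except (ValueError, AssertionError) as exc:
    error_message = str(exc)

print(error_message)
```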