1

I have a DataFrame I'm formatting for a scikit-learn PCA that looks something like this:

datetime  | mood   | activities                       | notes
8/27/2017 | "good" | ["friends", "party", "gaming"]   | NaN
8/28/2017 | "meh"  | ["work", "friends", "good food"] | "Stuff stuff"
8/29/2017 | "bad"  | ["work", "travel"]               | "Fell off my bike"

...and so on

I'd like to transform it to this, which I think will be better for ML work:

datetime  | mood   | friends | party | gaming | work  | good food | travel | notes
8/27/2017 | "good" | True    | True  | True   | False | False     | False  | NaN
8/28/2017 | "meh"  | True    | False | False  | True  | True      | False  | "Stuff stuff"
8/29/2017 | "bad"  | False   | False | False  | True  | False     | True   | "Fell off my bike"

I've already tried the method outlined here, which just gives me a left-justified matrix of all the activities; the columns have no meaning. If I try to pass columns to the DataFrame constructor, I get the error "26 columns passed, passed data had 9 columns". I believe that's because even though I have 26 discrete events, the most I've ever done in a single day is 9. Is there a way to have it fill with 0/False if the column isn't found in that particular row? Thanks.
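
For reference, a minimal sketch (with made-up toy data) of the behaviour described above: building a DataFrame straight from the ragged activity lists gives left-justified positional columns, and passing the full list of activity names as columns fails because each row only supplies as many values as that day's activities:

import pandas as pd

# toy stand-in for the activity lists (hypothetical data)
activities = [["friends", "party", "gaming"],
              ["work", "friends", "good food"],
              ["work", "travel"]]

# "works", but the columns are just positions 0, 1, 2 with no meaning
left_justified = pd.DataFrame(activities)

# fails with something like "6 columns passed, passed data had 3 columns",
# since each row only has as many entries as that day's activities
all_activities = ["friends", "party", "gaming", "work", "good food", "travel"]
# pd.DataFrame(activities, columns=all_activities)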

sawyermclane
  • Before you turn it into a dataframe (or try to), what's the structure of the data? 3 lists and a list-of-lists (for activities)? – tel Dec 18 '18 at 02:20
  • Well I just got flamed for writing a Perl script to do this (bad me! How dare I!). Okay, now censored. Anyway it's very trivial, even in plain base code: construct a hash (oops, dictionary), loop around the keys (I'm okay with that word, right?), and place an "if" statement inside (see the sketch after these comments). – M__ Dec 18 '18 at 02:42
  • @tel it's a csv export from a mental health app. I've had to manually do something like this before, with another similarly formatted file. It just seems like something that either pandas or sklearn might have shortcuts for. – sawyermclane Dec 18 '18 at 02:53
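
A minimal pure-Python sketch of the dictionary-and-loop approach M__ describes above, assuming the rows have already been parsed (the data below is made up for illustration):

# hypothetical parsed rows: (datetime, mood, activities, notes)
rows = [
    ("8/27/2017", "good", ["friends", "party", "gaming"], None),
    ("8/28/2017", "meh",  ["work", "friends", "good food"], "Stuff stuff"),
    ("8/29/2017", "bad",  ["work", "travel"], "Fell off my bike"),
]

# every activity that ever occurs becomes a column
all_activities = sorted({act for _, _, acts, _ in rows for act in acts})

# one dict per row: True if the activity happened that day, else False
one_hot_rows = []
for date, mood, acts, notes in rows:
    record = {"datetime": date, "mood": mood, "notes": notes}
    for activity in all_activities:
        record[activity] = activity in acts   # the "if" check per column
    one_hot_rows.append(record)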

2 Answers

6

You can simply use `get_dummies`.

Let's assume this dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'datetime': pd.date_range('2017-08-27', '2017-08-29'),
                   'mood': ['good', 'meh', 'bad'],
                   'activities': [['friends', 'party', 'gaming'],
                                  ['work', 'friends', 'good food'],
                                  ['work', 'travel']],
                   'notes': [np.nan, 'stuff stuff', 'fell off my bike']})
df.set_index(['datetime'], inplace=True)

            mood      activities                notes
datetime            
2017-08-27  good    [friends, party, gaming]    NaN
2017-08-28  meh     [work, friends, good food]  stuff stuff
2017-08-29  bad     [work, travel]              fell off my bike

Just `concat` and `get_dummies`:

df2 = pd.concat([df[['mood','notes']], pd.get_dummies(df['activities'].apply(pd.Series),
                                                      prefix='activity')], axis=1)


            mood    notes   activity_friends    activity_work   activity_friends    activity_party  activity_travel activity_gaming activity_good food
datetime                                    
2017-08-27  good    NaN             1               0                 0                 1                   0                   1                   0
2017-08-28  meh     stuff stuff     0               1                 1                 0                   0                   0                   1
2017-08-29  bad    fell off my bike 0               1                 0                 0                   1                   0                   0

You can change them to booleans if you want using loc:

df2.loc[:,df2.columns[2:]] = df2.loc[:,df2.columns[2:]].astype(bool)
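
Note that because `get_dummies` is applied per position here, the same activity can show up in more than one activity_* column (you can see two activity_friends columns in the output above). A minimal sketch for collapsing those duplicates into one boolean column per activity (same idea as the groupby suggested in the comments below), assuming the df2 built above:

# boolean mask over the activity_* columns produced above
act_mask = df2.columns.str.startswith('activity_')

# group the transposed activity columns by name and take the max,
# so an activity is True if it appears in any position that day
acts_combined = df2.loc[:, act_mask].T.groupby(level=0).max().T.astype(bool)

df3 = pd.concat([df2[['mood', 'notes']], acts_combined], axis=1)
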
It_is_Chris
  • I am getting an error while executing the above code: KeyError: "['mood' 'notes'] not in index" – LOrD_ARaGOrN Dec 18 '18 at 08:47
  • What does df.columns show? Make sure it is indeed ‘mood’ and not something like ‘mood ‘ – It_is_Chris Dec 18 '18 at 12:07
  • I think this is incorrect, for example if you look at the column "activity_friends", there should be a 1 for "good" and "meh". – kdd Feb 15 '20 at 18:50
  • @kdd no, it is correct; there are just two `activity_friends ` columns. If you want them concatenated simply `df2.groupby(df2.columns, axis=1).sum()` – It_is_Chris Feb 17 '20 at 17:38
  • Oh, I'm blind... I swear I looked for that. Thanks! – kdd Feb 17 '20 at 21:44
2

Here's a complete solution, parsing of the messy output and all:

from ast import literal_eval
from io import StringIO

import numpy as np
import pandas as pd

# the raw data

d = '''datetime |  mood |  activities |  notes

8/27/2017 |  "good" | ["friends", "party", "gaming"] | NaN

8/28/2017 |  "meh" |  ["work", "friends", "good food"] | "Stuff stuff"

8/29/2017 |  "bad" |  ["work", "travel"] |  "Fell off my bike"'''

# parse the raw data
df = pd.read_csv(StringIO(d), sep=r'\s*\|\s*', engine='python')

# parse the lists of activities (which are still strings)
acts = df['activities'].apply(literal_eval)

# get the unique activities
actcols = np.unique([a for al in acts for a in al])

# assemble the desired one hot array from the activities
actarr = np.array([np.in1d(actcols, al) for al in acts])
actdf = pd.DataFrame(actarr, columns=actcols)

# stick the dataframe with the one hot array onto the main dataframe
df = pd.concat([df.drop(columns='activities'), actdf], axis=1)

# fancy print
with pd.option_context("display.max_columns", 20, 'display.width', 9999):
    print(df)

Output:

    datetime    mood               notes  friends  gaming  good food  party  travel   work
0  8/27/2017  "good"                 NaN     True    True      False   True   False  False
1  8/28/2017   "meh"       "Stuff stuff"     True   False       True  False   False   True
2  8/29/2017   "bad"  "Fell off my bike"    False   False      False  False    True   True
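
Since the OP mentioned scikit-learn, the one-hot step above (the actcols/actarr/actdf part) could also be done with sklearn's MultiLabelBinarizer; a minimal alternative sketch, assuming the acts series parsed above, with the rest of the answer unchanged:

from sklearn.preprocessing import MultiLabelBinarizer

# one column per unique activity, True where that activity occurs that day
mlb = MultiLabelBinarizer()
actdf = pd.DataFrame(mlb.fit_transform(acts),
                     columns=mlb.classes_).astype(bool)
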
tel
  • This is it. I was specifically looking for something like `actrows = np.array([np.in1d(actdf.columns.values, a) for a in acts]); actdf = pd.DataFrame(actrows, columns=actdf.columns)` – sawyermclane Dec 18 '18 at 03:55
  • For some reason I get an error in _acts = df['activities'].apply(pd.eval)_, and I don't know what al is in the line actcols = np.unique([a for al in acts for a in al]) – user1889297 Dec 18 '18 at 06:01
  • Odd. You'll probably be able to fix your issue by using the builtin `eval` instead of `pd.eval`. I actually edited the answer from one to the other, since `pd.eval` should be a little more limited/safer while still getting the job done. If I were you, I would also check the version of my `pandas` package. It may be that upgrading will fix the bug (the latest version is `v0.23.4`) – tel Dec 18 '18 at 06:04
  • The line that creates `actcols` is an example of a [nested list comprehension](https://spapas.github.io/2016/04/27/python-nested-list-comprehensions/). `acts` is a list-of-lists (really a `pd.Series`-of-lists, but whatever), `al` is a single sublist, and `a` is a single activity. – tel Dec 18 '18 at 06:08