1

I have this DataFrame in Pandas:

                      Age InterventionType PrimaryPurpose
0         [Adult, Senior]           Device    Treatment
1  [Child, Adult, Senior]             Drug    Basic Science
2         [Adult, Senior]             Drug    Treatment
3         [Adult, Senior]              NaN            NaN
4         [Adult, Senior]            Other            NaN

And want to to pivot on InterventionType, such that I get:

Drug     Adult     2
         Senior    2
         Child     1
Device   Adult     1
         Senior    1
Other    Adult     1
         Senior    1

How do I accomplish this? Also, is it non-standard for my DataFrame to have lists? If so, what is a good practice to "de-list" the lists?

  • To de-list the lists, you should try to restructure your data into separate rows. For instance, your first row should become two rows, one with values of "Adult", "Device", and "Treatment", and one with values of "Senior", "Device", and "Treatment". You'll then have an easier time doing what you want here. – BrenBarn Jan 25 '16 at 02:26
  • I agree with BrenBarn, unfortunately this is not a super straightforward thing in pandas (there needs to be an "explode" method on dataframes). take a look at this question for ideas http://stackoverflow.com/questions/32468402/how-to-explode-a-list-inside-a-dataframe-cell-into-separate-rows – maxymoo Jan 25 '16 at 02:33
  • 1
    Thanks for the feedback. Issue is I have a lot more columns with many, many options in each column. So dividing into separate rows would explode the size of the DataFrame by many orders of magnitude (and for context, I currently have 300,000 rows of data in my data frame). Is that still the only approach? – Homan Mohammadi Jan 25 '16 at 03:02

2 Answers2

1

You can use explode() followed by groupby()

import numpy as np
import pandas as pd

Age = [["Adult", "Senior"],
        ["Child", "Adult", "Senior"],
        ["Adult", "Senior"],
        ["Adult", "Senior"],
        ["Adult", "Senior"]]
InterventionType = ["Device", "Drug", "Drug", np.NaN, "Other"]

PrimaryPurpose = ["Treatment", "Basic Science", "Treatment", np.NaN, np.NaN]

df = pd.DataFrame({"Age": Age,
                   "InterventionType": InterventionType,
                   "PrimaryPurpose": PrimaryPurpose})

df1 = df.explode("Age")


df1.groupby(["InterventionType", "Age"])[["InterventionType"]].count()\
    .rename(columns = {"InterventionType": "Count"})
Out[29]: 
                         Count
InterventionType Age          
Device           Adult       1
                 Senior      1
Drug             Adult       2
                 Child       1
                 Senior      2
Other            Adult       1
                 Senior      1
hnagaty
  • 796
  • 5
  • 13
0

Assuming that your DataFrame is called df, you can do the following,

pd.DataFrame(
    df["Age"].values.tolist(),
    index=df["InterventionType"]
).stack().reset_index().drop("level_1", axis=1).groupby(["InterventionType", 0]).size()

Will return the following,

InterventionType  0
Device            Adult     1
                  Senior    1
Drug              Adult     2
                  Child     1
                  Senior    2
Other             Adult     1
                  Senior    1
dtype: int64

This is a Pandas Series with a MultiIndex. The index names from the code above are "InterventionType" and 0, however these can be easily changed.