Suppose I have this pandas dataframe,

         pC  Truth
0  0.601972      0
1  0.583300      0
2  0.595181      1
3  0.418910      1
4  0.691974      1

'pC' is the predicted probability that 'Truth' is 1, and 'Truth' is a binary value. I want to plot a histogram of 'pC' in which each bin's bar is split to show the proportion of Truth == 0 versus Truth == 1 among the observations that fall in that bin.

I tried the following,

df[['pC','Truth']].plot(kind='hist',stacked=True)

That just treats 'Truth' as a second variable and plots its values (0 and 1) on the same axis as 'pC'.

Reproducible:

import numpy as np
import pandas as pd

shape = 1000
df_t = pd.DataFrame({'pC': np.random.rand(shape),
                     'Truth': np.random.choice([0, 1], size=shape)})
df_t['factor'] = pd.cut(df_t.pC, 5)

How do I do this? Thanks

Napitupulu Jon
  • **Post reproducible code**, use e.g. `dput(df)` – smci Sep 03 '15 at 07:35
  • I don't understand the question. How is each value of pC the probability of Truth being 1? What does each row signify? a cohort? a sample? a person? What would stacking the rows signify? – smci Sep 03 '15 at 07:37
  • You seriously have a third column, which always is 0 and hence adds no information at all? – HeinzKurt Sep 03 '15 at 07:41
  • I think the output already tells you pretty much everything about df. You can copy it and create the dataframe with `pd.read_clipboard()`. Each observation is a person with a 'Truth' value of 1 or 0. 1 is pretty rare; I just copy-pasted `df.head()`. I'll update the code. There is no third column; if you mean the first one, that's the index. – Napitupulu Jon Sep 03 '15 at 07:45
  • @NapitupuluJon: no the output tells us nothing, and you're needlessly making it painful to reproduce. Your dataframe snippet is surely truncated, because it only contains Truth=0 entries, not Truth=1. You need to post a snippet with both Truth=0 and 1 values. Again, use `dput(df)` and post us a snippet of that. If you refuse to [post reproducible code](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), which is the basic courtesy for posting here, this question should and will be closed. – smci Sep 03 '15 at 08:22
  • That's before we get to decoding *"I want to create a histogram of the probability, and inside of each bin(?) will be the proportion 0 vs proportion 1"*. What does that mean? It sounds like you want a hybrid of a traditional histogram with probabilities binned (horizontally, as usual), but also the complementary probability (1-pC) for that bin stacked on top of each bin's probability (in stacked bar-chart manner); presumably with a white fill color. But *inside each bin* is a totally ambiguous phrase and you could mean other things. – smci Sep 03 '15 at 08:28
  • If that's what you want, I guess you could use something like `melt` to also add to the dataframe those complementary probabilities `(1-pC) for Truth==1` for each bin. Then plot as a stacked bar-chart (not as a histogram). – smci Sep 03 '15 at 08:30

2 Answers

Based on my interpretation of what I think you meant:

  • create traditional histogram counts with probabilities binned (horizontally, as usual), for some binsize. Just for the dataframe as is with Truth==0
  • now augment that dataframe with complementary probability values (1-pC) for that bin for Truth==1
  • now plot the augmented df as a stacked bar-chart (presumably with a white fill color for the complementary Truth==1 bar-segments)

If you post reproducible code (use dput) and also confirm this is what you want I will post code. Otherwise, post a link to some image showing what you want.

smci
  • I'm not familiar with dput, but I will make it reproducible. A histogram usually needs only one variable, and I'm using 'pC' for that. Should I use pd.cut and convert it to a bar chart? It's not the proportion of pC vs (1-pC); rather, each bin should show the proportion of Truth 0 vs 1. In a histogram each bin contains multiple observations, and for each bin I want the proportion of 'Truth' 0 vs 1. Does this make sense? – Napitupulu Jon Sep 03 '15 at 09:01
  • Give me the code already. Paste the output of `dput(df)` in your question. Then we can kick around a few versions. – smci Sep 03 '15 at 09:46

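Solved this with:

import numpy as np
import pandas as pd

shape = 1000
df_t = pd.DataFrame({'pC': np.random.rand(shape),
                     'Truth': np.random.choice([0, 1], size=shape)})
# Cut pC into 5 equal-width bins
df_t['factor'] = pd.cut(df_t.pC, 5)
# Count the observations per (bin, Truth) pair
df_p = (df_t[['factor', 'Truth']]
        .pivot_table(columns='Truth', index='factor', aggfunc=len, fill_value=0)
        .reset_index())
# One stacked bar per bin, split into Truth == 0 and Truth == 1
df_p[['factor', 0, 1]].plot(kind='bar', stacked=True, x='factor')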
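
The bars above are stacked counts. If proportions within each bin are wanted (as the question puts it), one possible variation is to normalise each bin's row so it sums to 1. The following is only a sketch under a few assumptions: it swaps in `pd.crosstab` with `normalize='index'` instead of `pivot_table`, uses a seeded NumPy generator so the result is repeatable, and the name `props` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator (an assumption, only so the sketch is repeatable)
rng = np.random.default_rng(0)
shape = 1000
df_t = pd.DataFrame({'pC': rng.random(shape),
                     'Truth': rng.integers(0, 2, size=shape)})

# Cut pC into 5 equal-width bins, as in the code above
df_t['factor'] = pd.cut(df_t['pC'], 5)

# Cross-tabulate bin against Truth; normalize='index' divides each row
# by its total, so every bin's two values sum to 1
props = pd.crosstab(df_t['factor'], df_t['Truth'], normalize='index')

# Stacked bars: every bar has height 1, split by the share of Truth 0 vs 1
ax = props.plot(kind='bar', stacked=True)
ax.set_xlabel('pC bin')
ax.set_ylabel('proportion of Truth')
```

Dropping `normalize='index'` gives the raw counts per bin instead, which is the same kind of table the `pivot_table` call is meant to produce.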
Napitupulu Jon