
I have a dataframe called 'games':

Game_id Goals   P_value
   1      2      0.4
   2      3      0.321
   45     0      0.64

I need to split the P value into steps of 0.05, bin the rows by P value, and then create a line graph that shows the sum of goals per P-value bin.

What I currently have:

games.set_index('P_value', inplace=True)
games.sort_index(inplace=True)  # sort_index() returns a copy unless inplace=True
np.cumsum(games['Goals']).plot()

But I get this:

[image: the resulting plot]

No matter what I tried, I couldn't group the P values and show the sum of goals per P value. I also tried to use matplotlib.pyplot, but then I couldn't use the cumsum function.

Chen
  • Please provide a [minimal working example](https://stackoverflow.com/help/mcve). What data do you have as input, and what output data do you want to plot? What solutions have you already tried? – kvoki Jul 10 '18 at 06:13
  • @black_fm - Added, thank you – Chen Jul 10 '18 at 06:31
  • It is still unclear to me what you want to see. From your current description, you do not want a cumulative sum, but rather to bin rows by P value and get the sum of `goals` within each bin? – kvoki Jul 10 '18 at 06:43
  • @black_fm You are right, the question itself was wrong. Indeed I want to bin the rows per P value. – Chen Jul 10 '18 at 06:45
  • Possible duplicate of [Binning column with python pandas](https://stackoverflow.com/questions/45273731/binning-column-with-python-pandas) – kvoki Jul 10 '18 at 06:55

1 Answer


If I understood you correctly, you want discrete steps in the p-value of width 0.05 and the sum of goals within each step?

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# create some random example data
df = pd.DataFrame({
    'goals': np.random.poisson(3, size=1000),
    'p_value': np.random.uniform(0, 1, size=1000)
})

# define binning in p-value
bin_edges = np.arange(0, 1.025, 0.05)
bin_center = 0.5 * (bin_edges[:-1] + bin_edges[1:])
bin_width = np.diff(bin_edges)

# find the p_value bin each row belongs to
# 0 is the underflow bin, len(bin_edges) is the overflow bin
df['bin'] = np.digitize(df['p_value'], bins=bin_edges)

# get the number of goals per p_value bin
goals_per_bin = df.groupby('bin')['goals'].sum()
print(goals_per_bin)

# not every bin might be filled, so we use pandas index
# matching to align the sums with the full list of bins
binned = pd.DataFrame({
    'center': bin_center,
    'width': bin_width,
}, index=np.arange(1, len(bin_edges)))

# bins without any rows are missing from goals_per_bin; fill them with 0
binned['goals'] = goals_per_bin.reindex(binned.index, fill_value=0)


plt.step(
    binned['center'],
    binned['goals'],
    where='mid',
)
plt.xlabel('p-value')
plt.ylabel('goals')
plt.show()
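
As a hedged alternative to the `np.digitize` approach above, the same binning could probably be sketched with `pd.cut`, which labels each row with its interval directly. The snippet below uses only the three rows shown in the question; the column name `p_bin` and the choice of half-open intervals `[left, right)` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# the three rows shown in the question (column names as in its table)
games = pd.DataFrame({
    "Game_id": [1, 2, 45],
    "Goals": [2, 3, 0],
    "P_value": [0.4, 0.321, 0.64],
})

# 21 edges give 20 bins of width 0.05 covering [0, 1]
bin_edges = np.linspace(0, 1, 21)

# label each row with the half-open interval [left, right) it falls into
# ("p_bin" is a hypothetical column name)
games["p_bin"] = pd.cut(games["P_value"], bins=bin_edges, right=False)

# sum the goals per bin; observed=False keeps empty bins, which sum to 0
goals_per_bin = games.groupby("p_bin", observed=False)["Goals"].sum()

# bin centers, usable as x values for a line or step plot
bin_centers = [interval.mid for interval in goals_per_bin.index]
```

Passing `bin_centers` and `goals_per_bin.values` to `plt.plot` or `plt.step` should then draw the sum of goals per p-value bin. Note that with `right=False` a p-value of exactly 1.0 would fall outside the last bin.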
MaxNoe