I have a scatter plot of Volume (x-axis) against Price (dMidP, y-axis). I want to divide the x-axis into 30 evenly spaced sections over the entire range, average the values in each section, and plot the averages, i.e. the red dots.

However, with 30 bins the plotted line only covers a small part of the x range.

When I increase the number of bins to 100, the line covers more of the range but is less smooth.

With 500 bins the range changes again.

Do you know why the x range is changing?

------------------update-----------------------------------------

code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# TradeNa is my DataFrame with 'Volume' and 'dMidP' columns
df = pd.DataFrame({'X' : np.log(TradeNa['Volume']), 'Y' : TradeNa['dMidP']})
data_cut = pd.cut(df.X, np.linspace(df.X.min(), df.X.max(), 30))   # cut the data into bins with evenly spaced edges
grp = df.groupby(by = data_cut)        # group the data by bin

ret = grp.aggregate('mean')            # aggregate (mean) each bin

plt.loglog(np.log(TradeNa['Volume']),TradeNa['dMidP'],'o')
plt.loglog(ret.X,ret.Y,'r-')

plt.show()

bing

1 Answer

pd.cut(df.X,bins) splits your data into roughly equal chunks.

I think for what you want, you need to do pd.cut(df.X, np.linspace(df.X.min(), df.X.max(), 30)) instead.
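As a rough illustration of the equal-width binning described above, here is a minimal sketch on synthetic data. The helper name `binned_mean` and the generated `x`/`y` arrays are made up for this example and are not from the question. Note that `n` bins need `n + 1` edges, so `np.linspace(..., 30)` actually yields 29 bins.

```python
import numpy as np
import pandas as pd

def binned_mean(x, y, n_bins=30):
    """Average X and Y inside n_bins equal-width bins spanning the full range of x.

    Hypothetical helper for illustration only.
    """
    df = pd.DataFrame({"X": np.asarray(x, dtype=float),
                       "Y": np.asarray(y, dtype=float)})
    # n_bins bins require n_bins + 1 evenly spaced edges
    edges = np.linspace(df["X"].min(), df["X"].max(), n_bins + 1)
    # include_lowest=True keeps the minimum value inside the first bin
    bins = pd.cut(df["X"], edges, include_lowest=True).rename("bin")
    # observed=True drops bins that contain no points
    return df.groupby(bins, observed=True).mean()

# Synthetic data standing in for the real volume/price columns
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 1000)
y = 2.0 * x + rng.normal(0.0, 1.0, 1000)

ret = binned_mean(x, y, n_bins=30)
```

`ret["X"]` and `ret["Y"]` can then be plotted over the scatter. Keep in mind that on log axes any non-positive mean cannot be drawn, which may make the averaged line appear to cover a smaller range than the data.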

Ryan Tam
  • Please do elaborate. – Ryan Tam Sep 09 '17 at 10:28
  • thanks for your reply. please see the update of the question – bing Sep 09 '17 at 15:03
  • I see what you mean now, if you want the cut in the log space then you need to take the logs in the arguments of the `np.linspace`. So something like `pd.cut(df.X, np.linspace(np.log10(df.X.min()), np.log10(df.X.max()), 30))`. – Ryan Tam Sep 09 '17 at 18:29
  • thanks, but this code gives me an error saying that 'Bin edges must be unique: e: array([ nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 1.81016713]). You can drop duplicate edges by setting the 'duplicates' kwarg' – bing Sep 09 '17 at 19:14
  • Your `df.X.min()` is less than 0, which is why it failed. (It also means you should not plot things like that in log scale.) You can "fix" this by shifting your X column or clipping it, but I am not sure if it's reasonable in the context of your problem. – Ryan Tam Sep 09 '17 at 19:39
  • thanks a lot, please see https://stackoverflow.com/questions/46102343/edit-binning-data-scatter-plot-in-python for the update of the question – bing Sep 10 '17 at 20:43
  • Hi Ryan, may I ask how can you shift the X column? – bing Sep 12 '17 at 09:07
  • `df['X'] = df['X'] - df['X'].min() + 1` to offset the column such that the new minimum is 1. – Ryan Tam Sep 12 '17 at 10:56
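
The shift discussed in the comments can be sketched as follows. This is only an illustration on made-up values; the helper name `shift_to_min_one` is hypothetical, and whether shifting is reasonable for the actual data remains an open question, as noted above. Once every value is positive, log-spaced bin edges can be built with `np.geomspace`.

```python
import numpy as np
import pandas as pd

def shift_to_min_one(s):
    """Shift a Series so that its smallest value becomes exactly 1 (hypothetical helper)."""
    return s - s.min() + 1.0

# Made-up values including negatives, as happens after taking logs of volumes below 1
s = pd.Series([-2.5, 0.0, 1.0, 4.0])
shifted = shift_to_min_one(s)

# With strictly positive values, edges evenly spaced in log space are well defined
edges = np.geomspace(shifted.min(), shifted.max(), 5)
bins = pd.cut(shifted, edges, include_lowest=True)
```

Subtracting the minimum (rather than adding it) is what moves a negative minimum up to 1.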