I'm trying to make some basic plots so I can better understand what is happening in my data. Currently I have 4 variables, each with 200*387 data points. I've stored everything in a 3D array, with the 3rd dimension indexing the different variables associated with the data.

So far I have produced some scatterplots of var1 vs. var2. However, I would like to add a conditional mean curve on top of the scatterplot, i.e. the average var1 (y-axis) value for any given var2 (x-axis) value. I am quite new to Python, so I am pretty sure that the approach I have in mind is far from the most efficient.

What I'm thinking at the moment is that I can vectorise the data for each variable (i.e. make it 1D) and then create bins of var2 of some reasonable size and then find the average of var1 for each of these bins. I store these averages in some new vector and then plot that.
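
A minimal NumPy sketch of that idea might look like the following. The random `data` array, the assignment of slices 0 and 1 to var1 and var2, and the choice of 20 bins are placeholders for illustration, not the real data set.

```python
import numpy as np

# Placeholder standing in for the real 200 x 387 x 4 array (assumption).
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 387, 4))

# Flatten the two variables of interest to 1D.
var1 = data[:, :, 0].ravel()  # y-axis variable
var2 = data[:, :, 1].ravel()  # x-axis variable

# Equal-width bins spanning the full range of var2.
n_bins = 20
bin_edges = np.linspace(var2.min(), var2.max(), n_bins + 1)

# Per-bin sum of var1 and per-bin point count; their ratio is the bin mean.
sums, _ = np.histogram(var2, bins=bin_edges, weights=var1)
counts, _ = np.histogram(var2, bins=bin_edges)
with np.errstate(invalid="ignore", divide="ignore"):
    manual_means = sums / counts  # NaN wherever a bin has no points
```

Using `np.histogram` twice avoids an explicit Python loop over bins, which matters little at this data size but keeps the code short.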

Is this a very stupid way of doing this? From what I've searched it seems like pandas may have a simple way of doing this but given how new to Python I am I'm also not sure if going straight to pandas would be overkill.
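
For comparison, a pandas version of the same binning might look roughly like this. It is only a sketch: the synthetic `var1` and `var2` arrays, the column names, and the 20 bins are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic 1D data standing in for the flattened variables (assumption).
rng = np.random.default_rng(0)
var2 = rng.uniform(0.0, 10.0, size=1000)        # x-axis variable
var1 = 2.0 * var2 + rng.normal(size=var2.size)  # y-axis variable

df = pd.DataFrame({"var1": var1, "var2": var2})

# Assign each row to one of 20 equal-width var2 bins.
df["var2_bin"] = pd.cut(df["var2"], bins=20)

# Mean of var1 within each bin; observed=False keeps empty bins (as NaN).
pandas_means = df.groupby("var2_bin", observed=False)["var1"].mean()

# Interval midpoints, usable as x-coordinates when plotting the curve.
bin_mids = np.array([interval.mid for interval in pandas_means.index])
```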

Thank you in advance for any and all responses!

  • It is not clear to me exactly what you're trying to achieve. Perhaps a tangible example would help, preferably as code that could be copied to play around with. Regarding your Pandas question - it is really *not* overkill to use Pandas. It is much friendlier to beginners than Numpy (and built upon it). It could help you e.g. give names to your variables and set them in a 2D multi-index instead of a 3D array, which could simplify things for you massively. – Shovalt Aug 10 '20 at 06:11
  • perhaps this [question](https://stackoverflow.com/q/16343752/6692898) and [numpy.where](https://numpy.org/doc/stable/reference/generated/numpy.where.html) can help – RichieV Aug 10 '20 at 07:40
  • What you need is an MCVE. We can fix bad code. We can't fix vague descriptions. – Mad Physicist Aug 11 '20 at 01:41

1 Answer

Thank you for the responses. Re-reading my question I've realised that it was pretty poorly worded, so my apologies for that.

I found my solution, and it was pretty simple in the end. There was no need to use pandas and change the data type from arrays to dataframes. I ended up just using the binned_statistic function from scipy.stats. My code was effectively just:

    from scipy import stats

    n_bins = 80
    # Mean of var1 within each of n_bins equal-width bins of var2
    cond_means, bin_edges, binnumber = stats.binned_statistic(var2, var1, statistic='mean', bins=n_bins)

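Here, again, var2 is the independent (x-axis) variable and var1 is the dependent (y-axis) variable. Both need to be 1D arrays of the same length, so flatten them first if they are slices of a larger array.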
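
To make the argument order concrete, here is a tiny worked example. The four data points and the explicit `range=(0.0, 1.0)` are made up for illustration so that the bin means can be checked by hand.

```python
import numpy as np
from scipy import stats

# Toy data (assumption): two points in each half of the interval [0, 1].
x = np.array([0.1, 0.2, 0.6, 0.9])    # independent variable, gets binned
y = np.array([1.0, 3.0, 10.0, 20.0])  # dependent variable, gets averaged

# x comes first, then the values to summarise within each x-bin.
result = stats.binned_statistic(x, y, statistic="mean", bins=2, range=(0.0, 1.0))

toy_means = result.statistic   # bin means: 2.0 and 15.0
toy_edges = result.bin_edges   # edges: 0.0, 0.5, 1.0
toy_bins = result.binnumber    # 1-based bin index of each point: 1, 1, 2, 2
```

Swapping `x` and `y` would bin by the dependent variable instead, which silently gives a different curve, so it is worth checking on a small case like this.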

For anyone who is also interested in using this for conditional mean plots, be aware that binned_statistic returns bin edges, not bin centres. This means that bin_edges will always have one more element than cond_means. An easy fix for this is:

    # Valid because an integer `bins` gives equal-width bins
    bin_width = bin_edges[1] - bin_edges[0]
    bin_centres = bin_edges[1:] - bin_width/2
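
As a quick check on a toy set of edges (values made up for illustration), shifting the right-hand edges back by half a bin width gives the same centres as averaging neighbouring edges. The averaging form does not rely on a single bin width, so it would also hold if the bins were not equal-width.

```python
import numpy as np

# Toy edges (assumption): 5 edges define 4 equal-width bins on [0, 1].
bin_edges = np.linspace(0.0, 1.0, 5)

# Form 1: shift the right-hand edges back by half a bin width.
bin_width = bin_edges[1] - bin_edges[0]
shifted = bin_edges[1:] - bin_width / 2

# Form 2: midpoint of each pair of neighbouring edges.
bin_centres = 0.5 * (bin_edges[:-1] + bin_edges[1:])

print(bin_centres)  # [0.125 0.375 0.625 0.875]
```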

You should now be able to plot your conditional mean simply as:

    import matplotlib.pyplot as plt

    fig1 = plt.figure()
    plt.scatter(var2, var1, color='blue', label='raw data')
    plt.plot(bin_centres, cond_means, color='black', label='Conditional mean')
    plt.legend()
    plt.xlabel('var2')
    plt.ylabel('var1')
    plt.show()