I have a large dataset (50k rows) and I want to create a histogram of the data, with density on the y-axis and log-scaled values on the x-axis, and with a KDE plot superimposed.

This is a very small subset of the data being used:

A       B    C
1       1   4200
1       4   94000
1       4   81000
1       3   30000
1       3   29000
1       1   20400


Current code:
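import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

columns = ['A', 'B', 'C']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=columns)

data = df[['C']].dropna().values
data = np.logspace(data)  # this line raises the TypeError quoted below
plt.hist(data, bins='auto')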

I currently get the following error: `logspace() missing 1 required positional argument: 'stop'`. When I don't use logspace I am able to get a histogram, but not the one I am looking for. I am very new to Python, so any help is appreciated.

user11861166

1 Answer

np.logspace works like np.linspace, except that the array it creates is evenly spaced in log space. It takes a start and a stop value, which are the exponents of the first and last elements (base 10 by default); you can check the documentation for details. It does not take the log of your data. You will want to use np.log for that.
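As a minimal sketch of that difference (the array `c` reuses the column C values from the question, and `np.log10` is used here as the base-10 counterpart of `np.log` so that it matches the default base of `np.logspace`):

```python
import numpy as np

# np.logspace(start, stop, num) returns base**start ... base**stop
# (base 10 by default), so start and stop are exponents, not data values.
edges = np.logspace(0, 3, num=4)  # 10**0, 10**1, 10**2, 10**3
print(edges)

# Column C values from the sample in the question.
c = np.array([4200, 94000, 81000, 30000, 29000, 20400], dtype=float)

# Taking the log of the data is a separate, element-wise operation.
log_c = np.log10(c)
print(log_c.round(2))
```

So `np.logspace` generates new numbers on a log-spaced grid, while `np.log10` (or `np.log`) transforms the numbers you already have.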

BenT
  • Thanks for the explanation; let me rephrase what I am asking. We have a hint to use log-spaced bins on the x-axis, so we should use np.logspace for this. I am new to statistics and need help choosing the start and end for this function, as I am also unfamiliar with how to choose the number of bins. – user11861166 Aug 11 '19 at 20:05
  • You want to create your bins using `np.logspace` and then pass them to your histogram. Something like `bins = np.logspace(np.min(data),np.max(data))` then `plt.hist(data,bins=bins)` – BenT Aug 11 '19 at 21:16
  • I am getting the following errors: RuntimeWarning: overflow encountered in power return _nx.power(base, y), RuntimeWarning: invalid value encountered in subtract a = op(a[slice1], a[slice2]), RuntimeWarning: invalid value encountered in multiply boffset = -0.5 * dr * totwidth * (1 - 1 / nx). My data contains both negatives and 0 as possible values. – user11861166 Aug 11 '19 at 21:34
  • Then you have answered your own question... You can't take the log of negative numbers or zero, so you need to use .01 for your `start` value, which means your histogram won't count those numbers. – BenT Aug 11 '19 at 21:55
  • Thanks, this helped me with the histogram portion. Now I am trying to add the KDE on top. I have the KDE separately using kde=df.data.plot.kde(). I need to combine these 2 types of visualization now. – user11861166 Aug 11 '19 at 22:24
  • I have referenced this thread already, and while it is helpful, I am having trouble changing the x-axis to a log scale when following along with that thread. – user11861166 Aug 11 '19 at 22:36
  • https://stackoverflow.com/questions/773814/plot-logarithmic-axes-with-matplotlib-in-python – BenT Aug 11 '19 at 23:46
  • If you are having additional problems please update your question and/or post a new question referencing your issue (assuming you can't find help in another SO post) – BenT Aug 11 '19 at 23:52
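
Pulling the comment thread together, one possible sketch of the combined plot follows. It rests on several assumptions that are not from the thread: non-positive values are simply dropped, the bin count of 30 is arbitrary, the KDE is a plain Gaussian KDE fitted to log10 of the data with Scott's rule-of-thumb bandwidth, and the helper names (`positive_only`, `log_bins`, `log_kde`, `plot_hist_with_kde`) are hypothetical. Note that `np.logspace` takes exponents, so the bin edges are built from `np.log10` of the data minimum and maximum; passing the raw minimum and maximum asks for 10 raised to those values, which overflows for data like column C.

```python
import numpy as np


def positive_only(values):
    """Keep finite, strictly positive values; log10 is undefined for the rest."""
    values = np.asarray(values, dtype=float).ravel()
    return values[np.isfinite(values) & (values > 0)]


def log_bins(values, n_bins=30):
    """Bin edges evenly spaced in log10 between the data minimum and maximum."""
    return np.logspace(np.log10(values.min()), np.log10(values.max()), n_bins + 1)


def log_kde(values, grid):
    """Gaussian KDE fitted to log10(values), returned as a density in x."""
    u = np.log10(values)
    # Scott's rule of thumb for the bandwidth (an assumption; tune as needed).
    h = u.std(ddof=1) * u.size ** (-1 / 5)
    z = (np.log10(grid)[:, None] - u[None, :]) / h
    dens_u = np.exp(-0.5 * z**2).sum(axis=1) / (u.size * h * np.sqrt(2 * np.pi))
    # Change of variables u = log10(x): f(x) = g(u) / (x * ln 10),
    # so the curve shares the histogram's "density per unit x" axis.
    return dens_u / (grid * np.log(10))


def plot_hist_with_kde(values, n_bins=30):
    """Density histogram on log-spaced bins with the KDE drawn on top."""
    # Imported here so the numeric helpers above work without matplotlib.
    import matplotlib.pyplot as plt

    data = positive_only(values)
    bins = log_bins(data, n_bins)
    grid = np.logspace(np.log10(data.min()), np.log10(data.max()), 200)

    fig, ax = plt.subplots()
    ax.hist(data, bins=bins, density=True, alpha=0.5, label="histogram")
    ax.plot(grid, log_kde(data, grid), label="KDE (fitted in log10 space)")
    ax.set_xscale("log")
    ax.set_xlabel("C")
    ax.set_ylabel("density")
    ax.legend()
    return ax
```

With the question's data this could be called as `plot_hist_with_kde(df['C'].dropna().values)` followed by `plt.show()`. The KDE helper builds a grid-by-samples array, which is manageable for 50k rows and 200 grid points; `scipy.stats.gaussian_kde` is an alternative if SciPy is available.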