
I was trying to plot a histogram of the data from a .csv file, but when I run it, it is very slow. I waited for about 20 minutes and still could not get the plot. What is the problem?

Here is my code:

import pandas as pd
import matplotlib.pyplot as plt

spy = pd.read_csv('SPY.csv')
stock_price_spy = spy.values[:, 5]

n, bins, patches = plt.hist(stock_price_spy, 50)
plt.show()

3 Answers


Sorry, but you are wrong: NumPy is far ahead in performance. NumPy arrays consume much less memory than Python lists or other pure-Python containers, and they are optimized for vector and matrix operations, not to mention their compatibility with matplotlib. You get roughly 75% less memory usage with better performance, and that is where the bottleneck resides, because otherwise everything has to be read and parsed into Python objects stored in a list.
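The memory claim above can be checked directly. A minimal sketch (the 75% figure depends on the element type; the array size here is arbitrary and just for illustration):

```python
import sys
import numpy as np

# Compare the memory footprint of one million floats stored as a
# Python list vs. a NumPy array.
n = 1_000_000
py_list = [float(i) for i in range(n)]
np_array = np.arange(n, dtype=np.float64)

# A Python list holds pointers to boxed float objects: the list's own
# pointer storage plus a ~24-byte object per element.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# A NumPy array stores raw 8-byte doubles contiguously.
array_bytes = np_array.nbytes

print(f"list : {list_bytes / 1e6:.1f} MB")
print(f"array: {array_bytes / 1e6:.1f} MB")  # 8.0 MB
```

The list comes out several times larger than the array's 8 MB of raw doubles, which is the overhead the answer is referring to.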

  • Since pandas stores its columns as numpy arrays, I honestly don't get your point here. But in any case this has nothing to do with the question being asked, which currently simply cannot be answered, because we don't know the file that is read in. – ImportanceOfBeingErnest Sep 09 '18 at 00:50

I did the following, and it seems to solve the problem.

It seems that "stock_price_spy = spy['Adj Close'].values" gives a true numpy ndarray.

import pandas as pd
import matplotlib.pyplot as plt

spy = pd.read_csv('SPY.csv')
stock_price_spy = spy['Adj Close'].values

plt.hist(stock_price_spy, bins=100, label='S&P 500 ETF', alpha=0.8)
plt.show()
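Why selecting the column first helps: slicing .values on the whole DataFrame forces a single common dtype across mixed columns, yielding a slow object array, while selecting one column keeps its native numeric dtype. A sketch with a hypothetical two-row frame standing in for SPY.csv (the real file is not available here):

```python
import numpy as np
import pandas as pd

# Hypothetical data: a string Date column next to a float price column,
# mimicking the layout of a Yahoo Finance csv.
spy = pd.DataFrame({
    'Date': ['2018-01-02', '2018-01-03'],
    'Adj Close': [268.77, 270.47],
})

# Slicing .values on the whole frame upcasts everything to a common
# dtype, producing an object array that matplotlib handles slowly:
mixed = spy.values[:, 1]
print(mixed.dtype)   # object

# Selecting the column first keeps its native numeric dtype:
clean = spy['Adj Close'].values
print(clean.dtype)   # float64
```

This dtype difference, rather than pandas vs. numpy as such, is the likely cause of the slow histogram in the question.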

In fact, you are using an inefficient way to achieve your goal; you need to use numpy to increase the performance.

import numpy as np
import matplotlib.pyplot as plt

# Load only the 6th column (index 5), matching spy.values[:, 5] in the
# question; this avoids parsing the other columns and cuts the memory
# bottleneck.
stock_price_spy = np.loadtxt('SPY.csv', dtype=float, delimiter=',',
                             skiprows=1, usecols=5)

n, bins, patches = plt.hist(stock_price_spy, 50)
plt.show()

I didn't test it, but it should work.

I also recommend using the optimized Python distribution from Intel; it is better at managing this kind of workload. Intel python distribution

Adding code for testing, because some fellows are trying to misinform and are missing real arguments: pandas uses DataFrames, which are dictionary-like, not numpy arrays, and numpy arrays are almost twice as fast.

import numpy as np
import pandas as pd
import random
import csv
import matplotlib.pyplot as plt
import time

# Create a random 6 x 4871 csv file, simulating the problem.
rows = 4871
fields = ['one', 'two', 'three', 'four', 'five', 'six']

with open('random.csv', 'w', newline='') as f:
    write_a_csv = csv.DictWriter(f, fieldnames=fields)
    write_a_csv.writeheader()  # needed so that skiprows=1 below skips a header, not data
    for i in range(rows):
        write_a_csv.writerow({field: random.random() for field in fields})

# time.clock() was removed in Python 3.8; use time.perf_counter() instead.
start_old = time.perf_counter()
spy = pd.read_csv('random.csv')
print(type(spy))
stock_price_spy = spy.values[:, 5]
n, bins, patches = plt.hist(stock_price_spy, 50)
plt.show()
end_old = time.perf_counter()
total_time_old = end_old - start_old
print(total_time_old)

start_new = time.perf_counter()
stock_price_spy_new = np.loadtxt('random.csv', dtype=float,
                                 delimiter=',', skiprows=1, usecols=5)
print(type(stock_price_spy_new))
# Here you have nothing but the 6th column of the csv, which cuts the
# memory bottleneck.
n, bins, patches = plt.hist(stock_price_spy_new, 50)
plt.show()
end_new = time.perf_counter()
total_time_new = end_new - start_new
print(total_time_new)
  • For some reason, previous comments here got deleted. Still it seems necessary to give this some context for future readers. While this answer does not answer the question, it also contains some flaws. Measuring the timing with a wall clock is always dangerous. Here, it will lead to the method measured first to always appear slower. If using a proper timing measurement one will find instead that for the setup chosen here, the pandas solution is faster. While plotting takes the same time, the difference comes from the read in, where the pandas solution outperforms numpy by a factor of ~2. – ImportanceOfBeingErnest Sep 10 '18 at 00:32
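The proper timing measurement mentioned in the comment can be done with the standard timeit module, which repeats each read many times and excludes the interactive plt.show() calls. A sketch (the file layout mirrors the generated random.csv above; absolute numbers will vary by machine):

```python
import csv
import random
import timeit

# Build a small random csv like the one in the answer: a header row
# followed by 4871 rows of six random floats.
with open('random.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['one', 'two', 'three', 'four', 'five', 'six'])
    for _ in range(4871):
        w.writerow([random.random() for _ in range(6)])

setup = "import pandas as pd; import numpy as np"

# Time only the read step, best of 3 runs of 10 reads each, no plotting.
t_pandas = min(timeit.repeat(
    "pd.read_csv('random.csv')", setup=setup, number=10, repeat=3))
t_numpy = min(timeit.repeat(
    "np.loadtxt('random.csv', delimiter=',', skiprows=1, usecols=5)",
    setup=setup, number=10, repeat=3))

print(f"pandas read_csv: {t_pandas:.4f} s")
print(f"numpy loadtxt:   {t_numpy:.4f} s")
```

Measured this way, each reader is timed in isolation, so neither method is penalized for running first or for the time the plot window stays open.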