
I was trying to plot a histogram of the data from a .csv file, but when I run it, it is very slow. I waited for about 20 minutes and still could not get the plot. What is the problem?

Here is my code:

import pandas as pd
import matplotlib.pyplot as plt

spy = pd.read_csv('SPY.csv')
stock_price_spy = spy.values[:, 5]

n, bins, patches = plt.hist(stock_price_spy, 50)
plt.show()

3 Answers


Sorry, but you are wrong: NumPy is far ahead in performance. NumPy arrays consume much less memory than Python lists or other pure-Python containers, and they are optimized for vector and matrix operations, not to mention their compatibility with matplotlib. You get roughly 75% less memory usage with better performance, and that is where the bottleneck resides, because otherwise everything has to be read and parsed into Python objects stored in a list.
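The memory claim above can be checked directly. A minimal sketch (the 75% figure depends on the element type; the array size here is arbitrary and just for illustration):

```python
import sys
import numpy as np

# Compare the memory footprint of one million floats stored as a
# Python list vs. a NumPy array.
n = 1_000_000
py_list = [float(i) for i in range(n)]
np_array = np.arange(n, dtype=np.float64)

# A Python list holds pointers to boxed float objects: the list's own
# pointer storage plus a ~24-byte object per element.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# A NumPy array stores raw 8-byte doubles contiguously.
array_bytes = np_array.nbytes

print(f"list : {list_bytes / 1e6:.1f} MB")
print(f"array: {array_bytes / 1e6:.1f} MB")  # 8.0 MB
```

The list comes out several times larger than the array's 8 MB of raw doubles, which is the overhead the answer is referring to.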

  • Since pandas stores its columns as numpy arrays, I honestly don't get your point here. But in any case this has nothing to do with the question being asked, which currently simply cannot be answered, because we don't know the file that is read in. – ImportanceOfBeingErnest Sep 09 '18 at 00:50

I did the following, and it seems to solve the problem.

It seems that "stock_price_spy = spy['Adj Close'].values" gives a true numpy ndarray.

import pandas as pd
import matplotlib.pyplot as plt

spy = pd.read_csv('SPY.csv')
stock_price_spy = spy['Adj Close'].values

plt.hist(stock_price_spy, bins=100, label='S&P 500 ETF', alpha=0.8)
plt.show()
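Why selecting the column first helps: slicing .values on the whole DataFrame forces a single common dtype across mixed columns, yielding a slow object array, while selecting one column keeps its native numeric dtype. A sketch with a hypothetical two-row frame standing in for SPY.csv (the real file is not available here):

```python
import numpy as np
import pandas as pd

# Hypothetical data: a string Date column next to a float price column,
# mimicking the layout of a Yahoo Finance csv.
spy = pd.DataFrame({
    'Date': ['2018-01-02', '2018-01-03'],
    'Adj Close': [268.77, 270.47],
})

# Slicing .values on the whole frame upcasts everything to a common
# dtype, producing an object array that matplotlib handles slowly:
mixed = spy.values[:, 1]
print(mixed.dtype)   # object

# Selecting the column first keeps its native numeric dtype:
clean = spy['Adj Close'].values
print(clean.dtype)   # float64
```

This dtype difference, rather than pandas vs. numpy as such, is the likely cause of the slow histogram in the question.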

In fact, you are using an inefficient way to achieve your goal; you need to use numpy to increase the performance.

import numpy as np
import matplotlib.pyplot as plt

# Load only the 6th column (index 5), matching spy.values[:, 5] in the
# question; this avoids parsing the other columns and cuts the memory
# bottleneck.
stock_price_spy = np.loadtxt('SPY.csv', dtype=float, delimiter=',',
                             skiprows=1, usecols=5)

n, bins, patches = plt.hist(stock_price_spy, 50)
plt.show()

I didn't test it, but it should work.

I also recommend using the optimized Python distribution from Intel; it is better at managing this kind of workload. Intel python distribution

Adding code for testing, because some fellows are trying to misinform and are missing real arguments: pandas uses DataFrames, which are dictionary-like, not numpy arrays, and numpy arrays are almost twice as fast.

import numpy as np
import pandas as pd
import random
import csv
import matplotlib.pyplot as plt
import time

# Create a random 6 x 4871 csv file, simulating the problem.
rows = 4871
fields = ['one', 'two', 'three', 'four', 'five', 'six']

with open('random.csv', 'w', newline='') as f:
    write_a_csv = csv.DictWriter(f, fieldnames=fields)
    write_a_csv.writeheader()  # needed so that skiprows=1 below skips a header, not data
    for i in range(rows):
        write_a_csv.writerow({field: random.random() for field in fields})

# time.clock() was removed in Python 3.8; use time.perf_counter() instead.
start_old = time.perf_counter()
spy = pd.read_csv('random.csv')
print(type(spy))
stock_price_spy = spy.values[:, 5]
n, bins, patches = plt.hist(stock_price_spy, 50)
plt.show()
end_old = time.perf_counter()
total_time_old = end_old - start_old
print(total_time_old)

start_new = time.perf_counter()
stock_price_spy_new = np.loadtxt('random.csv', dtype=float,
                                 delimiter=',', skiprows=1, usecols=5)
print(type(stock_price_spy_new))
# Here you have nothing but the 6th column of the csv, which cuts the
# memory bottleneck.
n, bins, patches = plt.hist(stock_price_spy_new, 50)
plt.show()
end_new = time.perf_counter()
total_time_new = end_new - start_new
print(total_time_new)
  • For some reason, previous comments here got deleted. Still it seems necessary to give this some context for future readers. While this answer does not answer the question, it also contains some flaws. Measuring the timing with a wall clock is always dangerous. Here, it will lead to the method measured first to always appear slower. If using a proper timing measurement one will find instead that for the setup chosen here, the pandas solution is faster. While plotting takes the same time, the difference comes from the read in, where the pandas solution outperforms numpy by a factor of ~2. – ImportanceOfBeingErnest Sep 10 '18 at 00:32
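The proper timing measurement mentioned in the comment can be done with the standard timeit module, which repeats each read many times and excludes the interactive plt.show() calls. A sketch (the file layout mirrors the generated random.csv above; absolute numbers will vary by machine):

```python
import csv
import random
import timeit

# Build a small random csv like the one in the answer: a header row
# followed by 4871 rows of six random floats.
with open('random.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['one', 'two', 'three', 'four', 'five', 'six'])
    for _ in range(4871):
        w.writerow([random.random() for _ in range(6)])

setup = "import pandas as pd; import numpy as np"

# Time only the read step, best of 3 runs of 10 reads each, no plotting.
t_pandas = min(timeit.repeat(
    "pd.read_csv('random.csv')", setup=setup, number=10, repeat=3))
t_numpy = min(timeit.repeat(
    "np.loadtxt('random.csv', delimiter=',', skiprows=1, usecols=5)",
    setup=setup, number=10, repeat=3))

print(f"pandas read_csv: {t_pandas:.4f} s")
print(f"numpy loadtxt:   {t_numpy:.4f} s")
```

Measured this way, each reader is timed in isolation, so neither method is penalized for running first or for the time the plot window stays open.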