
I am trying to plot a histogram of a large dataset of 3 million rows (I have 2 CPUs and 16 GB of RAM). Even though I provided bins, the plot never appears. Is there a more efficient way to plot a histogram? See the code below.

df0 = dd.read_csv(filename, sep="|", header=None, dtype=np.str, error_bad_lines=False, usecols=col0, quoting=3, encoding='ISO-8859-1')
dfs = df0[df0['DocumentTypeStndCode']=='D'].compute()
dfs['Price'] = dfs[pd.to_numeric(dfs['Price'], errors='coerce').notnull()]

sns.distplot(dfs['Price'], bins=[0, 10000, 200000, 400000, 2000000], kde=False)
plt.show()
Anna Ignashkina

1 Answer


This shouldn't be a problem for you. On my machine it takes a couple of seconds to generate the plots for 50 million rows. I tried the pandas hist method first:

import pandas as pd
import numpy as np
%matplotlib inline
df = pd.DataFrame({
  'values': np.random.beta(0.5, 0.1, size=50000000)
})
hist = df.hist(bins=10)

and the same in seaborn:

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.distplot(df['values'], bins=[0, .10000, .200000, .400000, 2.000000], kde=False)
plt.show()
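
If plotting itself ever does become the slow step, one option is to count the values per bin with NumPy first and draw only the resulting bars. The following is a minimal sketch, not part of the original answer: the sample size is reduced to 1 million for a quick run, and `plot_counts` is a hypothetical helper name. Note also that newer seaborn releases deprecate `distplot` in favour of `histplot`.

```python
import numpy as np

# Synthetic data, same distribution as above but smaller (assumption: 1M rows).
rng = np.random.default_rng(0)
values = rng.beta(0.5, 0.1, size=1_000_000)

# Explicit, unevenly spaced bin edges, as in the seaborn call above.
edges = np.array([0, 0.1, 0.2, 0.4, 2.0])

# Count the values in each bin once; only 4 numbers remain to be drawn.
counts, _ = np.histogram(values, bins=edges)


def plot_counts(counts, edges):
    """Draw pre-computed bin counts as bars (hypothetical helper)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # Each bar starts at its left edge and spans the width of its bin.
    ax.bar(edges[:-1], counts, width=np.diff(edges), align="edge", edgecolor="black")
    ax.set_xlabel("values")
    ax.set_ylabel("count")
    return fig
```

Calling `plot_counts(counts, edges)` followed by `matplotlib.pyplot.show()` displays the figure; the counting step is independent of the plotting library.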
James Natale
  • Yeah, I see from your example that it is not a matter of the package. Pretty weird, since I converted all the data in the column into numeric values. Any idea what could cause such a delay in execution? – Anna Ignashkina Jun 25 '18 at 16:35
  • Are you executing in an ipython notebook, or normal python execution? If the %matplotlib inline is not present, it doesn't show a graph in notebooks. Other than that, I have hit issues when I have multiple graphs. If they are popping up in single windows, you sometimes need to close them to move to the next. – James Natale Jun 25 '18 at 17:05
  • Actually, I am using dask in a normal Python environment, and I just found out it had somehow messed up my columns, so maybe this is the reason. Anyway, it's not a problem with the plotting engine. – Anna Ignashkina Jun 25 '18 at 17:20
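
One possible source of the column problem mentioned in the last comment is the line in the question that assigns a whole filtered DataFrame to `dfs['Price']` rather than the converted values, so `Price` is unlikely to end up numeric. A minimal sketch of the intended conversion, using a small made-up frame in place of the real data (the sample values are assumptions):

```python
import pandas as pd

# Hypothetical stand-in for the filtered data, with Price read as strings.
dfs = pd.DataFrame({"Price": ["100", "250000", "n/a", "abc", "1500000"]})

# Convert the column itself to numbers; unparseable entries become NaN.
dfs["Price"] = pd.to_numeric(dfs["Price"], errors="coerce")

# Drop the rows whose price could not be parsed.
dfs = dfs.dropna(subset=["Price"])
```

After this, `dfs['Price']` holds floats only and can be passed to the plotting call with the original bins.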