I'm trying to load a Dask dataframe from a 30 GB CSV file and plot it as a 3D bar chart with matplotlib.
The problem is that the job has been running for days with no end in sight, stalling as soon as it reaches the 'color settings' portion of the code.
I tried restricting it to a limited number of rows, but Dask doesn't seem to allow positional row indexing, only column indexing.
So I repartitioned the dataframe and used the partition size to limit the row count. Even then, with only 100 rows it still takes days.
I doubt it genuinely takes days to compute color settings for 100 rows (it never even reaches the plotting portion),
so clearly I am doing something wrong.
Here is what the dataframe looks like:
Here is the code:
%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np
import matplotlib as mpl
from matplotlib import cm
import pandas as pd
# import ipynb.fs.full.EURUSD
from colorspacious import cspace_converter
from collections import OrderedDict
import os
import dask
import dask.array as da
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from memory_profiler import memory_usage
import memory_profiler
%load_ext memory_profiler
cmaps = OrderedDict()
df = dd.read_csv(r'G:\Forex stuff\ticks2\Forex\EURUSD_mt5_ticks.csv')
npart = round(len(df)/1000)
parted_df = df.repartition(npartitions=npart)
first_1000_rows = parted_df.partitions[0]
first_1000_rows.head(100)
ylist = df["Bid_Price"]
xlist = df["Date"]
zlist = df["Bid_Volume"]
xpos = xlist
ypos = ylist
num_elements = len(first_1000_rows)
zpos = np.zeros(num_elements)
dx = np.ones(num_elements)
dy = np.ones(num_elements)
dz = zlist
from dask.distributed import Client
client = Client("tcp://10.0.0.98:8786")
client.cluster
#color settings
cmap = cm.get_cmap('Spectral') # Get desired colormap - you can change this!
max_height = np.max(dz) # get range of colorbars so we can normalize
min_height = np.min(dz)
#scale each z to [0,1] by the range, and get their rgb values
rgba = [cmap((k-min_height)/(max_height-min_height)) for k in dz]
fig = plt.figure(figsize=(20, 20))
ax1 = fig.add_subplot(111, projection='3d')
ax1.bar3d(xpos, ypos, zpos, dx, dy, dz, color=rgba, zsort='average')
plt.show()
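To convince myself the colour maths itself is cheap, I ran the same step on 100 synthetic values held in plain NumPy (no Dask involved; the data below is fake, and I scale by the range so the values land in [0, 1]). It finishes instantly:

```python
import numpy as np
import matplotlib.pyplot as plt

# 100 fake Bid_Volume values standing in for one small slice of the real data.
dz = np.random.default_rng(0).uniform(0.0, 500.0, size=100)

cmap = plt.get_cmap('Spectral')
min_height = dz.min()
max_height = dz.max()
# Scale each value to [0, 1] by the range, then look up its RGBA colour.
rgba = [cmap((k - min_height) / (max_height - min_height)) for k in dz]
print(len(rgba))
```

So on materialised data this step is trivial, which is why I suspect the slowness is coming from Dask rather than from the colour mapping.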
I often get timeout and instability errors during the run. The Dask GUI shows that it is working, but it clearly isn't operating on only 100 rows, given how long it takes.
I'm fairly sure it is looping over the same tasks again and again, since in the GUI Dask dumps something like 10 GB of cluster data from RAM at the end of every cycle and then resets.
Any ideas how I can improve this? Much appreciated.