
I'm trying to load a dask dataframe from a 30 GB CSV file and plot it as a 3D bar chart using matplotlib.

The problem is that the task runs for days with no end in sight as soon as it reaches the 'color settings' portion of the code.

I have tried to limit it to a small number of rows from the dataframe, but dask doesn't seem to allow row indexing, only column indexing.

So I repartitioned the dataframe and used the partition size to limit the number of rows. However, even with only 100 rows it takes days.
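For reference, here is a minimal sketch of the kind of row limiting I was aiming for (same file path and column usage as the code below). As I understand it, `head()` reads only from the first partition and returns a plain in-memory pandas DataFrame, so it never scans the whole 30 GB file:

import dask.dataframe as dd

df = dd.read_csv(r'G:\Forex stuff\ticks2\Forex\EURUSD_mt5_ticks.csv')

# head() pulls a bounded sample out of the first partition and
# returns an in-memory pandas DataFrame, not a lazy dask object
sample = df.head(100)
print(type(sample), len(sample))  # <class 'pandas.core.frame.DataFrame'> 100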

I doubt that calculating color settings for 100 rows should take days (it never even reaches the plotting portion).

So clearly I am doing something wrong.

Here is what the dataframe looks like: [screenshot of the dataframe]

Here is the code:

%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np
import matplotlib as mpl
from matplotlib import cm
import pandas as pd
# import ipynb.fs.full.EURUSD
from colorspacious import cspace_converter
from collections import OrderedDict
import os
import dask
import dask.array as da
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from memory_profiler import memory_usage
import memory_profiler
%load_ext memory_profiler
cmaps = OrderedDict()


df = dd.read_csv(r'G:\Forex stuff\ticks2\Forex\EURUSD_mt5_ticks.csv')

# note: len() on a dask dataframe triggers a full pass over the CSV to count rows
npart = round(len(df)/1000)
parted_df = df.repartition(npartitions=npart)

first_1000_rows = parted_df.partitions[0]

first_1000_rows.head(100)

ylist = df["Bid_Price"]
xlist = df["Date"]
zlist = df["Bid_Volume"]

xpos = xlist
ypos = ylist

num_elements = len(first_1000_rows)

zpos = np.zeros(num_elements)
dx = np.ones(num_elements)
dy = np.ones(num_elements)
dz = zlist

from dask.distributed import Client

client = Client("tcp://10.0.0.98:8786")
client.cluster

#color settings
cmap = cm.get_cmap('Spectral') # Get desired colormap - you can change this!
max_height = np.max(dz)   # get range of colorbars so we can normalize
min_height = np.min(dz)
#scale each z to [0,1], and get their rgb values
rgba = [cmap((k-min_height)/max_height) for k in dz] 


fig = plt.figure(figsize=(20, 20))
ax1 = fig.add_subplot(111, projection='3d')
ax1.bar3d(xpos, ypos, zpos, dx, dy, dz, color=rgba, zsort='average')
plt.show()
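As an aside on the color section above: `(k - min_height) / max_height` only maps onto [0, 1] when `min_height` is 0; the usual min-max scaling divides by the range instead. A toy numpy check of the difference:

import numpy as np

dz = np.array([5.0, 10.0, 20.0])                 # toy volumes
bad = (dz - dz.min()) / dz.max()                 # tops out at 0.75, never reaches 1.0
good = (dz - dz.min()) / (dz.max() - dz.min())   # spans the full [0, 1] range
print(bad)   # [0.   0.25 0.75]
print(good)  # [0.         0.33333333 1.        ]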

I often get timeout and instability errors during the run, but the dask GUI shows that it is working. Still, it clearly isn't operating on only 100 rows, or it wouldn't take this long.

I am fairly sure it is looping the same tasks over and over, since the GUI shows dask dumping about 10 GB of cluster data from RAM at the end of every cycle and then resetting.

Any ideas how I can improve? Appreciated.

  • **[Don't Post Screenshots](https://meta.stackoverflow.com/questions/303812/)**. Please see [How to provide a reproducible copy of your DataFrame using `df.head(30).to_clipboard(sep=',')`](https://stackoverflow.com/q/52413246/7758804), then **[edit] your question**, and paste the clipboard into a code block. Always provide a [mre] **with code, data, errors, current output, and expected output, as [formatted text](https://stackoverflow.com/help/formatting)**. If relevant, plot images are okay. – Trenton McKinney Aug 11 '21 at 18:18
  • It doesn't seem like your dask client is doing anything? Is this your whole code? – Kaia Aug 12 '21 at 00:13
  • In any case, `first_1000_rows.head(100)` doesn't modify in-place, it returns a new data frame with the top 100 rows. You want something like `first_100_rows = first_1000_rows.head(100)`. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html for more – Kaia Aug 12 '21 at 00:14
  • Yes, sorry I forgot about that. However the calculations are taking an enormous amount of time to complete regardless. If you scroll down in the code window you will see I have enabled the cluster towards the end of the code. I have attempted many variations but nothing seems to work. – coinmaster Aug 12 '21 at 19:07

1 Answer


These lines use the original df, not the repartitioned one; you can check the size of these series to confirm:

ylist = df["Bid_Price"]
xlist = df["Date"]
zlist = df["Bid_Volume"]
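For instance, checking one of them (note that this length check itself scans the whole CSV, which hints at where the time is going):

print(len(ylist))  # counts every row in the 30 GB file, not just 100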

You want this:

ylist = parted_df["Bid_Price"]
xlist = parted_df["Date"]
zlist = parted_df["Bid_Volume"]
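Building on that, here is a minimal end-to-end sketch, assuming the same file path and column names as the question; `head()` returns an in-memory pandas DataFrame, and the Date parsing is an assumption about its format. The point is that the color loop then runs over plain numpy values instead of a lazy dask series:

import numpy as np
import pandas as pd
import dask.dataframe as dd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, registers the '3d' projection

df = dd.read_csv(r'G:\Forex stuff\ticks2\Forex\EURUSD_mt5_ticks.csv')

# head() reads only the first partition and returns a pandas DataFrame
sample = df.head(100)

# bar3d needs numeric coordinates; parsing Date here assumes it is a
# datetime-like string (hypothetical, adjust to the actual format)
xpos = mdates.date2num(pd.to_datetime(sample["Date"]))
ypos = sample["Bid_Price"].to_numpy()
dz = sample["Bid_Volume"].to_numpy(dtype=float)

num_elements = len(sample)
zpos = np.zeros(num_elements)
dx = np.ones(num_elements)
dy = np.ones(num_elements)

# min-max scale dz to [0, 1] on in-memory values, then look up rgba colors
cmap = cm.get_cmap('Spectral')
rgba = cmap((dz - dz.min()) / (dz.max() - dz.min()))

fig = plt.figure(figsize=(20, 20))
ax1 = fig.add_subplot(111, projection='3d')
ax1.bar3d(xpos, ypos, zpos, dx, dy, dz, color=rgba, zsort='average')
plt.show()

With everything materialized in memory before the color step, that part becomes a handful of numpy operations rather than lazy dask work over the whole file.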
  • Hmmmm. Maybe you're right. I had assumed the num_elements would split it up for me. I had spent so much time looking for syntax for row indexing that I never considered that possibility. I'll try it and let you know :) – coinmaster Aug 11 '21 at 19:06
  • I don't know what that did exactly, but the process to complete it would have taken like a month based on what the GUI was telling me. Back to the drawing board I think? – coinmaster Aug 11 '21 at 19:35