How can I do dask_ml preprocessing on a Dask distributed cluster? My dataset is about 200 GB, and every time I categorize it in preparation for one-hot encoding, Dask appears to ignore the client and tries to load the whole dataset into the local machine's memory. Maybe I'm missing something:
from dask_ml.preprocessing import Categorizer, DummyEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import pandas as pd
import dask.dataframe as dd
df = dd.read_csv('s3://some-bucket/files*.csv', dtype={'column': 'category'})
pipe = make_pipeline(
    Categorizer(),
    DummyEncoder(),
    LogisticRegression(solver='lbfgs')
)
pipe.fit(df, y)  # y is the target, defined elsewhere (not shown)