When I try to use Dask to clean a number of JSONL files, it gives me errors saying that the column order is mismatched if I use a partition size of around 128 MB, but it works fine, if a little slowly, when I use a size of 512 MB.
My code looks something like this:
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client
client = Client(n_workers=1, threads_per_worker=4)
meta = dd.utils.make_meta([('x1', 'object'), ('date', 'int64'), ('x2', 'object')])
df = dd.read_json('*.jsonl', blocksize=2**27, meta=meta)  # 2**27 bytes = 128 MB partitions
keep = ['item1', 'item2', 'item3']
df['x1'] = df.x1.str.lower()
df = df[df['x1'].isin(keep)]
df.to_csv('dask_file.csv', single_file=True)
When I run this code, I eventually get an error saying that the order of the columns does not match.
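In case it's relevant, here is a rough way I could check whether the records actually disagree on key order. This is just a minimal sketch using only the standard library, and it assumes the files are plain line-delimited JSON that are small enough to scan in full:

import glob
import json

# Collect every distinct key order that appears across all records
orders = set()
for path in glob.glob('*.jsonl'):
    with open(path) as f:
        for line in f:
            if line.strip():
                orders.add(tuple(json.loads(line).keys()))

print(orders)  # more than one tuple here means the records disagree on key order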
However, when I run:
df = dd.read_json('*.jsonl', blocksize=2**29, meta=meta)  # 2**29 bytes = 512 MB partitions
keep = ['item1', 'item2', 'item3']
df['x1'] = df.x1.str.lower()
df = df[df['x1'].isin(keep)]
df.to_csv('dask_file.csv', single_file=True)
It writes what I need, although it is quite a bit slower when I watch the progress. Can anyone help?
Thanks.