
When I use Dask to clean a number of .jsonl files, it raises errors saying that the column order is mismatched when I use a partition size of about 128 MB, but it works fine (if a little slow) when I use a size of 512 MB.

My code looks something like this:

import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client

client = Client(n_workers=1, threads_per_worker=4)

# Expected schema of each partition.
meta = dd.utils.make_meta([('x1', 'object'), ('date', 'int64'), ('x2', 'object')])

df = dd.read_json('*.jsonl', blocksize=2**27, meta=meta)  # 128 MB blocks
keep = ['item1', 'item2', 'item3']
df['x1'] = df.x1.str.lower()
df = df[df['x1'].isin(keep)]
df.to_csv('dask_file.csv', single_file=True)

When I run this code, I eventually get an error saying that the column order does not match.

However, when I run:

df = dd.read_json('*.jsonl', blocksize=2**29, meta=meta)  # 512 MB blocks
keep = ['item1', 'item2', 'item3']
df['x1'] = df.x1.str.lower()
df = df[df['x1'].isin(keep)]
df.to_csv('dask_file.csv', single_file=True)

It writes what I need, although it is quite a bit slower according to the progress display. Can anyone help?

Thanks.

NW12901

1 Answer


My guess is that, with the smaller blocksize, you end up with partitions that are empty after applying the `.isin(keep)` filter. If that is the case, you can remove the empty partitions using the function in this answer; a sketch of the idea follows below.
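As a rough sketch of that idea (not the exact function from the linked answer; `cull_empty_partitions` is a name chosen here for illustration): count the rows in each partition, then rebuild the dataframe from only the non-empty delayed partitions.

import dask.dataframe as dd

def cull_empty_partitions(ddf):
    # One pass over the data to count rows in each partition.
    lengths = ddf.map_partitions(len).compute()
    # Keep only the delayed objects for partitions with at least one row.
    parts = [part for part, n in zip(ddf.to_delayed(), lengths) if n > 0]
    # ddf._meta is the empty pandas DataFrame that defines the schema.
    return dd.from_delayed(parts, meta=ddf._meta)

# Usage: cull after filtering, before the single-file write.
# df = cull_empty_partitions(df)
# df.to_csv('dask_file.csv', single_file=True)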

However, in this situation I would prefer the dask.bag API, as it is well suited to working with JSON; see this example. If you want CSV output at the end, you can convert the bag to a dataframe with `.to_dataframe()`, as in the sketch below.
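For instance, a bag-based version of the code in the question might look roughly like this (the field names and dtypes are taken from the question's meta, and it assumes each line of the .jsonl files is a flat JSON object):

import json
import dask.bag as db

keep = {'item1', 'item2', 'item3'}

def normalize(rec):
    # Lower-case x1, mirroring the dataframe version.
    rec['x1'] = rec['x1'].lower()
    return rec

# Each line becomes one Python dict, so there is no column
# alignment between partitions until the very end.
bag = db.read_text('*.jsonl').map(json.loads).map(normalize)
bag = bag.filter(lambda rec: rec['x1'] in keep)

# Convert to a dataframe only for the CSV write.
df = bag.to_dataframe(meta={'x1': 'object', 'date': 'int64', 'x2': 'object'})
df.to_csv('dask_file.csv', single_file=True)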

SultanOrazbayev
  • By empty do you mean 0 rows? Because 0 rows did not throw off the code when I asked it to print the number of rows in each partition. It just suddenly died around the 170th partition. – NW12901 Apr 06 '21 at 14:21
  • Yes, I mean zero rows. Counting the number of rows in each partition probably doesn't involve checking column consistency, while `.to_csv` with the `single_file` option does. My guess is you would not see the error if you set `single_file=False` (but I'm not sure). – SultanOrazbayev Apr 06 '21 at 14:26
  • Alternatively, you can check the function in the first link of the updated answer; it will remove empty partitions, and your `.to_csv` should work with `single_file=True`. – SultanOrazbayev Apr 06 '21 at 14:28