
I have a huge dataset with almost 300k rows. I want to split it into two halves of 150k rows each. Is it possible to do that using dask?

Amzzz
  • Just create a `dask` dataframe. Then use `numpy` to split the df into two. Check [`this`](https://stackoverflow.com/questions/41624241/pandas-split-dataframe-into-two-dataframes-at-a-specific-row). – Mayank Porwal Mar 23 '21 at 05:15
  • It's just selecting some columns, right? I want to select rows. I have already tried `df1 = df.iloc[:72, :]; df2 = df.iloc[72:, :]`. – Amzzz Mar 23 '21 at 05:22
  • Check [`this`](https://stackoverflow.com/questions/17315737/split-a-large-pandas-dataframe). This splits on rows. – Mayank Porwal Mar 23 '21 at 05:23
  • But it's not working. I get this error: `NotImplementedError: 'DataFrame.iloc' only supports selecting columns. It must be used like 'df.iloc[:, column_indexer]'`. – Amzzz Mar 23 '21 at 05:25
  • Check my 2nd link. Use `np.array_split`, not `iloc`. – Mayank Porwal Mar 23 '21 at 05:26
  • It's also showing an error: `TypeError: unsupported operand type(s) for divmod(): 'Delayed' and 'int'`. – Amzzz Mar 23 '21 at 05:27
  • You can also use `df = df.repartition(divisions=2)` to have the dataframe split in 2 equally sized partitions. See [the API docs](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.repartition) for more. – ddavis Mar 23 '21 at 15:01
  • @ddavis : this is the real answer here; you could flesh it out with an example. It is better to have answers than comments. OP : you should always show how you set up your problem, what you tried, and in what manner it is failing. – mdurant Mar 23 '21 at 18:23

1 Answer


Dask DataFrames are built around partitions. Take the example where your whole 300k-row dataset lives in some number of CSV files in a directory called /path/; reading them and repartitioning will give you two partitions of 150k rows each:

import dask.dataframe as dd
df = dd.read_csv("/path/*.csv").repartition(npartitions=2)

If you already have your dataframe from some previous work, you can of course just use

df = df.repartition(npartitions=2)

Going back to the CSV example: if you have exactly two CSV files of equal size, the call to repartition is redundant. Keep in mind that repartitioning adds steps to the Dask task graph, so you may want to weigh the performance cost/benefit of repartitioning a dataframe after initializing it. Check out the diagnostics part of the documentation for more.

ddavis
  • How can I save the partitions as separate csv files? – Amzzz Mar 24 '21 at 08:18
  • I'm getting errors like this while trying to save the df as a csv file: `Skipping line 4: ',' expected after '"'` ... `Skipping line 41223: unexpected end of data` ... followed by a `MemoryError`. – Amzzz Mar 24 '21 at 09:28
  • See the [API documentation for `dask.dataframe.to_csv`](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.to_csv), Dask will save the partitions into separate csv files by default if you use a wildcard (`*`) in the output name. – ddavis Mar 24 '21 at 18:08