
I have a huge database (of 500GB or so) and was able to put it in pandas. The database contains something like 39705210 observations. As you can imagine, Python has a hard time even opening it. Now, I am trying to use Dask in order to export it to CSV in 20 partitions like this:

import dask.dataframe as dd
dask_merge_bodytextknown5 = dd.from_pandas(merge_bodytextknown5, npartitions=20)  # Dask DataFrame has 20 partitions

dask_merge_bodytextknown5.to_csv('df_complete_emakg_*.csv')
#merge_bodytextknown5.to_csv('df_complete.zip', compression={'method': 'zip', 'archive_name': 'df_complete_emakg.csv'})

However, when I try to drop some of the rows, e.g. by doing:

merge_bodytextknown5.drop(merge_bodytextknown5.index[merge_bodytextknown5['confscore'] == 3], inplace = True)

the kernel suddenly stops. So my questions are:

  1. is there a way to drop the desired rows using Dask (or another approach that prevents the kernel from crashing)?
  2. do you know a way to lighten the dataset or work with it in Python (e.g. computing some basic descriptive statistics in parallel) other than dropping observations?
  3. do you know a way to export the pandas DataFrame as a CSV in parallel, without saving the n partitions separately (as Dask does)?

Thank you

  • Unless I'm missing something... are you basically wanting a DF (or file) where `conf_score == 3` ? – Jon Clements Jun 29 '22 at 14:53
  • Nope, one in which I delete the rows where confscore==3 or confscore==4, without the kernel stopping. – Lusian Jun 29 '22 at 15:03
  • Okay... so to check... you want to take a 500gb CSV file... exclude certain conf_score's and produce a subset CSV file... or....? Also - where is the DB coming from... is it on disk... in an RDBMS or similar or...? – Jon Clements Jun 29 '22 at 15:11
  • yes exactly (remember that I already have the pandas version of it, called merge_bodytextknown5). A subset where observations having confscore==3 or confscore==4 are excluded. Furthermore, if you also have suggestions on points 2. and 3., they would be very appreciated. Thanks! – Lusian Jun 29 '22 at 15:12
  • What work (if any) do you want to do in pandas? Almost seems like a simpler route of just reading directly as a CSV would be better... – Jon Clements Jun 29 '22 at 15:17
  • In pandas, I would like to take a subset where observations having confscore==3 or confscore==4 are excluded. The problem is that if I do it as above (i.e. merge_bodytextknown5.drop(merge_bodytextknown5.index[merge_bodytextknown5['confscore'] == 3], inplace = True)) the kernel stops. – Lusian Jun 29 '22 at 15:27
  • let's say operation succeeded, then what comes next? – Paul H Jun 29 '22 at 16:43
  • Then I would like to lighten the database and compute descriptive stats on it. But for the moment I would be very happy to make it an easy-to-handle database. Every operation takes a long time to carry out. I was wondering whether there is something like compress in Stata. – Lusian Jun 29 '22 at 17:17
  • See [this post](https://stackoverflow.com/a/69153327/3888719) for some suggestions for speeding up pandas and dask csv parsing performance. Best bet is to not store as csv. Parquet or another columnar format would probably be a good choice. – Michael Delgado Jun 30 '22 at 06:47
  • @MichaelDelgado just a quick question related to that: does saving objects as strings save memory? I have noticed that strings are stored as objects in my df, so I wondered whether something like df["col"] = df["col"].astype("|S") for each column (col) stored as an "object" might help in reducing memory usage. Thank you. – Lusian Jun 30 '22 at 09:36
  • No, the opposite is true. Because CSVs don't include type information, pandas and dask have to guess at the types of the columns when reading. This consumes lots of memory. But numeric columns are much faster to work with and tend to have smaller memory footprints. You should parse your types and convert your data from CSV to a binary format such as Parquet once and then use the binary data from then on if possible. – Michael Delgado Jun 30 '22 at 14:35
  • Columns show up as objects in pandas when type inference fails (not all values can be interpreted as a numeric type). Give that post I sent a good read - you really need to watch out for this issue if your data is this large. Check out the section on specifying static types and nan handling. – Michael Delgado Jun 30 '22 at 14:41
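
As suggested in the comments above, a minimal sketch of reading the CSV once with explicit dtypes and converting it to Parquet might look like the following (the dtype and the output path are assumptions, not taken from the actual data):

import dask.dataframe as dd

# read the CSV once, spelling out dtypes to avoid costly or incorrect type inference
# ('confscore' is the column from the question; 'int8' is an assumed dtype)
ddf = dd.read_csv('df_complete_emakg_*.csv', dtype={'confscore': 'int8'})

# persist to a typed, compressed, columnar format (requires pyarrow or fastparquet)
ddf.to_parquet('df_complete_emakg_parquet/')

# later sessions can load the binary data much faster than re-parsing the CSV
ddf = dd.read_parquet('df_complete_emakg_parquet/')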

1 Answer


Dask DataFrames do not support the inplace kwarg, since each partition and the subsequent operations are delayed/lazy. However, just as in pandas, it is possible to assign the filtered result back to the same variable:

df = merge_bodytextknown5  # this line is for easier readability
mask = df['confscore'] != 3  # note the inversion of the requirement

df = df[mask]

If there are multiple conditions, the mask can be redefined, for example to exclude two values:

mask = ~df['confscore'].isin([3,4])

Dask will keep track of the operations but, crucially, will not launch any computation until it is requested/needed. For example, the syntax to save a CSV file is very much pandas-like:

df.to_csv('test.csv', index=False, single_file=True) # this saves to a single file

df.to_csv('test_*.csv', index=False) # this saves one file per dask dataframe partition
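
Nothing is actually computed until a result is requested, e.g. by writing a file as above or by calling .compute(). For the descriptive statistics mentioned in the question, a minimal sketch (reusing the filtered df from above) could be:

stats = df.describe().compute()  # summary statistics, computed in parallel across partitions
counts = df['confscore'].value_counts().compute()  # counts of the remaining confscore values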