I have a huge dataset (around 500 GB) that I was able to load into pandas. The dataset contains something like 39,705,210 observations. As you can imagine, Python struggles even to open it. Now I am trying to use Dask to export it to CSV in 20 partitions, like this:
import dask.dataframe as dd
dask_merge_bodytextknown5 = dd.from_pandas(merge_bodytextknown5, npartitions=20) # Dask DataFrame has 20 partitions
dask_merge_bodytextknown5.to_csv('df_complete_emakg_*.csv')
# earlier attempt: exporting directly from pandas to a zipped CSV
#merge_bodytextknown5.to_csv('df_complete.zip', compression={'method': 'zip', 'archive_name': 'df_complete_emakg.csv'})
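For reference, a minimal self-contained version of this export (with a small toy frame and made-up file names standing in for the real data) behaves the same way and writes one CSV file per partition:

import pandas as pd
import dask.dataframe as dd

# toy stand-in for the real merged frame; only the shape of the workflow matters
toy = pd.DataFrame({'confscore': [1, 2, 3, 4] * 5, 'value': range(20)})

toy_ddf = dd.from_pandas(toy, npartitions=4)  # 4 partitions instead of 20
toy_ddf.to_csv('toy_emakg_*.csv')             # writes toy_emakg_0.csv ... toy_emakg_3.csv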
However, when I try to drop some of the rows, e.g. by doing:
merge_bodytextknown5.drop(merge_bodytextknown5.index[merge_bodytextknown5['confscore'] == 3], inplace=True)
the kernel suddenly crashes (a small self-contained reproduction of this step is included after the questions below). So my questions are:
- Is there a way to drop the desired rows using Dask (or another approach that prevents the kernel from crashing)?
- Is there a way to lighten the dataset, or otherwise work with it in Python (e.g. computing some basic descriptive statistics in parallel), other than dropping observations?
- Is there a way to export the pandas DataFrame to CSV in parallel without saving the n partitions as separate files (as Dask does)?
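For completeness, here is a small self-contained reproduction of the drop step on a toy frame (confscore is the real column name; the values and sizes are made up). The same call crashes the kernel on the full data:

import pandas as pd

# toy frame with the column I filter on; the real one has ~39.7 million rows
toy = pd.DataFrame({'confscore': [1, 2, 3, 3, 4, 3], 'value': range(6)})

# drop every row where confscore == 3, in place (works here, dies on the full dataset)
toy.drop(toy.index[toy['confscore'] == 3], inplace=True)
print(toy)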
Thank you