
We are trying to load the IDS-2018 dataset, which consists of 10 CSV files totaling 6.4 GB. When we try to concatenate all the CSV files on a server with 32 GB of RAM, the process crashes (it is killed).

We even tried reducing the memory footprint of the pandas DataFrame with the following function:


import numpy as np

def reduce_mem_usage(df):
    """Downcast each numeric column to the smallest dtype that holds its value range."""
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # pick the narrowest integer type that fits the observed range
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # pick the narrowest float type that fits the observed range
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

But it was of no use: the server still crashes while concatenating the CSV files with `pd.concat`. The whole code is here. How can we achieve this so that we can do further processing?
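In outline, the loading step is essentially the following (a simplified sketch; the actual file list and paths differ):

import glob
import pandas as pd

# Read every CSV into its own DataFrame, then concatenate them all at once.
# All ten intermediate frames stay alive while pd.concat copies the data,
# which is where memory usage peaks and the process gets killed.
csv_files = sorted(glob.glob('./data/CSVs/*.csv'))  # placeholder location
frames = [reduce_mem_usage(pd.read_csv(f)) for f in csv_files]
df = pd.concat(frames, ignore_index=True)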

Akash Tadwai
  • Are there any limitations on your server that limit the memory usage of a specific process? I'm wondering if it's not the machine running out of memory but the OS killing the process because it won't let it have more. – sedavidw May 06 '21 at 14:54
  • I don't think there are any such limitations; by the way, it's collapsing in a Kaggle kernel too. – Akash Tadwai May 06 '21 at 15:49
  • `reduce_mem_usage` is not working, since you only convert the values of the columns and assign the new values to the same memory. Also, the main difference between float64 and float32 is not the number range but the precision. – Daniel May 07 '21 at 06:17
  • @Daniel, can you suggest a fix for this issue? Isn't the memory released by Python's garbage collector when it's no longer used? – Akash Tadwai May 07 '21 at 11:54
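For context on the garbage-collection question: CPython frees a DataFrame once nothing references it, so the usual pattern is an explicit `del`, optionally followed by a forced collection (a minimal sketch; `frames` refers to the list from the sketch in the question above):

import gc

# Drop the last reference so the intermediate frames become collectable,
# then force the collector to run right away instead of waiting.
del frames
gc.collect()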

1 Answer


I would try the following:

  • Specifying column types on `read_csv` via the `dtype` argument.
  • Not creating 10 separate dataframes and relying on `del` to free them; instead, building up the result one file at a time, as below.
import numpy as np
import pandas as pd

data_files = [
    './data/CSVs/02-14-2018.csv',
    './data/CSVs/02-15-2018.csv',
    ... # a few more
]

# define dtypes
data_types = {
  "col_a": np.float64,
  ... # other types
}

# load the first file, then fold the remaining ones in one at a time, so
# at most one extra DataFrame is alive at any point
df = reduce_mem_usage(
    pd.read_csv(data_files[0], dtype=data_types, index_col=False)
)
for filename in data_files[1:]:
    df = pd.concat(
        [
            df,
            reduce_mem_usage(
                pd.read_csv(
                    filename,
                    dtype=data_types,
                    index_col=False,
                )
            ),
        ],
        ignore_index=True,
    )

This way you make sure the type inference is exactly what you need it to be, and you reduce the memory footprint. Also, if your data has categorical columns, which are usually encoded as strings in CSV files, you can greatly reduce the memory footprint by using the categorical dtype.
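For instance (a minimal sketch; "Protocol" and "Label" are stand-ins for whichever low-cardinality string columns the data actually has):

import pandas as pd

# 'category' stores one small integer code per row plus a single lookup
# table, instead of a full Python string object per row.
data_types = {
    "Protocol": "category",  # hypothetical column name
    "Label": "category",     # hypothetical column name
}
df = pd.read_csv("./data/CSVs/02-14-2018.csv", dtype=data_types, index_col=False)

# or convert an already-loaded string column in place:
df["Label"] = df["Label"].astype("category")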

GuillemB
  • As there are nearly 80 columns, I have directly used `dtypes_of_0 = d0.dtypes.to_dict()`. The method you posted just concats without creating new data frames, right? Does this make much difference? I'll test it soon and let you know if it works. Thanks! – Akash Tadwai May 06 '21 at 15:53
  • I just checked; the process is still being killed. I am out of ideas :( – Akash Tadwai May 06 '21 at 16:36
  • Have you checked memory consumption while running the script? Are you sure lack of memory is the problem? If that is the case, another layer of memory optimization would be to turn categorical columns into the categorical dtype. See https://stackoverflow.com/questions/39092067/pandas-dataframe-convert-column-type-to-string-or-categorical . You would have to hunt for those columns manually, but it can make a huge difference. – GuillemB May 06 '21 at 21:19
  • Yes, I have checked memory consumption using `free -h` while the script is running; the whole virtual memory is exhausted by the process, and the terminal becomes unresponsive after some time. Also, the original dtypes of all the columns are inferred as "object" when I load them through pandas. So how should I know manually whether there are any mixed dtypes in the columns, and write the exact dtype of each column? – Akash Tadwai May 07 '21 at 04:43
  • If everything is inferred as `object`, then `reduce_mem_usage` won't do anything for you. If you really have to load all of this, I would start small and load just a couple of columns at a time, say 5, using the `usecols` argument to `read_csv`. Then dump those 5 columns into a Parquet file, sort of faking a columnar store (see the sketch after these comments). Now you can at least look at your data and decide on dtype conversions. Great summary on that here: https://stackoverflow.com/questions/15891038/change-column-type-in-pandas . Also, why don't you have an `index_col=False` on `d0`? I've modified my answer. – GuillemB May 07 '21 at 06:03
  • Thanks!! It turns out the data was corrupted in some rows, with the header repeated, which made all the column dtypes "object" and caused the process to consume so much memory. After removing those rows and adding dtypes, it worked fine! Thanks! – Akash Tadwai May 07 '21 at 13:14
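A minimal sketch of the column-batch idea from the comments above (reusing `data_files` from the answer; the column names and output path are placeholders, and `to_parquet` needs pyarrow or fastparquet installed):

import pandas as pd

# Read only a handful of columns across all files, then dump the batch to
# a Parquet file, faking a columnar store. The full table is never in
# memory at once, and each batch can be inspected to decide on dtypes.
cols = ["col_a", "col_b", "col_c", "col_d", "col_e"]  # placeholder names
batch = pd.concat(
    (pd.read_csv(f, usecols=cols, index_col=False) for f in data_files),
    ignore_index=True,
)
batch.to_parquet("./data/parquet/batch_0.parquet")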