
Number of files: 894, total file size: 22.2 GB. I have to do machine learning by reading many CSV files, but there is not enough memory to read them all at once.

HOON
  • What sort of machine learning? Many algorithms, particularly neural-network-based models, use batch learning, so they don't need to load the entire file into memory and generally don't use pandas. – David Waterworth Apr 06 '22 at 02:09
  • I'm going to use a supervised learning algorithm. – HOON Apr 06 '22 at 02:30
  • I need to train on 894 files, but I am thinking about how to load them efficiently. – HOON Apr 06 '22 at 02:31
  • Yes, but there are loads of different supervised learning algorithms; that's a generic term. If you don't have enough memory to load the entire dataset, then you need to pick one that implements batched data loading. – David Waterworth Apr 06 '22 at 02:42
  • To improve your question: surely your task is not to *read* your files. What is actually your task? – mdurant Apr 06 '22 at 16:41

2 Answers


Specifically, to load a large number of files that do not fit in memory, one can use dask:

import dask.dataframe as dd

# Lazily points at every matching CSV file; no data is read yet.
df = dd.read_csv('file-*.csv')

This will create a lazy version of the data, meaning the data will be loaded only when requested; e.g. df.head() will only load the data needed for the first 5 rows. Where possible, pandas syntax applies to dask DataFrames.
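For example, you can chain pandas-style operations and only materialise the result when you call .compute(). This is a minimal sketch; the column names 'category' and 'value' are made up for illustration:

import dask.dataframe as dd

# Lazily refer to all the CSV files; nothing is read yet.
df = dd.read_csv('file-*.csv')

# pandas-style operations only build up a task graph;
# 'category' and 'value' are hypothetical column names.
means = df[df['value'] > 0].groupby('category')['value'].mean()

# Only now are the files actually read and processed, chunk by chunk.
result = means.compute()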

For machine learning you can use dask-ml, which has tight integration with sklearn; see the docs.
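A rough sketch of that integration, assuming the CSVs have a label column called 'target' (a made-up name) and numeric feature columns:

import dask.dataframe as dd
from dask_ml.model_selection import train_test_split
from dask_ml.linear_model import LogisticRegression

df = dd.read_csv('file-*.csv')

# 'target' is a hypothetical label column; everything else is treated as a feature.
X = df.drop(columns='target').to_dask_array(lengths=True)
y = df['target'].to_dask_array(lengths=True)

X_train, X_test, y_train, y_test = train_test_split(X, y)

# dask-ml estimators follow the familiar sklearn fit/score API
# but work on dask arrays without loading everything into memory.
model = LogisticRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))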

SultanOrazbayev

You can read your files in chunks, but chunking alone does not solve the training phase: you have to select an algorithm that can learn from your files incrementally. However, having such big files for model training usually means you have to do some data preparation first, which will reduce the size of the files significantly.
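If you do go the chunked route, one option is an estimator that supports incremental (out-of-core) learning. A minimal sketch using pandas' chunksize together with scikit-learn's partial_fit; the file pattern, chunk size, label column 'target' and choice of SGDClassifier are all assumptions for illustration:

import glob
import pandas as pd
from sklearn.linear_model import SGDClassifier

# An estimator with partial_fit can be updated one chunk at a time.
model = SGDClassifier()
classes = [0, 1]  # every possible label must be known up front

for path in glob.glob('file-*.csv'):
    # Read each CSV in pieces that fit comfortably in memory.
    for chunk in pd.read_csv(path, chunksize=100_000):
        X = chunk.drop(columns='target')  # 'target' is a hypothetical label column
        y = chunk['target']
        model.partial_fit(X, y, classes=classes)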

Esraa Abdelmaksoud