
I have a large collection of data in a directory structure that corresponds to what sklearn.datasets.load_files expects. I want to load the dataset and fit a basic classification model. I thought something like this would suit the task:

import numpy as np
import sklearn.datasets
from sklearn.ensemble import RandomForestClassifier

dataset = sklearn.datasets.load_files("data", load_content='False')  # my dataset cannot be loaded into memory

model = RandomForestClassifier(n_estimators=100)
model.fit(dataset.data, dataset.target)

But I received an error:

ValueError: could not convert string to float: b'\x93NUMPY\x01\x00v\x00{\'descr\': \'<f8\', \'fortran_order\': False, \'shape\': (115000,), }                                                       \n\x00\x00\x00 \xf2zY?\x00\x00\x00\x00\xd8pp?\x00\x00\x00@6\xbc\x88?\x00\x00\x00@\xad9e?...' (output truncated)

Loading the files this way apparently does not handle NumPy (.npy) files. What options do we have?

I'm currently converting all the NumPy files to text files, but this triples or quadruples the volume of the data. Is there a better way to load the data so that I can train a simple model on feature vectors saved as NumPy files?
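For context, the conversion step I'm doing now looks roughly like this (just a sketch; the paths and the use of np.savetxt are only meant to illustrate what I mean by "converting to text"):

import numpy as np

# illustrative only: rewrite each binary .npy vector as a plain-text file
vec = np.load("data/some_class/sample_001.npy")         # binary float64, 8 bytes per value
np.savetxt("data_txt/some_class/sample_001.txt", vec)   # decimal text, roughly 3-4x larger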

Yanirmr
  • I will create a data loader generator that will load data in small chunks. With that you should use models that allow `partial_fit`, which you would train in batches. – Prayson W. Daniel Jul 26 '21 at 12:53
  • @PraysonW.Daniel Can you share the code for that, please? I assume it won't work for random forest, but XGBoost is fine. – Yanirmr Jul 26 '21 at 13:02
  • Yes, I can. Can you tell me more about your data? Is it in a DB? Can we load/read it in chunks with Pandas or NumPy? – Prayson W. Daniel Jul 26 '21 at 13:13

1 Answer


I would create a data loader generator that loads the data in small chunks. With that, you should use models that allow `partial_fit`, which you would then train in batches.

My flow would look like this:


import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier


# very large file
FILE_PATH = ...
FEATURES_COLUMNS = ...
TARGET_COLUMN = ...
CLASSES = ...          # the full set of target labels, known in advance
CHUNK_SIZE = 100_000


reader = pd.read_csv(FILE_PATH, chunksize=CHUNK_SIZE, low_memory=False)
clf = SGDClassifier(loss='log')  # a model that supports partial_fit

for batch_number, dataf_chunk in enumerate(reader, start=1):

    # logic to get X (features) and y (target) from the data chunk
    X, y = dataf_chunk[FEATURES_COLUMNS], dataf_chunk[TARGET_COLUMN]

    # splits to track model performance
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, random_state=42, stratify=y)

    # [custom] preprocessors that allow incremental updates, e.g. HashingVectorizer
    ...

    # model training per batch; classes must cover every label that can appear,
    # so pass the full set rather than np.unique of a single batch
    clf.partial_fit(X_train, y_train, classes=CLASSES)

    print(f"Batch number {batch_number} | Model scores: "
          f"Train score = {clf.score(X_train, y_train):.2%} | "
          f"Test score = {clf.score(X_test, y_test):.2%}")
    

There is an equivalent to chunked reading in NumPy too. See Working with big data in python and numpy, not enough ram, how to save partial results on disc?
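As a rough sketch of what memory-mapped chunk reading could look like (this assumes the vectors have been consolidated into a single features.npy plus a labels.npy that fits in RAM; the file names and batch size are made up):

import numpy as np
from sklearn.linear_model import SGDClassifier

# hypothetical consolidated files
X_mm = np.load("features.npy", mmap_mode="r")  # memory-mapped: nothing is read into RAM yet
y = np.load("labels.npy")                      # assumed small enough to keep in memory

CLASSES = np.unique(y)   # partial_fit needs the full set of labels up front
BATCH_SIZE = 10_000

clf = SGDClassifier(loss='log')
for start in range(0, X_mm.shape[0], BATCH_SIZE):
    X_batch = np.asarray(X_mm[start:start + BATCH_SIZE])  # only this slice is read from disk
    y_batch = y[start:start + BATCH_SIZE]
    clf.partial_fit(X_batch, y_batch, classes=CLASSES)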

Update:

I discovered that the scikit-learn documentation has a similar example using yield: example of partial training.
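For completeness, a minimal sketch of that yield pattern applied to the per-file .npy vectors from the question (one vector per file, labels aligned with dataset.filenames; the names are illustrative):

import numpy as np

def iter_minibatches(filenames, labels, batch_size=1_000):
    """Yield (X, y) mini-batches, loading only batch_size .npy files at a time."""
    for start in range(0, len(filenames), batch_size):
        chunk = filenames[start:start + batch_size]
        X = np.stack([np.load(path) for path in chunk])  # one feature vector per file
        y = labels[start:start + batch_size]
        yield X, y

# usage with the classifier above (dataset from load_files with load_content=False):
# for X_batch, y_batch in iter_minibatches(dataset.filenames, dataset.target):
#     clf.partial_fit(X_batch, y_batch, classes=np.unique(dataset.target))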

Prayson W. Daniel